How to Better Parse CSV file with Pandas

Question

I am working with a CSV data sheet and want to parse and filter its data. While writing the code, I found a similar snippet in another SO post whose author has almost the same HPE hardware data; my data is similar, but the columns differ.

I want to know how this code could be written in a better, more performant way. Any help will be much appreciated.

Sample Data:

Status	Server	Server Name	Bay #	Model	Processor	Proc. Count	Memory	Serial Number	State	Power State	iLO FW	Firmware	Appliance Name
Critical	enc2010, bay 1	tdm2066.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	262144	2M272101N9	Unmanaged	On	2.53 May 03 2017	I36 v2.40 (02/17/2017)	OV C7000 enclosures 1
OK	enc1011, bay 1	tdm1068.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	262144	2M272101P6	Monitored	On	2.55 Aug 16 2017	I36 v2.74 (07/21/2019)	OV C7000 enclosures 1
OK	enc1012, bay 1	tdm1083.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	262144	2M272101NX	Monitored	On	2.61 Jul 27 2018	I36 v2.60 (05/21/2018)	OV C7000 enclosures 1
OK	ENC2004, bay 1	tdm2033.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz	2	524288	2M262602L2	Monitored	On	2.55 Aug 16 2017	I36 v2.52 (10/25/2017)	OV C7000 enclosures 1
OK	ENC2006, bay 1	vds2009	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	524288	2M263604ZZ	Monitored	On	2.40 Dec 02 2015	I36 v2.20 (05/05/2016)	OV C7000 enclosures 1
OK	ENC2011, bay 1	tdm2081.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	524288	2M2708027Z	Monitored	On	2.55 Aug 16 2017	I36 v2.52 (10/25/2017)	OV C7000 enclosures 1
OK	ENC1003, bay 1	tdm1024.example.com	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz	2	524288	2M262602KW	Monitored	On	2.73 Feb 11 2020	I36 v2.52 (10/25/2017)	OV C7000 enclosures 1
OK	ENC1006, bay 1	vds1009	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz	2	524288	2M262505V5	Monitored	On	2.40 Dec 02 2015	I36 v2.00 (12/28/2015)	OV C7000 enclosures 1
OK	ENC1007, bay 1	vds1023	1	ProLiant BL460c Gen9	Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz	2	524288	2M264800TR	Monitored	On	2.50 Sep 23 2016	I36 v2.30 (09/12/2016)	OV C7000 enclosures 1

How the DataFrame looks:

   Server Name         Appliance Name    Bay Enclosure
0      tdm2066  OV C7000 enclosures 1  bay 1   ENC2010
1      tdm1068  OV C7000 enclosures 1  bay 1   ENC1011
2      tdm1083  OV C7000 enclosures 1  bay 1   ENC1012
3      tdm2033  OV C7000 enclosures 1  bay 1   ENC2004
4      vds2009  OV C7000 enclosures 1  bay 1   ENC2006
5      tdm2081  OV C7000 enclosures 1  bay 1   ENC2011
6      tdm1024  OV C7000 enclosures 1  bay 1   ENC1003
7      vds1009  OV C7000 enclosures 1  bay 1   ENC1006
8      vds1023  OV C7000 enclosures 1  bay 1   ENC1007
9      vds0003  OV C7000 enclosures 1  bay 1   ENT0003
10     tdm7123  OV C7000 enclosures 1  bay 1   ENC7003
--------------- stripped lines ----------------------

Code:

import pandas as pd

df = pd.read_csv("testcreate.csv", sep="\t")
df = df[['Server', 'Server Name', 'Bay #', 'Appliance Name']]
df['Bay'] = df['Server'].str.split(',').str[1].str.lower()
df['Enclosure'] = df['Server'].str.split(',').str[0].str.upper()
df['Server Name'] = df['Server Name'].str.split('.').str[0]
df = df.drop(['Server', 'Bay #'], axis=1)
df = df[df['Appliance Name'].str.contains('C7000')]
df = pd.concat(
    [g.set_index('Bay')['Server Name'].rename(f'{n}')
     for n, g in df.groupby('Enclosure')],
    axis=1, sort=False,
)

The above code works for me and produces the result below, but I'd like to learn the right way of writing it from the experts.

Returned Sample Result: This works.

       ENC1002  ENC1003  ENC1005
bay 1  tdm1012  tdm1024  vds1001

Peilonrayz · Accepted Answer · 2025-01-25 21:10:49Z

As @KarnKumar said in the comments, you should load only the columns you care about during read_csv, by using usecols.
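A minimal sketch of what usecols does, assuming a two-row in-memory sample (built here with io.StringIO, and with fewer columns than the real file) in place of the actual testcreate file:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real tab-separated file.
sample = io.StringIO(
    "Status\tServer\tServer Name\tBay #\tAppliance Name\n"
    "Critical\tenc2010, bay 1\ttdm2066.example.com\t1\tOV C7000 enclosures 1\n"
    "OK\tenc1011, bay 1\ttdm1068.example.com\t1\tOV C7000 enclosures 1\n"
)

# Only the listed columns are kept; the rest never reach the DataFrame.
df = pd.read_csv(
    sample,
    sep="\t",
    usecols=["Server", "Server Name", "Appliance Name"],
)
print(df.columns.tolist())
```

The unwanted columns are dropped at parse time, so the later column-selection and drop steps from the question become unnecessary.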

The filename itself is a problem, because it isn't a CSV; it's a TSV.

The repeated split()-and-index calls aren't the worst, but I prefer regex parsing with named captures. Pandas has good support for this.
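As a small illustration of named captures with str.extract, here is a sketch on a made-up three-element Series (the "malformed" entry is an assumption added to show the failure mode, not something from the question's data):

```python
import pandas as pd

# Hypothetical values shaped like the question's 'Server' column.
servers = pd.Series(["enc2010, bay 1", "ENC1011, bay 1", "malformed"])

# Each named group becomes a column of the resulting DataFrame.
parts = servers.str.extract(r"^(?P<Enclosure>[^,]+),\s*(?P<Bay>[^,]+)$")
parts["Enclosure"] = parts["Enclosure"].str.upper()
print(parts)
```

One pattern yields both columns at once, and a row that does not match the pattern comes back as missing values in every captured column, which makes malformed input easy to spot.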

You should do your C7000 filter sooner, so that there's less data to process later.

The output format is bizarre and unhelpful. Perhaps this is an X/Y problem and there's a specific reason it looks that way, but in the absence of any justification, just output a two-column dataframe with a server name index:

import pandas as pd

df = pd.read_csv(
    'testcreate.tsv', sep='\t',
    usecols=('Server', 'Server Name', 'Appliance Name'),
)
df = df[df['Appliance Name'].str.contains('C7000')]

server = df['Server Name'].str.extract(
    r'''(?x)
    ^            # start
    (?P<Server>  # named server capture
        [^.]+    # non-dots (first DNS portion), greedy
    )
    ''',
    expand=False,
)

enclosure_bay = df['Server'].str.extract(
    r'''(?x)
    ^               # start
    (?P<Enclosure>  # named capture
        [^,]+       # non-commas
    )
    ,\ *            # comma, optional spaces
    (?P<Bay>        # named capture
        [^,]+?      # non-commas, lazy
    )
    $               # end
    '''
)

out = pd.DataFrame(
    index=server,
    data={
        'Enclosure': enclosure_bay['Enclosure'].str.upper().values,
        'Bay': enclosure_bay['Bay'].str.lower().values,
    },
)
print(out)

'''
Original output:
Bay,ENC1003,ENC1006,ENC1007,ENC1011,ENC1012,ENC2004,ENC2006,ENC2010,ENC2011
bay 1,tdm1024,vds1009,vds1023,tdm1068,tdm1083,tdm2033,vds2009,tdm2066,tdm2081

New output:
        Enclosure    Bay
Server
tdm2066   ENC2010  bay 1
tdm1068   ENC1011  bay 1
tdm1083   ENC1012  bay 1
tdm2033   ENC2004  bay 1
vds2009   ENC2006  bay 1
tdm2081   ENC2011  bay 1
tdm1024   ENC1003  bay 1
vds1009   ENC1006  bay 1
vds1023   ENC1007  bay 1
'''
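If the wide layout from the question (one column per enclosure) turns out to be required after all, DataFrame.pivot may be a simpler route than concatenating per-group series. This is only a sketch: the three-row frame below is a hypothetical stand-in for the long-format result with its index reset, and it assumes each enclosure/bay pair occurs once.

```python
import pandas as pd

# Hypothetical long-format frame: one row per server.
long_df = pd.DataFrame({
    "Server": ["tdm2066", "tdm1068", "tdm1083"],
    "Enclosure": ["ENC2010", "ENC1011", "ENC1012"],
    "Bay": ["bay 1", "bay 1", "bay 1"],
})

# One row per bay, one column per enclosure. pivot raises ValueError
# if an (Enclosure, Bay) pair appears more than once.
wide = long_df.pivot(index="Bay", columns="Enclosure", values="Server")
print(wide)
```

Unlike the groupby-and-concat approach, the enclosure columns come out sorted by name, and duplicated enclosure/bay pairs fail loudly instead of silently producing extra rows.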
