
I have a .txt file that looks like this:

```
SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH
SHT1 E: T1:30.53°C H1:58.69 %RH
SHT2 S: T2:29.95°C H2:67.22 %RH
```

I want to have a DataFrame that looks like this:

```
      T1     H1     T2     H2
0  30.45  59.14  29.93  67.38
1  30.49  58.87  29.94  67.22
2  30.53  58.69  29.95  67.22
```

I parse this by:

  1. Reading the text file line by line
  2. Parsing each line: keeping only the parts with T1, T2, H1, and H2, splitting on `:`, and stripping `°C` and `%RH`
  3. This produces a list of two-item lists
  4. Flattening that list of lists
  5. Chopping the flat list into four-item lists
  6. Dumping those into a DataFrame
  7. Writing the DataFrame to an Excel file

Here's the code:

```python
import itertools

import pandas as pd


def read_lines(file_object) -> list:
    return [
        parse_line(line)
        for line in file_object.readlines()
        if line.strip()
    ]


def parse_line(line: str) -> list:
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


def flatten(parsed_lines: list) -> list:
    return list(itertools.chain.from_iterable(parsed_lines))


def cut_into_pieces(flattened_lines: list, piece_size: int = 4) -> list:
    return [
        flattened_lines[i:i + piece_size]
        for i in range(0, len(flattened_lines), piece_size)
    ]


with open("your_text_data.txt") as data:
    df = pd.DataFrame(
        cut_into_pieces(flatten(read_lines(data))),
        columns=["T1", "H1", "T2", "H2"],
    )

print(df)
df.to_excel("your_table.xlsx", index=False)
```

This works and I get what I want, but steps 3, 4, and 5 feel like redundant work: building a list of lists just to flatten it and then chop it up again.
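One way to sidestep the flatten-then-chop round trip (a sketch, reusing the `parse_line` helper from the question; the `sample` string here stands in for the file): since each output row comes from one SHT1 line followed by one SHT2 line, pair the even- and odd-indexed parsed lines directly with `zip` and concatenate each pair.

```python
import io

import pandas as pd

# Sample of the sensor log format from the question (two lines per reading).
sample = (
    "SHT1 E: T1:30.45°C H1:59.14 %RH\n"
    "SHT2 S: T2:29.93°C H2:67.38 %RH\n"
)


def parse_line(line: str) -> list:
    # Same parsing as in the question: keep the T*/H* tokens, strip the units.
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


parsed = [parse_line(line) for line in io.StringIO(sample) if line.strip()]

# Each table row is one SHT1 line followed by one SHT2 line, so pair the
# even- and odd-indexed parsed lines and concatenate each pair.
rows = [sht1 + sht2 for sht1, sht2 in zip(parsed[::2], parsed[1::2])]
df = pd.DataFrame(rows, columns=["T1", "H1", "T2", "H2"])
```

With a real file, `io.StringIO(sample)` would simply be the open file object; the intermediate `flatten`/`cut_into_pieces` steps disappear.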


Question:

How could I simplify the whole parsing process? Or maybe most of the heavy-lifting can be done with pandas alone?

Also, any other feedback is more than welcome.


    2 Answers


    Disclaimer: I know this is a very liberal interpretation of a code review since it suggests an entirely different approach. I still thought it might provide a useful perspective when thinking about such problems in the future and reducing coding effort.

    I would suggest the following approach using regex to extract all the numbers that match the format "12.34".

    ```python
    import re

    import pandas as pd

    with open("your_text_data.txt") as data_file:
        data_list = re.findall(r"\d\d\.\d\d", data_file.read())

    result = [data_list[i:i + 4] for i in range(0, len(data_list), 4)]
    df = pd.DataFrame(result, columns=["T1", "H1", "T2", "H2"])
    print(df)
    df.to_excel("your_table.xlsx", index=False)
    ```

    This will of course only work for the current data format you provided. The code will need to be adjusted if the format of your data changes. For example: If relevant numbers may contain a varying number of digits, you might use the regex "\d+\.\d+" to match all numbers that contain at least one digit on either side of the decimal point.
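For illustration, a quick check of that more tolerant pattern (the input string here is made up to include numbers of varying widths):

```python
import re

# \d+\.\d+ accepts any number of digits on either side of the decimal
# point, unlike the fixed-width \d\d\.\d\d from the answer above.
text = "T1:9.5°C H1:59.14 %RH T2:102.3°C H2:7.08 %RH"
numbers = re.findall(r"\d+\.\d+", text)
# numbers is now ["9.5", "59.14", "102.3", "7.08"]
```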

    Also, please note the use of the context manager with open(...) as x:. Only code that accesses the file object needs to be (and should be) inside the managed context.
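A minimal sketch of that scoping advice (the temp-file setup exists only to make the snippet self-contained; the data is made up):

```python
import re
import tempfile

# Create a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("SHT1 E: T1:30.45°C H1:59.14 %RH\n")
    path = tmp.name

# Only the read needs the file open; parsing happens after it is closed.
with open(path, encoding="utf-8") as data_file:
    raw = data_file.read()

data_list = re.findall(r"\d\d\.\d\d", raw)
```

Keeping only the `read()` inside the `with` block means the file handle is released as early as possible, while the parsing logic stays outside the context.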

    • I absolutely don't mind that you've offered a new approach. I totally forgot about regex, I was so much into those lists of lists. This is short, simple, and does the job. Nice! Thank you for your time and insight.
      – baduker, Mar 26, 2021 at 20:36
    • PS. You've got your imports the other way round. re should be first and then pandas.
      – baduker, Mar 26, 2021 at 20:37
    • You're right, I fixed the import order!
      – Mar 26, 2021 at 23:01

    You can use numpy.loadtxt() to read the data and numpy.reshape() to get the shape you want. The default is to split on whitespace with a dtype of float. usecols selects the columns we want. converters is a dict mapping column numbers to functions that convert the column data; here they chop off the unwanted text. The .reshape(-1, 4) converts the resulting NumPy array from two columns to four (the -1 lets NumPy calculate the number of rows).

    ```python
    import numpy as np

    # src is the already-open text file
    src.seek(0)
    data = np.loadtxt(
        src,
        usecols=(2, 3),
        converters={2: lambda s: s[3:-2], 3: lambda s: s[3:]},
    ).reshape(-1, 4)
    ```

    Then just load it in a dataframe and name the columns:

    ```python
    df = pd.DataFrame(data, columns="T1 H1 T2 H2".split())
    df
    ```

    Output:

    ```
          T1     H1     T2     H2
    0  30.45  59.14  29.93  67.38
    1  30.49  58.87  29.94  67.22
    2  30.53  58.69  29.95  67.22
    ```
