I have a .txt file that looks like this:
SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH
SHT1 E: T1:30.53°C H1:58.69 %RH
SHT2 S: T2:29.95°C H2:67.22 %RH
I want to have a DataFrame that looks like this:
      T1     H1     T2     H2
0  30.45  59.14  29.93  67.38
1  30.49  58.87  29.94  67.22
2  30.53  58.69  29.95  67.22
I parse this by:
1. Reading the text file line by line
2. Parsing each line, i.e. keeping only the parts that start with T1, T2, H1, or H2, splitting on ":", and removing °C and %RH
3. The above produces a list of lists, each with two items
4. I flatten the list of lists
5. Just to chop it up again into a list of four-item lists
6. Dump that into a df
7. Write it to an Excel file
Here's the code:
import itertools

import pandas as pd


def read_lines(file_object) -> list:
    return [
        parse_line(line) for line in file_object.readlines()
        if line.strip()
    ]


def parse_line(line: str) -> list:
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


def flatten(parsed_lines: list) -> list:
    return list(itertools.chain.from_iterable(parsed_lines))


def cut_into_pieces(flattened_lines: list, piece_size: int = 4) -> list:
    return [
        flattened_lines[i:i + piece_size]
        for i in range(0, len(flattened_lines), piece_size)
    ]


with open("your_text_data.txt") as data:
    df = pd.DataFrame(
        cut_into_pieces(flatten(read_lines(data))),
        columns=["T1", "H1", "T2", "H2"],
    )
    print(df)
    df.to_excel("your_table.xlsx", index=False)
This works and I get what I want, but I feel like points 3, 4, and 5 are redundant work, especially creating a list of lists just to flatten it and then chop it up again.
Question:
How could I simplify the whole parsing process? Or could most of the heavy lifting be done with pandas alone?
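To make the question concrete, here is a rough sketch of the direction I have in mind: a single regular expression that captures one SHT1/SHT2 pair as a whole row, so nothing needs to be flattened or re-chunked. The names RECORD, SAMPLE, and parse_text are made up for illustration, and the sketch assumes that every SHT1 line is immediately followed by its SHT2 line, which holds for my sample but may not hold in general.

```python
import re

import pandas as pd

# Inline copy of the first two records from the sample file (illustration only).
SAMPLE = """\
SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH
"""

# Assumption: one record is an SHT1 line directly followed by an SHT2 line.
# Each named group becomes one column of the resulting DataFrame.
RECORD = re.compile(
    r"T1:(?P<T1>-?[\d.]+)°C\s+H1:(?P<H1>-?[\d.]+)\s*%RH\s+"
    r"SHT2\s+S:\s+T2:(?P<T2>-?[\d.]+)°C\s+H2:(?P<H2>-?[\d.]+)\s*%RH"
)


def parse_text(text: str) -> pd.DataFrame:
    """Build one row per SHT1/SHT2 pair found in the text."""
    rows = [match.groupdict() for match in RECORD.finditer(text)]
    # Unlike my current code, this converts the captured strings to floats.
    return pd.DataFrame(rows, columns=["T1", "H1", "T2", "H2"]).astype(float)


if __name__ == "__main__":
    print(parse_text(SAMPLE))
```

With a real file, the text would come from something like open("your_text_data.txt", encoding="utf-8").read(). I believe pandas' Series.str.extractall accepts the same kind of pattern, which might remove the need for the re import, but I have not tried that.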
Also, any other feedback is more than welcome.