
I have a .txt file that looks like this:

```
SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH
SHT1 E: T1:30.53°C H1:58.69 %RH
SHT2 S: T2:29.95°C H2:67.22 %RH
```

I want to have a DataFrame that looks like this:

```
      T1     H1     T2     H2
0  30.45  59.14  29.93  67.38
1  30.49  58.87  29.94  67.22
2  30.53  58.69  29.95  67.22
```

I parse this by:

  1. Reading the text file line by line
  2. Parsing each line: keeping only the parts with T1, T2, H1, and H2, splitting on `:`, and stripping `°C` and `%RH`
  3. This produces a list of two-item lists
  4. Flattening that list of lists
  5. Chopping the flat list into four-item lists
  6. Dumping those into a DataFrame
  7. Writing the DataFrame to an Excel file

Here's the code:

```python
import itertools

import pandas as pd


def read_lines(file_object) -> list:
    return [
        parse_line(line)
        for line in file_object.readlines()
        if line.strip()
    ]


def parse_line(line: str) -> list:
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


def flatten(parsed_lines: list) -> list:
    return list(itertools.chain.from_iterable(parsed_lines))


def cut_into_pieces(flattened_lines: list, piece_size: int = 4) -> list:
    return [
        flattened_lines[i:i + piece_size]
        for i in range(0, len(flattened_lines), piece_size)
    ]


with open("your_text_data.txt") as data:
    df = pd.DataFrame(
        cut_into_pieces(flatten(read_lines(data))),
        columns=["T1", "H1", "T2", "H2"],
    )

print(df)
df.to_excel("your_table.xlsx", index=False)
```

This works and I get what I want, but steps 3, 4, and 5 feel like redundant work: building a list of lists just to flatten it and then chop it up again.
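One way to sidestep the flatten-then-chop round trip (a sketch, reusing the `parse_line` helper from the question; the `sample` string here stands in for the file): since each output row comes from one SHT1 line followed by one SHT2 line, pair the even- and odd-indexed parsed lines directly with `zip` and concatenate each pair.

```python
import io

import pandas as pd

# Sample of the sensor log format from the question (two lines per reading).
sample = (
    "SHT1 E: T1:30.45°C H1:59.14 %RH\n"
    "SHT2 S: T2:29.93°C H2:67.38 %RH\n"
)


def parse_line(line: str) -> list:
    # Same parsing as in the question: keep the T*/H* tokens, strip the units.
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]


parsed = [parse_line(line) for line in io.StringIO(sample) if line.strip()]

# Each table row is one SHT1 line followed by one SHT2 line, so pair the
# even- and odd-indexed parsed lines and concatenate each pair.
rows = [sht1 + sht2 for sht1, sht2 in zip(parsed[::2], parsed[1::2])]
df = pd.DataFrame(rows, columns=["T1", "H1", "T2", "H2"])
```

With a real file, `io.StringIO(sample)` would simply be the open file object; the intermediate `flatten`/`cut_into_pieces` steps disappear.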


Question:

How could I simplify the whole parsing process? Or maybe most of the heavy-lifting can be done with pandas alone?

Also, any other feedback is more than welcome.


    2 Answers


    Disclaimer: I know this is a very liberal interpretation of a code review since it suggests an entirely different approach. I still thought it might provide a useful perspective when thinking about such problems in the future and reducing coding effort.

    I would suggest the following approach using regex to extract all the numbers that match the format "12.34".

    ```python
    import re

    import pandas as pd

    with open("your_text_data.txt") as data_file:
        data_list = re.findall(r"\d\d\.\d\d", data_file.read())

    result = [data_list[i:i + 4] for i in range(0, len(data_list), 4)]
    df = pd.DataFrame(result, columns=["T1", "H1", "T2", "H2"])
    print(df)
    df.to_excel("your_table.xlsx", index=False)
    ```

    This will of course only work for the current data format you provided. The code will need to be adjusted if the format of your data changes. For example: If relevant numbers may contain a varying number of digits, you might use the regex "\d+\.\d+" to match all numbers that contain at least one digit on either side of the decimal point.
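For illustration, a quick check of that more tolerant pattern (the input string here is made up to include numbers of varying widths):

```python
import re

# \d+\.\d+ accepts any number of digits on either side of the decimal
# point, unlike the fixed-width \d\d\.\d\d from the answer above.
text = "T1:9.5°C H1:59.14 %RH T2:102.3°C H2:7.08 %RH"
numbers = re.findall(r"\d+\.\d+", text)
# numbers is now ["9.5", "59.14", "102.3", "7.08"]
```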

    Also, please note the use of the context manager with open(...) as x:. Only code that accesses the file object needs to be (and should be) inside the managed context.
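A minimal sketch of that scoping advice (the temp-file setup exists only to make the snippet self-contained; the data is made up):

```python
import re
import tempfile

# Create a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("SHT1 E: T1:30.45°C H1:59.14 %RH\n")
    path = tmp.name

# Only the read needs the file open; parsing happens after it is closed.
with open(path, encoding="utf-8") as data_file:
    raw = data_file.read()

data_list = re.findall(r"\d\d\.\d\d", raw)
```

Keeping only the `read()` inside the `with` block means the file handle is released as early as possible, while the parsing logic stays outside the context.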

    • I absolutely don't mind that you've offered a new approach. I totally forgot about regex, I was so much into those lists of lists. This is short, simple, and does the job. Nice! Thank you for your time and insight.
      – baduker, Mar 26, 2021 at 20:36
    • PS. You've got your imports the other way round. re should be first and then pandas.
      – baduker, Mar 26, 2021 at 20:37
    • You're right, I fixed the import order!
      – Mar 26, 2021 at 23:01

    You can use numpy.loadtxt() to read the data and numpy.reshape() to get the shape you want. The default is to split on whitespace with a dtype of float. usecols selects the columns we want. converters is a dict mapping column numbers to functions that convert the column data; here they chop off the unwanted text. The .reshape(-1, 4) converts the resulting NumPy array from two columns to four (the -1 lets NumPy calculate the number of rows).

    ```python
    import numpy as np

    # src is the already-open text file
    src.seek(0)
    data = np.loadtxt(
        src,
        usecols=(2, 3),
        converters={2: lambda s: s[3:-2], 3: lambda s: s[3:]},
    ).reshape(-1, 4)
    ```

    Then just load it in a dataframe and name the columns:

    ```python
    df = pd.DataFrame(data, columns="T1 H1 T2 H2".split())
    df
    ```

    Output:

    ```
          T1     H1     T2     H2
    0  30.45  59.14  29.93  67.38
    1  30.49  58.87  29.94  67.22
    2  30.53  58.69  29.95  67.22
    ```
