
I'm trying to load a large JSON Lines file (.jsons, 2.5 GB) into a Pandas DataFrame. Due to the file's size, pandas.read_json() results in a memory error.

Therefore I'm trying to read it in like this:

    import os
    import json
    import pandas as pd

    S_DIR = r'path-to-directory'
    with open(os.path.join(S_DIR, 'file.jsons')) as json_file:
        data = json_file.readlines()
    data = list(map(json.loads, data))
    df = pd.DataFrame(data)

However, this just keeps running, slowing/crashing my pc.

What would be the most efficient way to do this?

The final aim is to have a subset (sample) of this large file.jsons dataset.

Thanks


    1 Answer

    There are a few ways to do this a little more efficiently:

    JSON module, then into Pandas

    You could try reading the JSON file directly as a JSON object (i.e. into a Python dictionary) using the json module:

    import json
    import pandas as pd

    with open("your_file.json", "r") as f:
        data = json.load(f)
    df = pd.DataFrame.from_dict(data, orient="index")

    Using orient="index" might be necessary, depending on the shape/mappings of your JSON file.

    Check out this in-depth tutorial on JSON files with Python.
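    As a small illustration of what orient="index" does (the dict and its values below are made up for demonstration), the outer dictionary keys become row labels rather than columns:

    ```python
    import pandas as pd

    # Hypothetical JSON-like dict keyed by record id
    data = {
        "a": {"x": 1, "y": 2},
        "b": {"x": 3, "y": 4},
    }

    # orient="index": outer keys become the row index, inner keys become columns
    df = pd.DataFrame.from_dict(data, orient="index")
    ```

    With the default orient="columns", the same dict would instead produce columns "a" and "b", which is usually not what you want for record-shaped data.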

    Directly using Pandas

    You said this option gives you a memory error, but there is an option that should help with it. Pass lines=True and then specify how many lines to read per chunk using the chunksize argument. The following returns an object that you can iterate over, and each iteration reads only 5 lines of the file:

    df = pd.read_json("test.json", orient="records", lines=True, chunksize=5) 

    Note here that the JSON file must be in the records format, meaning each line is list-like (one record per line). This lets Pandas know that it can reliably read chunksize=5 lines at a time. Here is the relevant documentation on line-delimited JSON files. In short, the file should have been written using something like: df.to_json(..., orient="records", lines=True).

    Not only does Pandas abstract away some of the manual parsing for you, it also offers many more options, such as converting dates correctly, specifying the data type of each column, and so on. Check out the relevant documentation.

    Check out a little code example in the Pandas user guide documentation.
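    Since your final aim is a subset of the file, the chunked reader combines nicely with DataFrame.sample. Here is a sketch; it writes a tiny stand-in file first so it runs standalone, and the file name, chunk size and sampling fraction are placeholders to adapt to your case:

    ```python
    import json
    import pandas as pd

    # Write a small line-delimited stand-in for the real 2.5 GB file
    with open("demo.jsons", "w") as f:
        for i in range(100):
            f.write(json.dumps({"id": i, "value": i * 2}) + "\n")

    # Iterate in chunks of 25 lines; only one chunk is in memory at a time
    reader = pd.read_json("demo.jsons", orient="records", lines=True, chunksize=25)

    # Keep a 20% sample of each chunk to build a manageable subset
    subset = pd.concat(chunk.sample(frac=0.2, random_state=0) for chunk in reader)
    ```

    Because sampling happens per chunk, the full file never needs to fit in memory; only the accumulated subset does.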

    Another memory-saving trick - using Generators

    There is a nice way to only have one file's contents in memory at any given time, using Python generators, which have lazy evaluation. Here is a starting place to learn about them.

    In your example, it could look like this:

    import os
    import pandas as pd

    # Get a sorted list of files
    folder = "your_folder"
    files = sorted(os.listdir(folder))

    # Load each file individually in a generator expression
    df = pd.concat(pd.read_json(os.path.join(folder, f), orient="index") for f in files)

    The concatenation happens only once all files have been read; pass any additional parameters you need to the pd.read_json call. The documentation for pd.concat is here.
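    The same generator idea also works within a single large file: instead of readlines() (which pulls all 2.5 GB into memory at once), parse one line at a time and keep only a random fraction. A sketch, with a made-up file name and sampling rate; the demo file is created inline so the snippet runs standalone:

    ```python
    import json
    import random
    import pandas as pd

    # Stand-in for the real line-delimited file
    with open("big.jsons", "w") as f:
        for i in range(1000):
            f.write(json.dumps({"id": i}) + "\n")

    def sampled_records(path, keep=0.1, seed=0):
        """Lazily parse one JSON record per line, yielding roughly `keep` of them."""
        rng = random.Random(seed)
        with open(path) as f:
            for line in f:
                if rng.random() < keep:
                    yield json.loads(line)

    # Only the sampled records are ever materialised in memory
    df = pd.DataFrame(sampled_records("big.jsons"))
    ```

    The generator holds one line at a time, so peak memory is driven by the size of the sample, not the size of the file.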

    • I'm not sure I understand the lines=True part. As far as I know, we can only write to JSON with that argument if the orient is "records". Of course this is about reading, but how could we read lines if we assume that orient is "index" here, when it is possible the JSON was written exactly like that before, without any lines=True argument? Are these two arguments independent of each other between reading and writing? The documentation does not mention any restrictions for reading, so I'm not sure whether this is the reason or not.
      – MattSom
      Commented May 14, 2020 at 12:12
    • You can indeed only read using lines=True if Pandas is able to parse those lines. So the JSON file should be in the format produced by e.g. df.to_json(..., orient="records", lines=True). So each line, like you say, needs to be a record, which really just means it must be list-like.
      – n1k31t4
      Commented May 14, 2020 at 12:46
    • Thank you, yes, I just tried to read it with index orient and it gave a Memory Error all the same. If that is the case, it means df = pd.read_json("test.json", orient="index", lines=True, chunksize=5) should be wrong, shouldn't it?
      – MattSom
      Commented May 14, 2020 at 12:55
    • Thanks, I have updated the snippet, added a short explanation and a link to the docs.
      – n1k31t4
      Commented May 14, 2020 at 13:05
    • I've made a question regarding a solution to this huge one-liner index format. Do you have any suggestions?
      – MattSom
      Commented May 14, 2020 at 17:37
