There are a few ways to do this a little more efficiently:
JSON module, then into Pandas
You could try reading the JSON file directly as a JSON object (i.e. into a Python dictionary) using the json module:
import json
import pandas as pd

# Read the whole file into a Python dictionary, closing the file afterwards
with open("your_file.json", "r") as f:
    data = json.load(f)
df = pd.DataFrame.from_dict(data, orient="index")
Using orient="index" might be necessary, depending on the shape/mappings of your JSON file.
Check out this in-depth tutorial on JSON files with Python.
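As a rough illustration of why the orient argument matters, here is a small sketch with made-up data (the keys row_a, row_b, x and y are just placeholders, not anything from your file). It builds the same nested dictionary into a DataFrame both ways:

```python
import pandas as pd

# Hypothetical data: outer keys are row labels, inner keys are column names
data = {
    "row_a": {"x": 1, "y": 2},
    "row_b": {"x": 3, "y": 4},
}

# orient="index": the outer keys become the row index
by_index = pd.DataFrame.from_dict(data, orient="index")

# orient="columns" (the default): the outer keys become the column names
by_columns = pd.DataFrame.from_dict(data, orient="columns")

print(by_index)
print(by_columns)
```

If your DataFrame comes out transposed compared to what you expected, switching orient is usually the first thing to try.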
Directly using Pandas
You said this option gives you a memory error, but there is an option that should help with it. Pass lines=True and then specify how many lines to read in one chunk with the chunksize argument. The following will return an object that you can iterate over, and each iteration will read only 5 lines of the file:
df = pd.read_json("test.json", orient="records", lines=True, chunksize=5)
Note here that the JSON file must be in the records format and line-delimited, meaning each line holds one self-contained record. This lets Pandas know that it can reliably read chunksize=5 lines at a time. Here is the relevant documentation on line-delimited JSON files. In short, the file should have been written using something like: df.to_json(..., orient="records", lines=True).
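To make the chunked reading concrete, here is a minimal sketch. It first writes a small made-up line-delimited file (the columns id and value and the temporary path are assumptions for the demo only), then iterates over it in chunks of 5 rows and keeps only running totals rather than the rows themselves:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, written as line-delimited JSON (one record per line)
sample = pd.DataFrame({"id": range(12), "value": [i * 0.5 for i in range(12)]})
path = os.path.join(tempfile.mkdtemp(), "test.json")
sample.to_json(path, orient="records", lines=True)

total_rows = 0
value_sum = 0.0

# With chunksize set, read_json returns an iterator of DataFrames
reader = pd.read_json(path, orient="records", lines=True, chunksize=5)
for chunk in reader:
    # Each chunk holds at most 5 rows; reduce it before moving to the next
    total_rows += len(chunk)
    value_sum += chunk["value"].sum()

print(total_rows, value_sum)
```

The memory saving only holds if you reduce or write out each chunk as you go. Collecting all chunks in a list and concatenating them at the end brings you back to holding the whole file in memory.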
Not only does Pandas abstract some manual parts away for you, it offers a lot more options, such as converting dates correctly, specifying the data type of each column and so on. Check out the relevant documentation.
There is also a short code example in the Pandas user guide.
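As a sketch of those extra options, the snippet below uses the dtype and convert_dates arguments of pd.read_json on a tiny in-memory example. The column names id, created and score are invented for illustration, and the exact conversion behaviour may vary a little between Pandas versions:

```python
import io

import pandas as pd

# Hypothetical line-delimited JSON with an id, a date string and a numeric string
raw = (
    '{"id": 1, "created": "2021-03-01", "score": "7"}\n'
    '{"id": 2, "created": "2021-03-02", "score": "9"}\n'
)

df = pd.read_json(
    io.StringIO(raw),
    orient="records",
    lines=True,
    dtype={"id": "int32", "score": "float64"},  # explicit data type per column
    convert_dates=["created"],  # parse this column as dates
)

print(df.dtypes)
```

Choosing smaller data types (for example int32 instead of the default int64) is another easy way to reduce the memory footprint of a large file.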
Another memory-saving trick - using Generators
There is a nice way to only have one file's contents in memory at any given time, using Python generators, which have lazy evaluation. Here is a starting place to learn about them.
In your example, it could look like this:
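import os
import pandas as pd

folder = "your_folder"
# Get a sorted list of the file names in the folder
files = sorted(os.listdir(folder))
# Load each file individually in a generator expression
# (the generator needs its own parentheses because more arguments follow it)
df = pd.concat((pd.read_json(os.path.join(folder, f), orient="index") for f in files), ...)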
The concatenation happens only once all files are read. Add any further parameters that are required where I left the ... placeholder. The documentation for pd.concat is here.
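If the concatenated result is itself too big, one variation is to wrap the loading in a generator function and reduce each file to something small before moving on. This is only a sketch under assumptions: iter_frames is a hypothetical helper, and the demo folder, file names and column x are made up so the example can run on its own:

```python
import os
import tempfile

import pandas as pd


def iter_frames(folder):
    """Yield one DataFrame per JSON file, loading each file only when requested."""
    for name in sorted(os.listdir(folder)):
        if name.endswith(".json"):
            yield pd.read_json(os.path.join(folder, name), orient="index")


# Hypothetical demo folder with two small files in the "index" orientation
folder = tempfile.mkdtemp()
for i in range(2):
    sample = pd.DataFrame({"x": [i, i + 1]}, index=[f"first_{i}", f"second_{i}"])
    sample.to_json(os.path.join(folder, f"part_{i}.json"), orient="index")

# Reduce each file to a small summary before keeping it, so the full
# contents of one file can be freed before the next one is read
summaries = [frame["x"].sum() for frame in iter_frames(folder)]
print(summaries)
```

Here only the per-file sums are kept, so at most one file's DataFrame is alive at a time. Replace the sum with whatever filtering or aggregation you actually need.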