
Python Pandas - Working with HDF5 Format
When working with large datasets, loading everything into memory at once can cause "out of memory" errors. Storing the data in an optimized format such as HDF5 helps avoid these problems. The pandas library provides the HDFStore class and read/write APIs to store, retrieve, and manipulate HDF5 data while keeping memory usage and retrieval time low.
HDF5, which stands for Hierarchical Data Format version 5, is an open-source file format designed to store large, complex, and heterogeneous data efficiently. It organizes data in a hierarchical structure similar to a file system, with groups acting like directories and datasets acting like files. Because a single file can hold different kinds of data (such as arrays, images, tables, and documents), the format is well suited to managing heterogeneous data.
Creating an HDF5 File Using HDFStore in Pandas
The HDFStore class in pandas is a dictionary-like object that reads and writes pandas data in the HDF5 format using the PyTables library.
Example
The following example demonstrates how to create an HDF5 file in Pandas using the pandas.HDFStore class.
    import pandas as pd

    # Create the store using the HDFStore class
    store = pd.HDFStore("store.h5")

    # Display the store
    print(store)

    # It is important to close the store after use
    store.close()
Following is the output of the above code −
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
Note: To work with HDF5 format in pandas, you need the pytables library. It is an optional dependency for pandas and must be installed separately using one of the following commands −
    # Using pip
    pip install tables

    # or using the conda installer
    conda install pytables
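The example in this section closes the store with an explicit store.close() call. As a minimal sketch, the same store can be opened in a with block, which closes it automatically when the block ends. The temporary directory is an assumption made here so that the sketch leaves no files behind; it is not part of the original example.

```python
import os
import tempfile

import pandas as pd

# A temporary directory (an assumption of this sketch) keeps the file out of the working folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "store.h5")

    # The with block closes the store on exit, even if an error is raised inside it
    with pd.HDFStore(path) as store:
        opened_inside = store.is_open
        print("Open inside the block:", opened_inside)

    closed_after = not store.is_open
    print("Closed after the block:", closed_after)
```

This form avoids leaving a file handle open when an exception interrupts the code before store.close() is reached.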
Writing and Reading Data Using HDFStore in Pandas
Because HDFStore is a dict-like object, we can write data to and read data from the HDF5 store directly using key-value pairs.
Example
The following example writes a Series and a DataFrame to an HDF5 file and then reads the DataFrame back using HDFStore.
    import pandas as pd
    import numpy as np

    # Create the store
    store = pd.HDFStore("store.h5")

    # Create the data
    index = pd.date_range("1/1/2024", periods=8)
    s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
    df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

    # Write Pandas data to the store, which is equivalent to store.put('s', s)
    store["s"] = s
    store["df"] = df

    # Read data from the store, which is equivalent to store.get('df')
    from_store = store["df"]
    print('Retrieved Data From the HDFStore:\n', from_store)

    # Close the store after use
    store.close()
Following is the output of the above code. The values are random, so they differ on every run −
Retrieved Data From the HDFStore:
| | A | B | C |
|---|---|---|---|
| 2024-01-01 | 0.200467 | 0.341899 | 0.105715 |
| 2024-01-02 | -0.379214 | 1.527714 | 0.186246 |
| 2024-01-03 | -0.418122 | 1.008820 | 1.331104 |
| 2024-01-04 | 0.146418 | 0.587433 | -0.750389 |
| 2024-01-05 | -0.556524 | -0.551443 | -0.161225 |
| 2024-01-06 | -0.214145 | -0.722693 | 0.072083 |
| 2024-01-07 | 0.631878 | -0.521474 | -0.769847 |
| 2024-01-08 | -0.361999 | 0.435252 | 1.177110 |
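The dict-like behaviour of HDFStore goes beyond setting and getting items, and the hierarchical layout of HDF5 shows up in the keys themselves. The sketch below is illustrative only: the key names "raw/series", "raw/frame", and "summary" and the temporary file location are assumptions made here. It shows that slashes in a key create nested groups, and that keys(), membership tests, and del behave much as they do on a dictionary.

```python
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "store.h5")

    with pd.HDFStore(path) as store:
        # Slashes in a key create nested groups, much like folders (key names are hypothetical)
        store["raw/series"] = pd.Series(np.arange(3), index=["a", "b", "c"])
        store["raw/frame"] = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
        store["summary"] = pd.DataFrame({"total": [10]})

        # keys() lists every stored object with its full path
        all_keys = sorted(store.keys())
        print(all_keys)

        # Membership tests and deletion work as they do on a dict
        has_summary = "summary" in store
        del store["summary"]
        keys_after_delete = sorted(store.keys())
        print(keys_after_delete)
```

Grouping related objects under a common prefix keeps a large file navigable without needing a separate file per object.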
Reading and Writing HDF5 Files Using Pandas APIs
Pandas also provides high-level APIs that simplify working with HDF5 files. These APIs read and write data directly, without the need to create an HDFStore object manually. Following are the primary APIs for handling HDF5 files in pandas −
- pandas.read_hdf(): Reads a pandas object from an HDF5 file.
- pandas.DataFrame.to_hdf() or pandas.Series.to_hdf(): Writes a pandas object to an HDF5 file.
Writing Pandas Data to HDF5 Using to_hdf()
The to_hdf() method writes pandas objects such as DataFrames and Series directly to an HDF5 file. It accepts optional parameters that control compression (complevel, complib), the handling of missing values (dropna, nan_rep), the storage format (format), and more, so that the data can be stored efficiently.
Example
This example uses the DataFrame.to_hdf() function to write data to the HDF5 file.
    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

    # Write data to an HDF5 file using to_hdf()
    df.to_hdf("data_store.h5", key="df", mode="w", format="table")

    print("Data successfully written to HDF5 file")
Following is the output of the above code −
Data successfully written to HDF5 file
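The optional compression parameters of to_hdf() can be tried out with a small sketch like the one below. The DataFrame contents, the file names, and the choice of complevel=9 with complib="zlib" are assumptions made here for illustration. The actual size difference depends on the data, so the sizes are printed rather than stated.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Repetitive data, chosen so that compression has something to work with
df = pd.DataFrame({"A": np.zeros(50_000), "B": np.ones(50_000)})

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.h5")
    packed_path = os.path.join(tmp_dir, "packed.h5")

    # Default settings: no compression
    df.to_hdf(plain_path, key="df", mode="w", format="table")

    # complevel ranges from 0 (off) to 9 (highest); zlib is one of the supported libraries
    df.to_hdf(packed_path, key="df", mode="w", format="table",
              complevel=9, complib="zlib")

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)
    print("Uncompressed bytes:", plain_size)
    print("Compressed bytes:  ", packed_size)

    # Compression is transparent when reading the data back
    restored = pd.read_hdf(packed_path, key="df")
    print(restored.shape)
```

Higher compression levels usually trade slower writes for smaller files, so the best setting depends on how often the file is written compared with how often it is read.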
Reading Data from HDF5 Using read_hdf()
The pandas.read_hdf() function retrieves a pandas object stored in an HDF5 file. Its first argument is the file path (a string or path object) or an open HDFStore, and the key argument identifies the object to read.
Example
This example demonstrates how to read data stored under the key "df" from the HDF5 file "data_store.h5" using the pd.read_hdf() method.
    import pandas as pd

    # Read data from the HDF5 file using read_hdf()
    retrieved_df = pd.read_hdf("data_store.h5", key="df")

    # Display the retrieved data
    print("Retrieved Data:\n", retrieved_df.head())
Following is the output of the above code −
Retrieved Data:
| | A | B |
|---|---|---|
| x | 1 | 4 |
| y | 2 | 5 |
| z | 3 | 6 |
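Reading an entire object is not always necessary. When data is stored with format="table", only part of it can be loaded, which is one way to keep memory usage down for large files. The following is a minimal sketch under stated assumptions: the sample DataFrame, the data_columns=["A"] setting, and the chunk size of 4 are all chosen here for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": range(10), "B": range(10, 20)})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_store.h5")

    # data_columns marks "A" as queryable; this requires format="table"
    df.to_hdf(path, key="df", mode="w", format="table", data_columns=["A"])

    # Only rows matching the condition are loaded
    filtered = pd.read_hdf(path, key="df", where="A >= 7")
    print(filtered)

    # start/stop select a row range, columns limits which columns are returned
    first_rows = pd.read_hdf(path, key="df", start=0, stop=3, columns=["B"])
    print(first_rows)

    # chunksize makes select() return an iterator of DataFrames
    chunk_lengths = []
    with pd.HDFStore(path, mode="r") as store:
        for chunk in store.select("df", chunksize=4):
            chunk_lengths.append(len(chunk))
    print(chunk_lengths)
```

Processing one chunk at a time means that only a few rows need to be in memory at once, whatever the size of the file.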
Appending Data to HDF5 Files Using to_hdf()
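Data can be appended to an existing HDF5 file by calling to_hdf() with mode="a", format="table", and append=True. The mode="a" option opens the file without erasing it, and append=True adds the new rows to the object already stored under the given key instead of replacing it. This is useful when you want to add new data to a file without overwriting the existing content.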
Example
This example demonstrates how to append data to an existing HDF5 file using the to_hdf() method.
    import pandas as pd

    # Create a DataFrame to append
    df_new = pd.DataFrame({'A': [7, 8], 'B': [1, 1]}, index=['i', 'j'])

    # Append the new data to the existing HDF5 file
    df_new.to_hdf("data_store.h5", key="df", mode="a", format="table", append=True)

    print("Data successfully appended")

    # Now read data from the HDF5 file using read_hdf()
    retrieved_df = pd.read_hdf("data_store.h5", key='df')

    # Display the retrieved data
    print("Retrieved Data:\n", retrieved_df.head())
Following is the output of the above code −
Data successfully appended
Retrieved Data:
| | A | B |
|---|---|---|
| x | 1 | 4 |
| y | 2 | 5 |
| z | 3 | 6 |
| i | 7 | 1 |
| j | 8 | 1 |
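To see why append=True matters, the sketch below writes the same two DataFrames twice, once with append=True and once without it. The temporary file location and the variable names are assumptions made here for illustration; the data mirrors the examples above.

```python
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])
second = pd.DataFrame({"A": [7, 8], "B": [1, 1]}, index=["i", "j"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_store.h5")

    # Case 1: append=True adds the new rows to the existing table
    first.to_hdf(path, key="df", mode="w", format="table")
    second.to_hdf(path, key="df", mode="a", format="table", append=True)
    rows_after_append = len(pd.read_hdf(path, key="df"))

    # Case 2: without append=True the object stored under "df" is replaced
    first.to_hdf(path, key="df", mode="w", format="table")
    second.to_hdf(path, key="df", mode="a", format="table")
    rows_after_replace = len(pd.read_hdf(path, key="df"))

    print("Rows with append=True:", rows_after_append)
    print("Rows without append:  ", rows_after_replace)
```

In the second case the file itself is kept, because mode="a" does not erase it, but the object under the key "df" is overwritten by the new DataFrame.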