Python Pandas - ORC Format



ORC Format in Pandas

ORC (Optimized Row Columnar) is a binary columnar file format designed for storing and reading tabular data, such as Pandas DataFrames, efficiently. Like the Parquet format, ORC is a standardized, open-source columnar storage format: it stores data compactly, supports fast reads and writes, and makes it easy to share data across different analysis languages.

Python's Pandas library supports the ORC file format through the read_orc() function and the DataFrame.to_orc() method, which read and write data in ORC format respectively. In this tutorial, we will learn how to work with the ORC format using Python's Pandas library.

Important Considerations

When working with the ORC format in Pandas, keep the following key points in mind −

  • Installation Requirements: Pandas requires the pyarrow library for both reading and writing ORC files.

  • Timezone handling: Timezones in datetime columns are not preserved when saving DataFrames in ORC format.

  • Platform support: Reading and writing ORC files with Pandas is not supported on the Windows operating system as of now.
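Because timezones are not preserved, one possible workaround is to store timestamps as timezone-naive UTC values and re-attach the zone after loading. The following is a minimal sketch of that idea using only Pandas; the column names and the "Asia/Kolkata" zone are assumptions chosen for illustration, and the actual to_orc() call is left as a comment.

```python
import pandas as pd

# Hypothetical DataFrame with a timezone-aware datetime column
df = pd.DataFrame({
    "event": ["start", "stop"],
    "when": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 17:30"]).tz_localize("Asia/Kolkata"),
})

# Remember the zone name, then convert to UTC and drop the timezone
zone = str(df["when"].dt.tz)
prepared = df.assign(when=df["when"].dt.tz_convert("UTC").dt.tz_localize(None))

# 'prepared' now holds naive UTC timestamps and could be saved with
# prepared.to_orc("events.orc")

# After loading, interpret the values as UTC and convert back to the zone
restored = prepared.assign(
    when=prepared["when"].dt.tz_localize("UTC").dt.tz_convert(zone)
)

print(prepared)
print(restored)
```

The zone name itself has to be kept somewhere outside the datetime column, for example in a separate column or in your own configuration.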

Saving a Pandas DataFrame to an ORC File

To save a Pandas DataFrame to an ORC file, you can use the DataFrame.to_orc() method, which saves data of the Pandas DataFrame to a file in ORC format.

Note: Before saving or retrieving data from an ORC file, you need to install the 'pyarrow' library. It is recommended to install it using the conda installer to avoid compatibility issues. Use the following command −

    conda install pyarrow
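Before running the examples, you may want to confirm that pyarrow can be found in the current environment. This is a small sketch using only the standard library; the variable name has_pyarrow is just an illustrative choice.

```python
import importlib.util

# find_spec() returns None when the package cannot be located
has_pyarrow = importlib.util.find_spec("pyarrow") is not None

if has_pyarrow:
    import pyarrow
    print("pyarrow version:", pyarrow.__version__)
else:
    print("pyarrow is not installed; install it before using ORC files.")
```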

Example

This example shows how to save a DataFrame to ORC format using the DataFrame.to_orc() method.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")
    print("\nDataFrame is successfully saved as an ORC file.")

When we run the above program, it produces the following result −

    Original DataFrame:
       Col1  Col2  Col3   Col4       Col5
    0     a     1   4.0   True 2024-01-01
    1     b     2   5.0  False 2024-01-02
    2     c     3   6.0   True 2024-01-03

    DataFrame is successfully saved as an ORC file.

If you visit the folder where the program was run, you can see the generated ORC file.

Reading Data from an ORC File

The Pandas read_orc() function reads data from an ORC file and loads it into a Pandas DataFrame. It can read ORC files from a variety of storage backends, including local files, URLs, and cloud storage services.

Example

This example saves a DataFrame to an ORC file and then reads it back using the Pandas read_orc() function.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load the ORC file
    result = pd.read_orc("df_orc_file.orc")

    # Display the DataFrame
    print('Loaded DataFrame:')
    print(result)

    # Verify data types
    print("\nData Type of the each column:")
    print(result.dtypes)

While executing the above code, we get the following output −

    Loaded DataFrame:
       Col1  Col2  Col3   Col4       Col5
    0     a     1   4.0   True 2024-01-01
    1     b     2   5.0  False 2024-01-02
    2     c     3   6.0   True 2024-01-03

    Data Type of the each column:
    Col1            object
    Col2             int64
    Col3           float64
    Col4              bool
    Col5    datetime64[ns]
    dtype: object

Reading Specific Columns from an ORC File

You can specify which columns to load from the ORC file using the columns parameter of the read_orc() method.

Example

This example shows how to load only selected columns from an ORC file using the columns parameter of the read_orc() function. Here, we load only the "Col1", "Col4", and "Col5" columns.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load the ORC file, reading only the selected columns
    result = pd.read_orc("df_orc_file.orc", columns=["Col1", "Col4", "Col5"])

    print("\nDataFrame with Selected Columns:")
    print(result)

When we run the above program, it produces the following result −

    Original DataFrame:
       Col1  Col2  Col3   Col4       Col5
    0     a     1   4.0   True 2024-01-01
    1     b     2   5.0  False 2024-01-02
    2     c     3   6.0   True 2024-01-03

    DataFrame with Selected Columns:
       Col1   Col4       Col5
    0     a   True 2024-01-01
    1     b  False 2024-01-02
    2     c   True 2024-01-03

Using In-Memory Buffers for ORC Files

ORC data can be written to and read from memory buffers directly, for scenarios where disk I/O is not desirable. An in-memory buffer in Python stores the data in RAM rather than on disk, which is useful for temporary data processing.

Example

This example demonstrates writing a DataFrame to an in-memory buffer in ORC format and reading it back, using the DataFrame.to_orc() method and the read_orc() function with the help of the io.BytesIO class.

    import pandas as pd
    import io

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to an in-memory buffer
    buffer = io.BytesIO()
    df.to_orc(buffer)

    # Rewind the buffer to the beginning before reading
    buffer.seek(0)

    # Read the ORC data from the buffer
    loaded_df = pd.read_orc(buffer)

    print("\nDataFrame loaded from In-Memory ORC file:")
    print(loaded_df)

Following is the output of the above code −

    Original DataFrame:
       Col_1  Col_2
    0      0      5
    1      1      6
    2      2      7
    3      3      8
    4      4      9

    DataFrame loaded from In-Memory ORC file:
       Col_1  Col_2
    0      0      5
    1      1      6
    2      2      7
    3      3      8
    4      4      9

Handling Indexes in ORC Files

The ORC format does not support serializing a non-default index; attempting to save a DataFrame with one raises a ValueError. To preserve the index data, reset the index with the reset_index() method before saving, so that the index is written to the output file as a regular column.

Example

This example uses the reset_index() method to save the custom index of a DataFrame as a column when writing it to an ORC file.

    import pandas as pd

    # Create a DataFrame with a custom index
    df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]}, index=["r1", "r2", "r3"])

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame with its index as a column
    df.reset_index().to_orc("data_no_index.orc")

    # Read the file
    output = pd.read_orc("data_no_index.orc")

    print("\nLoaded DataFrame with custom index as a column:")
    print(output)

When we run the above program, it produces the following result −

    Original DataFrame:
        Col1  Col2
    r1     1     a
    r2     2     b
    r3     3     c

    Loaded DataFrame with custom index as a column:
       index  Col1  Col2
    0     r1     1     a
    1     r2     2     b
    2     r3     3     c