
Python Pandas - ORC Format

ORC Format in Pandas

ORC (Optimized Row Columnar) is a binary, columnar file format designed for compact storage and fast reading of tabular data. Like the Parquet format, it is a standardized open-source format, so DataFrames written as ORC can be shared easily with tools and languages outside Python.

Pandas supports the ORC format through the read_orc() function and the DataFrame.to_orc() method, which read and write ORC data respectively. In this tutorial, we will learn how to work with the ORC format using Python's Pandas library.

Important Considerations

When working with the ORC format in Pandas, keep the following key points in mind −

Installation requirements: Both reading and writing ORC files in Pandas require the pyarrow library.
Timezone handling: Timezones in datetime columns are not preserved when a DataFrame is saved in ORC format.
Platform support: read_orc() and to_orc() are not supported on the Windows operating system as of now.
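
Because timezone information is lost on save, one possible workaround is to convert timezone-aware columns to naive UTC values before writing. The following is a minimal sketch that uses only Pandas; the helper name to_naive_utc and the sample data are illustrative assumptions, not part of the Pandas API.

```python
import pandas as pd


def to_naive_utc(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which tz-aware datetime columns hold naive UTC values.

    Hypothetical helper for illustration; it is not provided by Pandas.
    """
    out = df.copy()
    for name in out.columns:
        if isinstance(out[name].dtype, pd.DatetimeTZDtype):
            # Convert to UTC first, then drop the timezone information
            out[name] = out[name].dt.tz_convert("UTC").dt.tz_localize(None)
    return out


# Sample data with a timezone-aware column (assumed values)
df = pd.DataFrame({
    "event": ["start", "stop"],
    "when": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 17:30"]
    ).tz_localize("Asia/Kolkata"),
})

naive = to_naive_utc(df)
print(naive.dtypes)
print(naive)
```

Storing every timestamp in UTC makes the lost timezone explicit, so the original local time can be recovered after loading if the zone is recorded elsewhere.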

Saving a Pandas DataFrame to an ORC File

To save a Pandas DataFrame to a file in ORC format, use the DataFrame.to_orc() method.

Note: Before saving or reading an ORC file, you need to install the pyarrow library. Installing it with conda is highly recommended to avoid compatibility issues. Use the following command −
    conda install pyarrow
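
Before calling the ORC functions, it can help to confirm that pyarrow is importable in the current environment. This is a small sketch using only the standard library; the function name orc_environment_report is a hypothetical helper introduced here for illustration.

```python
import importlib.util
import sys


def orc_environment_report() -> dict:
    """Collect basic facts that affect ORC support in Pandas.

    Hypothetical helper for illustration; it is not part of Pandas.
    """
    return {
        # find_spec() returns None when the package cannot be located
        "pyarrow_installed": importlib.util.find_spec("pyarrow") is not None,
        "platform": sys.platform,
    }


report = orc_environment_report()
print(report)

if not report["pyarrow_installed"]:
    print("pyarrow is missing; install it before using to_orc() or read_orc().")
```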

Example

This example shows how to save a DataFrame in ORC format using the DataFrame.to_orc() method.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")
    print("\nDataFrame is successfully saved as an ORC file.")

When we run the above program, it produces the following result −

 Original DataFrame:

	Col1	Col2	Col3	Col4	Col5
0	a	1	4.0	True	2024-01-01
1	b	2	5.0	False	2024-01-02
2	c	3	6.0	True	2024-01-03

DataFrame is successfully saved as an ORC file.

If you visit the folder where the ORC files are saved, you can observe the generated ORC file.

Reading Data from an ORC File

The Pandas read_orc() function reads data from an ORC file and loads it into a Pandas DataFrame. It can read ORC files from a variety of storage backends, including local files, URLs, and cloud storage services.

Example

This example reads a DataFrame back from an ORC file using the Pandas read_orc() function.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load the ORC file
    result = pd.read_orc("df_orc_file.orc")

    # Display the DataFrame
    print('Loaded DataFrame:')
    print(result)

    # Verify data types
    print("\nData Type of the each column:")
    print(result.dtypes)

While executing the above code, we get the following output −

 Loaded DataFrame:

	Col1	Col2	Col3	Col4	Col5
0	a	1	4.0	True	2024-01-01
1	b	2	5.0	False	2024-01-02
2	c	3	6.0	True	2024-01-03

Data Type of the each column:
Col1            object
Col2             int64
Col3           float64
Col4              bool
Col5    datetime64[ns]
dtype: object

Reading Specific Columns from an ORC File

You can specify which columns to load from the ORC file using the columns parameter of the read_orc() method.

Example

This example shows how to load specific columns from an ORC file using the columns parameter of the read_orc() function. Here, only the "Col1", "Col4", and "Col5" columns are loaded.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load the ORC file, reading only the selected columns
    result = pd.read_orc("df_orc_file.orc", columns=["Col1", "Col4", "Col5"])

    print("\nDataFrame with Selected Columns:")
    print(result)

When we run the above program, it produces the following result −

 Original DataFrame:

	Col1	Col2	Col3	Col4	Col5
0	a	1	4.0	True	2024-01-01
1	b	2	5.0	False	2024-01-02
2	c	3	6.0	True	2024-01-03

DataFrame with Selected Columns:

	Col1	Col4	Col5
0	a	True	2024-01-01
1	b	False	2024-01-02
2	c	True	2024-01-03

Using In-Memory Buffers for ORC Files

ORC data can be written to and read from memory buffers when disk I/O is not desirable. An in-memory buffer in Python stores the data in RAM rather than on disk, which is useful for temporary data processing.

Example

This example demonstrates writing a DataFrame to ORC format in memory and reading it back, using the DataFrame.to_orc() method and the read_orc() function with an io.BytesIO buffer.

    import pandas as pd
    import io

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to an in-memory buffer
    buffer = io.BytesIO()
    df.to_orc(buffer)

    # Rewind the buffer before reading from it
    buffer.seek(0)

    # Read the ORC data from the buffer
    loaded_df = pd.read_orc(buffer)

    print("\nDataFrame loaded from In-Memory ORC file:")
    print(loaded_df)

Following is the output of the above code −

 Original DataFrame:

	Col_1	Col_2
0	0	5
1	1	6
2	2	7
3	3	8
4	4	9

DataFrame loaded from In-Memory ORC file:

	Col_1	Col_2
0	0	5
1	1	6
2	2	7
3	3	8
4	4	9

Handling Indexes in ORC Files

The ORC format does not support serializing a non-default index; attempting to save one can raise a ValueError. To preserve the index data, reset the index with the reset_index() method before saving, so that the index is written to the output file as a regular column.

Example

This example uses the reset_index() method to store a custom index as a regular column before saving the DataFrame to an ORC file.

    import pandas as pd

    # Create a DataFrame with an index
    df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]}, index=["r1", "r2", "r3"])

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame with its index as a column
    df.reset_index().to_orc("data_no_index.orc")

    # Read the file
    output = pd.read_orc("data_no_index.orc")

    print("\nLoaded DataFrame with custom index as a column:")
    print(output)

When we run the above program, it produces the following result −

 Original DataFrame:

	Col1	Col2
r1	1	a
r2	2	b
r3	3	c

Loaded DataFrame with custom index as a column:

	index	Col1	Col2
0	r1	1	a
1	r2	2	b
2	r3	3	c
