
Python Pandas - ORC Format
ORC Format in Pandas
ORC (Optimized Row Columnar) is a binary, columnar file format designed to store and read DataFrames efficiently. Like Parquet, it is a standardized, open-source columnar storage format, which keeps storage compact, makes reads and writes fast, and allows data to be shared easily across different analysis languages.
Pandas supports the ORC format through the read_orc() function and the DataFrame.to_orc() method, which read and write ORC data respectively. In this tutorial, we will learn how to work with the ORC format using Python's Pandas library.
Important Considerations
When working with the ORC format in Pandas, keep the following key points in mind −
- Installation requirements: Both reading and writing ORC files in Pandas require the pyarrow library.
- Timezone handling: Timezones in datetime columns are not preserved when a DataFrame is saved in ORC format.
- Platform support: At the time of writing, read_orc() and to_orc() are not supported on the Windows operating system.
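Because timezones are not preserved, one possible workaround is to convert timezone-aware columns to UTC and drop the timezone before saving, then restore it after loading. The following is a minimal sketch of that idea, not an official recipe. The column names and the fixed UTC+05:30 offset are assumptions chosen for illustration, and the actual to_orc() and read_orc() calls are left out so that the snippet runs without pyarrow.

```python
import datetime as dt
import pandas as pd

# A fixed UTC+05:30 offset, chosen only for illustration
tz = dt.timezone(dt.timedelta(hours=5, minutes=30))

df = pd.DataFrame({
    "event": ["start", "stop"],
    "when": pd.date_range("2024-01-01 09:00", periods=2, tz=tz),
})

# Before saving: convert to UTC, then drop the timezone information
to_save = df.copy()
to_save["when"] = to_save["when"].dt.tz_convert("UTC").dt.tz_localize(None)

# The ORC write and read steps would go here (omitted in this sketch)

# After loading: mark the values as UTC again and convert back
restored = to_save.copy()
restored["when"] = restored["when"].dt.tz_localize("UTC").dt.tz_convert(tz)

print(restored["when"])
```

With this approach, the timezone itself has to be recorded somewhere else (for example in your own code or configuration), since the file only holds the UTC values.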
Saving a Pandas DataFrame to an ORC File
To save a Pandas DataFrame to an ORC file, you can use the DataFrame.to_orc() method, which saves data of the Pandas DataFrame to a file in ORC format.
Note: Before saving or retrieving the data from an ORC file, you need to install the 'pyarrow' library. It is highly recommended to install it using the conda installer to avoid compatibility issues. Use the following command −
    conda install pyarrow
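If you are not sure whether pyarrow is installed in the current environment, you can check before calling the ORC methods. The snippet below is a small sketch that uses only the Python standard library. The helper name orc_support_available is hypothetical and is not part of Pandas.

```python
import importlib.util


def orc_support_available() -> bool:
    """Return True if the pyarrow package can be found in this environment."""
    return importlib.util.find_spec("pyarrow") is not None


if orc_support_available():
    print("pyarrow found: to_orc() and read_orc() should be usable.")
else:
    print("pyarrow is missing: install it before calling to_orc() or read_orc().")
```

This only checks that the package can be located. It does not verify that the installed version is recent enough for your Pandas version.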
Example
This example shows how to save a DataFrame in ORC format using the DataFrame.to_orc() method.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")
    print("\nDataFrame is successfully saved as an ORC file.")

When we run the above program, it produces the following result −
Original DataFrame:
| | Col1 | Col2 | Col3 | Col4 | Col5 |
|---|---|---|---|---|---|
| 0 | a | 1 | 4.0 | True | 2024-01-01 |
| 1 | b | 2 | 5.0 | False | 2024-01-02 |
| 2 | c | 3 | 6.0 | True | 2024-01-03 |

DataFrame is successfully saved as an ORC file.
If you visit the folder where the ORC files are saved, you can observe the generated ORC file.
Reading Data from an ORC File
The Pandas read_orc() method reads data from an ORC file and loads it into a Pandas DataFrame. This method allows reading ORC files from a variety of storage backends, including local files, URLs, and cloud storage services.
Example
This example reads the Pandas DataFrame from an ORC file using the Pandas read_orc() method.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load the ORC file
    result = pd.read_orc("df_orc_file.orc")

    # Display the DataFrame
    print("Loaded DataFrame:")
    print(result)

    # Verify data types
    print("\nData type of each column:")
    print(result.dtypes)

While executing the above code, we get the following output −
Loaded DataFrame:
| | Col1 | Col2 | Col3 | Col4 | Col5 |
|---|---|---|---|---|---|
| 0 | a | 1 | 4.0 | True | 2024-01-01 |
| 1 | b | 2 | 5.0 | False | 2024-01-02 |
| 2 | c | 3 | 6.0 | True | 2024-01-03 |
Reading Specific Columns from an ORC File
You can specify which columns to load from the ORC file using the columns parameter of the read_orc() method.
Example
This example shows how to load specific columns from an ORC file using the columns parameter of the read_orc() method. Here, only the "Col1", "Col4", and "Col5" columns are loaded.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "Col1": list("abc"),
        "Col2": list(range(1, 4)),
        "Col3": np.arange(4.0, 7.0),
        "Col4": [True, False, True],
        "Col5": pd.date_range("20240101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame as an ORC file
    df.to_orc("df_orc_file.orc", engine="pyarrow")

    # Load only the selected columns from the ORC file
    result = pd.read_orc("df_orc_file.orc", columns=["Col1", "Col4", "Col5"])

    print("\nDataFrame with Selected Columns:")
    print(result)

When we run the above program, it produces the following result −
Original DataFrame:
| | Col1 | Col2 | Col3 | Col4 | Col5 |
|---|---|---|---|---|---|
| 0 | a | 1 | 4.0 | True | 2024-01-01 |
| 1 | b | 2 | 5.0 | False | 2024-01-02 |
| 2 | c | 3 | 6.0 | True | 2024-01-03 |

DataFrame with Selected Columns:

| | Col1 | Col4 | Col5 |
|---|---|---|---|
| 0 | a | True | 2024-01-01 |
| 1 | b | False | 2024-01-02 |
| 2 | c | True | 2024-01-03 |
Using In-Memory Buffers for ORC Files
ORC data can be written to and read from memory buffers when disk I/O is not desirable. An in-memory buffer in Python stores data in RAM rather than on disk, which is useful for temporary data processing.
Example
This example writes a DataFrame in ORC format to an in-memory buffer and reads it back, using the DataFrame.to_orc() and read_orc() methods with the io.BytesIO class.

    import pandas as pd
    import io

    # Create a DataFrame
    df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to an in-memory buffer
    buffer = io.BytesIO()
    df.to_orc(buffer)

    # Rewind the buffer so that reading starts from the beginning
    buffer.seek(0)

    # Read the ORC data from the buffer
    loaded_df = pd.read_orc(buffer)

    print("\nDataFrame loaded from In-Memory ORC file:")
    print(loaded_df)

Following is the output of the above code −
Original DataFrame:
| | Col_1 | Col_2 |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |

DataFrame loaded from In-Memory ORC file:

| | Col_1 | Col_2 |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |
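As a variation on the buffer approach, to_orc() can also be called without a path, in which case it returns the ORC data as a bytes object. The sketch below assumes pyarrow is installed. The try/except block is an addition of this sketch so that the snippet still runs (and simply skips the round trip) when pyarrow is missing.

```python
import io
import pandas as pd

df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})

try:
    # With no path argument, to_orc() returns the serialized data as bytes
    orc_bytes = df.to_orc()
    # Wrap the bytes in a buffer to read them back into a DataFrame
    loaded_df = pd.read_orc(io.BytesIO(orc_bytes))
except ImportError:
    # pyarrow is not installed, so there is nothing to round-trip
    orc_bytes, loaded_df = None, None

if loaded_df is not None:
    print("Size of the ORC data in bytes:", len(orc_bytes))
    print(loaded_df)
```

Returning bytes can be convenient when the data is passed on to another API, such as a network call or a database field, instead of being written to disk.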
Handling Indexes in ORC Files
The to_orc() method does not support serializing a non-default index and raises a ValueError if the DataFrame has one. To preserve the index data, call the reset_index() method before saving, which stores the index as a regular column in the output file.
Example
This example uses the reset_index() method to turn a custom index into a column before saving the DataFrame to an ORC file.

    import pandas as pd

    # Create a DataFrame with a custom index
    df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]},
                      index=["r1", "r2", "r3"])

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame with its index as a column
    df.reset_index().to_orc("data_no_index.orc")

    # Read the file
    output = pd.read_orc("data_no_index.orc")

    print("\nLoaded DataFrame with custom index as a column:")
    print(output)

When we run the above program, it produces the following result −
Original DataFrame:
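| | Col1 | Col2 |
|---|---|---|
| r1 | 1 | a |
| r2 | 2 | b |
| r3 | 3 | c |

Loaded DataFrame with custom index as a column:

| | index | Col1 | Col2 |
|---|---|---|---|
| 0 | r1 | 1 | a |
| 1 | r2 | 2 | b |
| 2 | r3 | 3 | c |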
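After loading, the saved column can be turned back into the index with the set_index() method. The following sketch names the index before resetting it so that the column gets a meaningful label. The name "row_id" and the file name are arbitrary choices for illustration. The try/except block is an addition of this sketch so that the snippet still runs without pyarrow by skipping the file round trip.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3], "Col2": ["a", "b", "c"]},
                  index=["r1", "r2", "r3"])

# Name the index, then move it into a regular column called "row_id"
flat = df.rename_axis("row_id").reset_index()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_with_index_column.orc")
    try:
        flat.to_orc(path)
        loaded = pd.read_orc(path)
    except ImportError:
        # pyarrow is not installed: skip the file round trip
        loaded = flat.copy()

# Turn the saved column back into the index
restored = loaded.set_index("row_id")
print(restored)
```

Naming the index first avoids the generic "index" column label that reset_index() produces for an unnamed index.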