
Python Pandas - Stata Format
Stata is a widely used statistical software package for data analysis and visualization, developed by StataCorp. Its native dataset format uses the .dta file extension. The Pandas library provides easy-to-use functionality for reading and writing Stata data files, which makes it simple to exchange data between Pandas and Stata.
In this tutorial, we will learn how to use the read_stata() function and the DataFrame.to_stata() method to work with the Stata file format.
Exporting DataFrame to Stata Format
The Pandas library provides the DataFrame.to_stata() method for exporting the data in a DataFrame object to a .dta file. By default, it writes Stata file format version 114, which can be read by Stata 10 and later. Newer format versions can be requested with the version parameter.
Example
Here is a basic example demonstrating how to export the data in a Pandas DataFrame object to a Stata file. The DataFrame is written without its index by setting write_index=False in the DataFrame.to_stata() method.
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    "a": list("abc"),
    "b": list(range(1, 4)),
    "c": np.arange(3, 6).astype("u1"),
    "d": np.arange(4.0, 7.0),
    "e": [True, False, True],
    "f": pd.Categorical(list("abc")),
    "g": pd.date_range("20250101", periods=3)
})

print("Original DataFrame:")
print(df)

# Save the DataFrame to a Stata file without writing the index
df.to_stata("stata_file.dta", write_index=False)

print("\nDataFrame has been successfully exported to a Stata file.")
When we run the above program, it produces the following result −
Original DataFrame:
| | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 3 | 4.0 | True | a | 2025-01-01 |
| 1 | b | 2 | 4 | 5.0 | False | b | 2025-01-02 |
| 2 | c | 3 | 5 | 6.0 | True | c | 2025-01-03 |

DataFrame has been successfully exported to a Stata file.
If you visit the folder where the Stata dataset files are saved, you can observe the generated .dta file.
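To make the effect of write_index easier to see, here is a minimal sketch that writes the same small DataFrame twice and reads both files back. The column names, file names, and the use of a temporary directory are illustrative assumptions and are not part of the example above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data used only for this illustration
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp": [4.5, 19.0, 27.5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index_path = os.path.join(tmp, "with_index.dta")
    no_index_path = os.path.join(tmp, "no_index.dta")

    # Default behaviour (write_index=True): the index is written as an extra column
    df.to_stata(with_index_path)

    # write_index=False: only the data columns are written
    df.to_stata(no_index_path, write_index=False)

    # Read both files back while the temporary directory still exists
    with_index = pd.read_stata(with_index_path)
    no_index = pd.read_stata(no_index_path)

print(list(with_index.columns))
print(list(no_index.columns))
```

The first list should contain an additional "index" column, which is how the unnamed default index is stored when write_index is left at its default.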
Data Type Limitations of Stata Format
Stata data files have specific restrictions on data types that can be stored −
Supported data types: int8, int16, int32, float32, float64 and strings with 244 or fewer characters.
Missing values: NaN values in floating-point columns are stored as Stata's basic missing value ("."). Missing values in integer columns cannot be exported to the Stata format.
Automatic Data Type Conversion: If a column contains a non-missing value outside the range that Stata permits for its data type, the column is retyped to the next larger size. For example, Stata restricts int8 values to the range -127 to 100, so a column containing values above 100 is converted to int16.
Unsupported Data Types: Data types such as int64, bool, uint8, uint16, and uint32 are handled gracefully by casting them to the smallest supported type that can represent the data.
Precision Warning: Converting "int64" to "float64" may result in a loss of precision for values larger than 2**53.
String Limitation: Fixed-width strings are limited to 244 characters in the default file format. Attempting to write longer strings raises a ValueError.
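The casting rules listed above can be observed by comparing column dtypes before and after a round trip. The following is a minimal sketch under stated assumptions: the column names and the file name are hypothetical, and the exact dtypes read back may vary between Pandas versions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical columns chosen to exercise the casting rules
df = pd.DataFrame({
    "flag": np.array([True, False, True]),                    # bool is not a Stata type
    "small_u8": np.array([1, 50, 100], dtype="uint8"),        # fits Stata's int8 range
    "big_u8": np.array([1, 50, 200], dtype="uint8"),          # 200 is outside the int8 range
    "n_items": np.array([10, 20, 30], dtype="int64"),         # int64 is not a Stata type
    "score": np.array([1.5, np.nan, 3.5], dtype="float64"),   # NaN is stored as "." in Stata
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dtypes.dta")
    df.to_stata(path, write_index=False)
    result = pd.read_stata(path)

# Show the dtype of each column before writing and after reading back
comparison = pd.DataFrame({
    "before": df.dtypes.astype(str),
    "after": result.dtypes.astype(str),
})
print(comparison)

# The floating-point missing value survives the round trip as NaN
print(result["score"].isna().tolist())
```

Comparing the "before" and "after" columns shows which types were changed on export, for instance the uint8 column with a value of 200 no longer fitting into int8.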
Importing Data from Stata Format
Pandas provides the read_stata() function to import data from .dta files. It returns a Pandas DataFrame, or a pandas.api.typing.StataReader object for incremental reading when the chunksize or iterator parameter is specified.
Example
The following example demonstrates how to read a dataset from a Stata file using the read_stata() function.
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    "a": list("abc"),
    "b": list(range(1, 4)),
    "c": np.arange(3, 6).astype("u1"),
    "d": np.arange(4.0, 7.0),
    "e": [True, False, True],
    "f": pd.Categorical(list("abc")),
    "g": pd.date_range("20250101", periods=3)
})

print("Original DataFrame:")
print(df)

# Save the DataFrame to a Stata file without writing the index
df.to_stata("stata_file.dta", write_index=False)

# Read the Stata file back into a DataFrame
result = pd.read_stata("stata_file.dta")

print("\nDataFrame read from Stata file:")
print(result)
While executing the above code we get the following output. Note that the boolean column "e" is read back as integers, because Stata has no boolean type −
Original DataFrame:
| | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 3 | 4.0 | True | a | 2025-01-01 |
| 1 | b | 2 | 4 | 5.0 | False | b | 2025-01-02 |
| 2 | c | 3 | 5 | 6.0 | True | c | 2025-01-03 |

DataFrame read from Stata file:
| | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 3 | 4.0 | 1 | a | 2025-01-01 |
| 1 | b | 2 | 4 | 5.0 | 0 | b | 2025-01-02 |
| 2 | c | 3 | 5 | 6.0 | 1 | c | 2025-01-03 |
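The read_stata() function also accepts the columns and index_col parameters, which can be used to load only part of a file and to set one column as the index. The sketch below is an illustration under assumptions: the data, column names, and file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data used only for this illustration
df = pd.DataFrame({
    "name": ["Al", "Bo", "Cy"],
    "age": [31, 45, 28],
    "height": [1.70, 1.82, 1.65],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.dta")
    df.to_stata(path, write_index=False)

    # Load only two of the three columns and use "name" as the index.
    # The index column has to be among the selected columns.
    subset = pd.read_stata(path, columns=["name", "height"], index_col="name")

print(subset)
```

Selecting columns at read time avoids building a full DataFrame when only a few variables of a wide dataset are needed.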
Reading Data from Stata File in Chunks
For large datasets, you can use the chunksize parameter of the read_stata() method to create a pandas.api.typing.StataReader instance to read the file incrementally. This StataReader object can be used as an iterator.
Example
The following example demonstrates how to use the read_stata() method for reading the Stata data into Pandas DataFrame as an iterator by specifying the value of the chunksize parameter.
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    "a": list("abc"),
    "b": list(range(1, 4)),
    "c": np.arange(3, 6).astype("u1"),
    "d": np.arange(4.0, 7.0),
    "e": [True, False, True],
    "f": pd.Categorical(list("abc")),
    "g": pd.date_range("20250101", periods=3)
})

print("Original DataFrame:")
print(df)

# Save the DataFrame to a Stata file without writing the index
df.to_stata("stata_file.dta", write_index=False)

print("\nOutput from a Stata file in chunks:")

# Read the Stata file one row at a time and print the shape of each chunk
with pd.read_stata("stata_file.dta", chunksize=1) as reader:
    for chunk in reader:
        print(chunk.shape)
Following is the output of the above code −
Original DataFrame:
| | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| 0 | a | 1 | 3 | 4.0 | True | a | 2025-01-01 |
| 1 | b | 2 | 4 | 5.0 | False | b | 2025-01-02 |
| 2 | c | 3 | 5 | 6.0 | True | c | 2025-01-03 |

Output from a Stata file in chunks:
(1, 7)
(1, 7)
(1, 7)
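Chunked reading is mostly useful when a result can be accumulated without holding the whole file in memory. As a minimal sketch, assuming a hypothetical file with a numeric "value" column, a running total can be computed chunk by chunk:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dataset with ten rows, used only for this illustration
df = pd.DataFrame({
    "id": np.arange(10),
    "value": np.arange(10, dtype="float64") * 1.5,
})

rows_seen = 0
total = 0.0
chunk_sizes = []

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.dta")
    df.to_stata(path, write_index=False)

    # Each iteration yields a DataFrame with at most four rows
    with pd.read_stata(path, chunksize=4) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            rows_seen += len(chunk)
            total += chunk["value"].sum()

print("Chunk sizes:", chunk_sizes)
print("Rows processed:", rows_seen)
print("Sum of value column:", total)
```

Only one chunk is held in memory at a time, so the same loop structure applies to files that are too large to load at once.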
Handling Categorical Data in Stata Files
Pandas supports exporting and importing categorical data as value-labeled data in the Stata format. On export, the integer category codes are written as the data values and the categories are written as value labels. Stata supports only string value labels, so non-string categories are converted to strings, which can lose information if the string representations are not unique.
When value-labeled data is imported from a Stata file, it is converted to a Pandas Categorical by default, with the value labels as categories. You can pass convert_categoricals=False to import the underlying integer codes instead.
Example
This example demonstrates how categorical data in a DataFrame can be saved to a Stata file and read back without converting the categorical values to their string labels.
import pandas as pd

# Create a sample DataFrame with categorical columns
df = pd.DataFrame({
    "Col_a": pd.Categorical(list("abcdef")),
    "Col_b": list("aeioou")}, dtype="category")

print("Original DataFrame:")
print(df)

# Save the DataFrame to a Stata file without writing the index
df.to_stata("stata_file.dta", write_index=False)

# Read the Stata file, keeping the integer codes instead of the labels
result = pd.read_stata("stata_file.dta", convert_categoricals=False)

print("\nDataFrame read from Stata file without converting categoricals:")
print(result)
Following is the output of the above code −
Original DataFrame:
| | Col_a | Col_b |
|---|---|---|
| 0 | a | a |
| 1 | b | e |
| 2 | c | i |
| 3 | d | o |
| 4 | e | o |
| 5 | f | u |

DataFrame read from Stata file without converting categoricals:
| | Col_a | Col_b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 3 |
| 5 | 5 | 4 |