Python Pandas - Stata Format



Stata is a widely used statistical software package for data analysis and visualization, developed by StataCorp. Its native file format uses the .dta extension and stores datasets. The Pandas library provides functions for reading from and writing to Stata data files, which makes it easy to exchange data between Pandas and Stata.

In this tutorial, we will learn how to use the read_stata() function and the DataFrame.to_stata() method to work with the Stata file format.

Exporting DataFrame to Stata Format

The Pandas library provides the DataFrame.to_stata() method for exporting the data in a DataFrame to a .dta file. By default, the file is written in Stata format version 114, which can be read by Stata 10 and later; the version parameter selects a newer format such as 117, 118, or 119.

Example

Here is a basic example that exports the data in a Pandas DataFrame to a Stata file. The index is left out of the file by setting write_index=False in the DataFrame.to_stata() method.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "a": list("abc"),
        "b": list(range(1, 4)),
        "c": np.arange(3, 6).astype("u1"),
        "d": np.arange(4.0, 7.0),
        "e": [True, False, True],
        "f": pd.Categorical(list("abc")),
        "g": pd.date_range("20250101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to a Stata file without the index
    df.to_stata("stata_file.dta", write_index=False)

    print("\nDataFrame has been successfully exported to a Stata file.")

When we run the above program, it produces the following result −

    Original DataFrame:
       a  b  c    d      e  f          g
    0  a  1  3  4.0   True  a 2025-01-01
    1  b  2  4  5.0  False  b 2025-01-02
    2  c  3  5  6.0   True  c 2025-01-03

    DataFrame has been successfully exported to a Stata file.

If you visit the folder where the Stata file was saved, you can see the generated .dta file.
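Besides write_index, to_stata() accepts parameters that control the format version and the metadata stored in the file. The following is a minimal sketch, not a definitive recipe: the file name labeled_file.dta, the column names, and the label texts are hypothetical, and it assumes a pandas release that supports version=118. It writes a dataset label and variable labels, then reads them back through a StataReader obtained with iterator=True.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"height": [1.62, 1.75, 1.80], "age": [31, 45, 28]})

# Write into a temporary directory so the sketch leaves no stray files
path = os.path.join(tempfile.mkdtemp(), "labeled_file.dta")

df.to_stata(
    path,
    write_index=False,
    version=118,  # newer format than the default
    data_label="Sample measurements",
    variable_labels={"height": "Height in metres", "age": "Age in years"},
)

# iterator=True returns a StataReader, which exposes the stored metadata
with pd.read_stata(path, iterator=True) as reader:
    data_label = reader.data_label
    variable_labels = reader.variable_labels()

print(data_label)
print(variable_labels)
```

Labels written this way are visible in Stata itself, so they are a convenient place to document units or variable meanings.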

Data Type Limitations of Stata Format

Stata data files have specific restrictions on data types that can be stored −

  • Supported data types: int8, int16, int32, float32, float64 and strings with 244 or fewer characters.

  • Missing values: Missing values in floating-point columns are stored as the basic missing data type ("." in Stata). Missing values in integer columns cannot be exported to the Stata format.

  • Automatic Data Type Conversion: If a column's values fall outside the range Stata allows for its type, the column is cast to a larger type. For example, Stata restricts int8 to values between -127 and 100, so an int8 column with values outside this range is converted to int16.

  • Unsupported Data Types: Data types such as int64, bool, uint8, uint16, and uint32 are handled by casting them to the smallest supported type that can represent the data.

  • Precision Warning: Converting "int64" to "float64" may result in a loss of precision for values larger than 2**53.

  • String Limitation: With the default file format version, strings longer than 244 characters are not supported. Attempting to write such data raises a ValueError.
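The limitations listed above can be observed with a round trip. The following is a minimal sketch, assuming the default format version; the column names and file names are hypothetical. It writes unsupported dtypes, prints the dtypes that come back, and then tries to write a 300-character string.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_uint8": np.array([1, 2, 3], dtype="uint8"),    # fits in Stata's int8 range
    "large_uint8": np.array([1, 2, 200], dtype="uint8"),  # 200 exceeds the int8 limit
    "flag": [True, False, True],                          # bool is not a Stata type
    "n_items": np.array([10, 20, 30], dtype="int64"),     # int64 is not a Stata type
})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dtypes.dta")
df.to_stata(path, write_index=False)
roundtrip = pd.read_stata(path)

print("dtypes before writing:")
print(df.dtypes)
print("\ndtypes after reading back:")
print(roundtrip.dtypes)

# A string longer than 244 characters is rejected with the default version
long_text = pd.DataFrame({"text": ["x" * 300]})
try:
    long_text.to_stata(os.path.join(tmp_dir, "long_text.dta"), write_index=False)
    long_string_rejected = False
except ValueError as exc:
    long_string_rejected = True
    print("\nValueError:", exc)
```

Comparing the two dtype listings shows which columns were cast on the way out, which is worth checking before relying on exact integer widths downstream.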

Importing Data from Stata Format

Pandas provides the read_stata() function to import data from .dta files. It returns a Pandas DataFrame or, when chunksize or iterator=True is specified, a pandas.api.typing.StataReader object for incremental reading.

Example

The following example demonstrates how to read a dataset from a Stata file using the read_stata() function.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "a": list("abc"),
        "b": list(range(1, 4)),
        "c": np.arange(3, 6).astype("u1"),
        "d": np.arange(4.0, 7.0),
        "e": [True, False, True],
        "f": pd.Categorical(list("abc")),
        "g": pd.date_range("20250101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to a Stata file without the index
    df.to_stata("stata_file.dta", write_index=False)

    # Read the Stata file back into a DataFrame
    result = pd.read_stata("stata_file.dta")

    print("\nDataFrame read back from the Stata file:")
    print(result)

While executing the above code we get the following output −

    Original DataFrame:
       a  b  c    d      e  f          g
    0  a  1  3  4.0   True  a 2025-01-01
    1  b  2  4  5.0  False  b 2025-01-02
    2  c  3  5  6.0   True  c 2025-01-03

    DataFrame read back from the Stata file:
       a  b  c    d  e  f          g
    0  a  1  3  4.0  1  a 2025-01-01
    1  b  2  4  5.0  0  b 2025-01-02
    2  c  3  5  6.0  1  c 2025-01-03

Note that the boolean column "e" comes back as integers, because Stata has no boolean type.
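read_stata() can also load only part of a file and use one of its columns as the index. The following is a minimal sketch, assuming hypothetical column names (id, name, score) and a hypothetical file name; it relies on the columns and index_col parameters of read_stata().

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["x", "y", "z"],
    "score": [4.0, 5.0, 6.0],
})

# Write into a temporary directory so the sketch leaves no stray files
path = os.path.join(tempfile.mkdtemp(), "subset.dta")
df.to_stata(path, write_index=False)

# Keep only two columns and promote "id" to the DataFrame index.
# The index column must be among the selected columns.
subset = pd.read_stata(path, columns=["id", "score"], index_col="id")
print(subset)
```

Selecting columns at read time avoids loading variables that are not needed, which helps with wide datasets.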

Reading Data from Stata File in Chunks

For large datasets, you can use the chunksize parameter of the read_stata() method to create a pandas.api.typing.StataReader instance to read the file incrementally. This StataReader object can be used as an iterator.

Example

The following example reads a Stata file one row at a time by passing chunksize=1 to read_stata() and iterating over the returned reader. Each chunk is a DataFrame.

    import pandas as pd
    import numpy as np

    # Create a sample DataFrame
    df = pd.DataFrame({
        "a": list("abc"),
        "b": list(range(1, 4)),
        "c": np.arange(3, 6).astype("u1"),
        "d": np.arange(4.0, 7.0),
        "e": [True, False, True],
        "f": pd.Categorical(list("abc")),
        "g": pd.date_range("20250101", periods=3)
    })

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to a Stata file without the index
    df.to_stata("stata_file.dta", write_index=False)

    print("\nOutput from a Stata file in chunks:")

    # Read the Stata file one row per chunk
    with pd.read_stata("stata_file.dta", chunksize=1) as reader:
        for chunk in reader:
            print(chunk.shape)

Following is the output of the above code −

    Original DataFrame:
       a  b  c    d      e  f          g
    0  a  1  3  4.0   True  a 2025-01-01
    1  b  2  4  5.0  False  b 2025-01-02
    2  c  3  5  6.0   True  c 2025-01-03

    Output from a Stata file in chunks:
    (1, 7)
    (1, 7)
    (1, 7)
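Chunked reading is most useful when each chunk is reduced to a small result before the next one is loaded. The following is a minimal sketch of that pattern, assuming a hypothetical single-column file named chunked.dta; it accumulates a running sum and row count without holding the whole dataset in memory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dataset: ten float values, 0.0 through 9.0
df = pd.DataFrame({"value": np.arange(10, dtype="float64")})
path = os.path.join(tempfile.mkdtemp(), "chunked.dta")
df.to_stata(path, write_index=False)

total = 0.0
rows_seen = 0
chunk_sizes = []

# Each chunk is a DataFrame with at most four rows
with pd.read_stata(path, chunksize=4) as reader:
    for chunk in reader:
        total += chunk["value"].sum()
        rows_seen += len(chunk)
        chunk_sizes.append(len(chunk))

print("Sum:", total)
print("Rows:", rows_seen)
print("Chunk sizes:", chunk_sizes)
```

The last chunk is shorter than the others whenever the row count is not a multiple of chunksize, so per-chunk logic should not assume a fixed length.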

Handling Categorical Data in Stata Files

Pandas supports exporting and importing categorical data with value labels in Stata format. Stata supports only string value labels. Non-string categories are converted to strings, potentially resulting in a loss of information.

When a Stata file is imported into a Pandas DataFrame, variables with value labels are converted to Pandas Categorical columns by default. Pass convert_categoricals=False to import the underlying integer codes instead of the labels.

Example

This example demonstrates how categorical data in a DataFrame can be saved to a Stata file and read back without converting the categorical values to their string labels.

    import pandas as pd

    # Create a sample DataFrame with two categorical columns
    df = pd.DataFrame({
        "Col_a": pd.Categorical(list("abcdef")),
        "Col_b": list("aeioou")},
        dtype="category")

    print("Original DataFrame:")
    print(df)

    # Save the DataFrame to a Stata file without the index
    df.to_stata("stata_file.dta", write_index=False)

    # Read the Stata file back, keeping the integer codes instead of the labels
    result = pd.read_stata("stata_file.dta", convert_categoricals=False)

    print("\nDataFrame read from Stata file without converting categoricals:")
    print(result)

Following is the output of the above code −

    Original DataFrame:
      Col_a Col_b
    0     a     a
    1     b     e
    2     c     i
    3     d     o
    4     e     o
    5     f     u

    DataFrame read from Stata file without converting categoricals:
       Col_a  Col_b
    0      0      0
    1      1      1
    2      2      2
    3      3      3
    4      4      3
    5      5      4
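For comparison, the following minimal sketch reads the same kind of file twice, once with the default settings and once with convert_categoricals=False. The file name categories.dta is hypothetical. With the defaults the labels come back as a Categorical column, while the second read returns the integer codes.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Col_b": pd.Categorical(list("aeioou"))})
path = os.path.join(tempfile.mkdtemp(), "categories.dta")
df.to_stata(path, write_index=False)

# Default: value labels are restored as a Categorical column
labeled = pd.read_stata(path)

# Opt out: keep the integer codes stored in the file
codes = pd.read_stata(path, convert_categoricals=False)

print(labeled["Col_b"].dtype)
print(labeled["Col_b"].astype(str).tolist())
print(codes["Col_b"].tolist())
```

Reading the codes is helpful when the labels are not needed or when the numeric values themselves carry meaning in the original Stata dataset.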