Python Pandas - Introduction to Data Structures



Python Pandas Data Structures

Data structures in Pandas are designed to handle data efficiently. They organize and store data so that it can be accessed and modified with low memory and computational overhead. The Pandas library provides two primary data structures for handling and analyzing data −

  • Series
  • DataFrame

In general programming, the term "data structure" refers to a way of collecting, organizing, and storing data so that it can be accessed and modified efficiently. A data structure groups values of one or more data types and organizes them to make good use of memory.

Pandas is built on top of NumPy and integrates well within a scientific computing environment with many other third-party libraries. This tutorial will provide a detailed introduction to these data structures.
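As a small illustration of the NumPy connection, the sketch below converts a Series to a NumPy array and passes a Series directly to a NumPy function. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# A small Series of integers (illustrative values)
s = pd.Series([10, 20, 30])

# to_numpy() exposes the values as a NumPy array
values = s.to_numpy()
print(type(values))

# NumPy functions accept pandas objects directly
total = np.sum(s)
print(total)
```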

Dimension and Description of Pandas Data Structures

| Data Structure | Dimensions | Description |
|----------------|------------|-------------|
| Series         | 1          | A one-dimensional labeled homogeneous array, size-immutable. |
| DataFrame      | 2          | A two-dimensional labeled, size-mutable tabular structure with potentially heterogeneously typed columns. |

Working with arrays of two or more dimensions can be complex and time-consuming, because users need to keep track of the data's orientation when writing functions. Pandas reduces this mental effort. For example, when dealing with tabular data (a DataFrame), it is easier to think in terms of rows and columns than in terms of axis 0 and axis 1.
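The following sketch shows the two ways of naming an axis side by side. The DataFrame and its column names are made up for illustration; `axis='index'` and `axis='columns'` are the named equivalents of `axis=0` and `axis=1`.

```python
import pandas as pd

# Illustrative data: two numeric columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis='index' (same as axis=0) sums down the rows: one total per column
col_totals = df.sum(axis='index')

# axis='columns' (same as axis=1) sums across the columns: one total per row
row_totals = df.sum(axis='columns')

print(col_totals)
print(row_totals)
```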

Mutability of Pandas Data Structures

All Pandas data structures are value mutable, meaning their contents can be changed. However, their size mutability varies −

  • Series − Size immutable.
  • DataFrame − Size mutable.
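A minimal sketch of the two kinds of mutability, using made-up data: the first part replaces a value in an existing Series, and the second inserts and then deletes a column in a DataFrame, which changes its shape.

```python
import pandas as pd

# Value mutability: replace an existing value in a Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s['b'] = 20
print(s)

# Size mutability: insert and delete columns in a DataFrame
df = pd.DataFrame({'x': [1, 2, 3]})
shape_before = df.shape

df['y'] = df['x'] * 2      # insert a new column
shape_after = df.shape

del df['x']                # delete a column
print(shape_before, shape_after, list(df.columns))
```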

Series

A Series is a one-dimensional labeled array that can hold any data type. It can store integers, strings, floating-point numbers, etc. Each value in a Series is associated with a label (index), which can be an integer or a string.

| Name   | Steve |
| Age    | 35    |
| Gender | Male  |
| Rating | 3.5   |

Example

Consider the following Series, which stores the values above as strings under the labels Name, Age, Gender, and Rating −

```python
import pandas as pd

data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)
```

On executing the above program, you will get the following output

```
Name      Steve
Age          35
Gender     Male
Rating      3.5
dtype: object
```
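Because each value is associated with a label, individual values can be looked up by that label. The sketch below rebuilds the same Series and reads values by label and by position; the variable names `name`, `age`, and `first` are only illustrative.

```python
import pandas as pd

series = pd.Series(['Steve', '35', 'Male', '3.5'],
                   index=['Name', 'Age', 'Gender', 'Rating'])

name = series['Name']      # look up by label
age = series.loc['Age']    # .loc is the explicit label-based accessor
first = series.iloc[0]     # .iloc looks up by integer position

print(name, age, first)
print(type(age))           # the age was stored as text, so this is str
```

Note that the age comes back as the string '35', not the number 35, because every value in this Series was entered as a string.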

Key Points

Following are the key points related to the Pandas Series.

  • Homogeneous data
  • Size Immutable
  • Values of Data Mutable
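To see what "homogeneous" means in practice, the sketch below checks the `dtype` of two made-up Series. A Series has a single dtype for all of its values; when the values are of mixed Python types, pandas falls back to the generic `object` dtype.

```python
import pandas as pd

# All integers: the Series gets an integer dtype
ints = pd.Series([1, 2, 3])

# Mixed Python types: the Series falls back to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])

print(ints.dtype)
print(mixed.dtype)
```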

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns that can hold different data types. It is similar to a table in a database or a spreadsheet. Consider the following data representing the performance rating of a sales team −

| Name  | Age | Gender | Rating |
|-------|-----|--------|--------|
| Steve | 32  | Male   | 3.45   |
| Lia   | 28  | Female | 4.6    |
| Vin   | 45  | Male   | 3.9    |
| Katie | 38  | Female | 2.78   |

Example

The above tabular data can be represented in a DataFrame as follows −

```python
import pandas as pd

# Data represented as a dictionary
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```

Output

On executing the above code you will get the following output −

```
    Name  Age  Gender  Rating
0  Steve   32    Male    3.45
1    Lia   28  Female    4.60
2    Vin   45    Male    3.90
3  Katie   38  Female    2.78
```

Key Points

Following are the key points related to the Pandas DataFrame −

  • Heterogeneous data
  • Size Mutable
  • Data Mutable
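"Heterogeneous" here means that each column can have its own data type. A minimal sketch, using a shortened version of the sales data, inspects the per-column types with `dtypes` and picks out the numeric columns with `select_dtypes`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Steve', 'Lia'],
    'Age': [32, 28],
    'Rating': [3.45, 4.6],
})

# Each column carries its own dtype
print(df.dtypes)

# Keep only the names of the numeric columns
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```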

Purpose of Using More Than One Data Structure

Pandas data structures are flexible containers for lower-dimensional data. For instance, a DataFrame is a container for Series, and a Series is a container for scalars. Objects can be inserted into and removed from these containers in a dictionary-like fashion.

Building and handling multi-dimensional arrays can be tedious and requires careful consideration of the data's orientation when writing functions. Pandas reduces this mental effort by providing intuitive data structures.

Example

The following example selects a single column from a DataFrame. The column is returned as a Series.

```python
import pandas as pd

# Data represented as a dictionary
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Display a Series within a DataFrame
print(df['Name'])
```

Output

On executing the above code you will get the following output −

```
0    Steve
1      Lia
2      Vin
3    Katie
Name: Name, dtype: object
```
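The container relationship can be followed one level further. In this sketch, on a shortened version of the same data, a column is pulled out of the DataFrame as a Series and a single scalar is then pulled out of that Series. The last lines add and remove a column with dictionary-style operations; the `Age` values are reused from the table above.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Steve', 'Lia'], 'Rating': [3.45, 4.6]})

# A DataFrame column is a Series
column = df['Rating']
print(type(column))

# A Series element is a scalar
value = column.loc[0]
print(value)

# Columns can be added and removed much like dictionary entries
df['Age'] = [32, 28]
popped = df.pop('Age')
print(list(df.columns))
```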