This notebook was prepared by Donne Martin. Source and license info is on GitHub.
Credits: The following are notes taken while working through Python for Data Analysis by Wes McKinney
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index.
Create a Series:
ser_1 = Series([1, 1, 2, -3, -5, 8, 13])
ser_1

0     1
1     1
2     2
3    -3
4    -5
5     8
6    13
dtype: int64
Get the array representation of a Series:
ser_1.values
array([ 1, 1, 2, -3, -5, 8, 13])
Index objects are immutable and hold the axis labels and metadata such as names and axis names.
Get the index of the Series:
ser_1.index
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')
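Immutability can be checked directly: assigning to an element of an Index raises a TypeError. A minimal sketch (the variable name `ser` is illustrative):

```python
from pandas import Series

ser = Series([1, 2, 3], index=['a', 'b', 'c'])
try:
    # Index objects are immutable, so item assignment fails
    ser.index[0] = 'z'
except TypeError as e:
    print('Index is immutable:', e)
```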
Create a Series with a custom index:
ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

a    1
b    1
c    2
d   -3
e   -5
dtype: int64
Get a value from a Series:
ser_2[4] == ser_2['e']
True
Get a set of values from a Series by passing in a list:
ser_2[['c', 'a', 'b']]

c    2
a    1
b    1
dtype: int64
Get values greater than 0:
ser_2[ser_2 > 0]

a    1
b    1
c    2
dtype: int64
Scalar multiply:
ser_2 * 2

a     2
b     2
c     4
d    -6
e   -10
dtype: int64
Apply a numpy math function:
import numpy as np
np.exp(ser_2)

a    2.718282
b    2.718282
c    7.389056
d    0.049787
e    0.006738
dtype: float64
A Series is like a fixed-length, ordered dict.
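The dict analogy holds for membership tests and conversion back to a plain dict. A small sketch (`to_dict` is a standard Series method; the variable name is illustrative):

```python
from pandas import Series

ser = Series([1, 1, 2], index=['a', 'b', 'c'])
print('b' in ser)     # membership tests the index, like dict keys
print(ser.to_dict())  # convert back to a plain dict
```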
Create a series by passing in a dict:
dict_1 = {'foo': 100, 'bar': 200, 'baz': 300}
ser_3 = Series(dict_1)
ser_3

bar    200
baz    300
foo    100
dtype: int64
Reorder a Series by passing in an index (labels not found are NaN):
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = Series(dict_1, index=index)
ser_4

foo    100
bar    200
baz    300
qux    NaN
dtype: float64
Check for NaN with the pandas method:
pd.isnull(ser_4)
foo    False
bar    False
baz    False
qux     True
dtype: bool
Check for NaN with the Series method:
ser_4.isnull()
foo    False
bar    False
baz    False
qux     True
dtype: bool
Series automatically aligns differently indexed data in arithmetic operations:
ser_3 + ser_4

bar    400
baz    600
foo    200
qux    NaN
dtype: float64
Name a Series:
ser_4.name='foobarbazqux'
Name a Series index:
ser_4.index.name='label'
ser_4
label
foo    100
bar    200
baz    300
qux    NaN
Name: foobarbazqux, dtype: float64
Rename a Series' index in place:
ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4

fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different type. A DataFrame has both row and column indices and is analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy; to obtain a copy, use the Series' copy method.
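Whether plain column indexing returns a view varies across pandas versions (copy-on-write in pandas 2.x changes this), but an explicit copy is always independent of the original. A small sketch with an illustrative frame:

```python
from pandas import DataFrame

df = DataFrame({'a': [1, 2, 3]})
col = df['a'].copy()  # explicit copy: independent of df
col[0] = 99
print(df['a'][0])     # original frame is unchanged
```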
Create a DataFrame:
data_1 = {'state': ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year': [2012, 2013, 2014, 2014, 2015],
          'pop': [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1
pop | state | year | |
---|---|---|---|
0 | 5.0 | VA | 2012 |
1 | 5.1 | VA | 2013 |
2 | 5.2 | VA | 2014 |
3 | 4.0 | MD | 2014 |
4 | 4.1 | MD | 2015 |
Create a DataFrame specifying a sequence of columns:
df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2
year | state | pop | |
---|---|---|---|
0 | 2012 | VA | 5.0 |
1 | 2013 | VA | 5.1 |
2 | 2014 | VA | 5.2 |
3 | 2014 | MD | 4.0 |
4 | 2015 | MD | 4.1 |
Like Series, columns that are not present in the data are NaN:
df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3
year | state | pop | unempl | |
---|---|---|---|---|
0 | 2012 | VA | 5.0 | NaN |
1 | 2013 | VA | 5.1 | NaN |
2 | 2014 | VA | 5.2 | NaN |
3 | 2014 | MD | 4.0 | NaN |
4 | 2015 | MD | 4.1 | NaN |
Retrieve a column by key, returning a Series:
df_3['state']
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object
Retrieve a column by attribute, returning a Series:
df_3.year
0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: int64
Retrieve a row by position (note: ix is deprecated in newer pandas; use loc or iloc):
df_3.ix[0]
year      2012
state       VA
pop          5
unempl     NaN
Name: 0, dtype: object
Update a column by assignment:
df_3['unempl'] = np.arange(5)
df_3
year | state | pop | unempl | |
---|---|---|---|---|
0 | 2012 | VA | 5.0 | 0 |
1 | 2013 | VA | 5.1 | 1 |
2 | 2014 | VA | 5.2 | 2 |
3 | 2014 | MD | 4.0 | 3 |
4 | 2015 | MD | 4.1 | 4 |
Assign a Series to a column (note if assigning a list or array, the length must match the DataFrame, unlike a Series):
unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3
year | state | pop | unempl | |
---|---|---|---|---|
0 | 2012 | VA | 5.0 | NaN |
1 | 2013 | VA | 5.1 | NaN |
2 | 2014 | VA | 5.2 | 6.0 |
3 | 2014 | MD | 4.0 | 6.0 |
4 | 2015 | MD | 4.1 | 6.1 |
Assigning to a column that doesn't exist creates a new column:
df_3['state_dup'] = df_3['state']
df_3
year | state | pop | unempl | state_dup | |
---|---|---|---|---|---|
0 | 2012 | VA | 5.0 | NaN | VA |
1 | 2013 | VA | 5.1 | NaN | VA |
2 | 2014 | VA | 5.2 | 6.0 | VA |
3 | 2014 | MD | 4.0 | 6.0 | MD |
4 | 2015 | MD | 4.1 | 6.1 | MD |
Delete a column:
del df_3['state_dup']
df_3
year | state | pop | unempl | |
---|---|---|---|---|
0 | 2012 | VA | 5.0 | NaN |
1 | 2013 | VA | 5.1 | NaN |
2 | 2014 | VA | 5.2 | 6.0 |
3 | 2014 | MD | 4.0 | 6.0 |
4 | 2015 | MD | 4.1 | 6.1 |
Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
pop = {'VA': {2013: 5.1, 2014: 5.2},
       'MD': {2014: 4.0, 2015: 4.1}}
df_4 = DataFrame(pop)
df_4
MD | VA | |
---|---|---|
2013 | NaN | 5.1 |
2014 | 4.0 | 5.2 |
2015 | 4.1 | NaN |
Transpose the DataFrame:
df_4.T
2013 | 2014 | 2015 | |
---|---|---|---|
MD | NaN | 4.0 | 4.1 |
VA | 5.1 | 5.2 | NaN |
Create a DataFrame from a dict of Series:
data_2 = {'VA': df_4['VA'][1:],
          'MD': df_4['MD'][2:]}
df_5 = DataFrame(data_2)
df_5
MD | VA | |
---|---|---|
2014 | NaN | 5.2 |
2015 | 4.1 | NaN |
Set the DataFrame index name:
df_5.index.name = 'year'
df_5
MD | VA | |
---|---|---|
year | ||
2014 | NaN | 5.2 |
2015 | 4.1 | NaN |
Set the DataFrame columns name:
df_5.columns.name = 'state'
df_5
state | MD | VA |
---|---|---|
year | ||
2014 | NaN | 5.2 |
2015 | 4.1 | NaN |
Return the data contained in a DataFrame as a 2D ndarray:
df_5.values
array([[ nan,  5.2],
       [ 4.1,  nan]])
If the columns have different dtypes, the 2D ndarray's dtype will accommodate all of the columns:
df_3.values
array([[2012, 'VA', 5.0, nan],
       [2013, 'VA', 5.1, nan],
       [2014, 'VA', 5.2, 6.0],
       [2014, 'MD', 4.0, 6.0],
       [2015, 'MD', 4.1, 6.1]], dtype=object)
Create a new object with the data conformed to a new index. Any missing values are set to NaN.
df_3
year | state | pop | unempl | |
---|---|---|---|---|
0 | 2012 | VA | 5.0 | NaN |
1 | 2013 | VA | 5.1 | NaN |
2 | 2014 | VA | 5.2 | 6.0 |
3 | 2014 | MD | 4.0 | 6.0 |
4 | 2015 | MD | 4.1 | 6.1 |
Reindexing rows returns a new frame with the specified index:
df_3.reindex(list(reversed(range(0, 6))))
year | state | pop | unempl | |
---|---|---|---|---|
5 | NaN | NaN | NaN | NaN |
4 | 2015 | MD | 4.1 | 6.1 |
3 | 2014 | MD | 4.0 | 6.0 |
2 | 2014 | VA | 5.2 | 6.0 |
1 | 2013 | VA | 5.1 | NaN |
0 | 2012 | VA | 5.0 | NaN |
Missing values can be set to something other than NaN:
df_3.reindex(list(reversed(range(0, 6))), fill_value=0)

year | state | pop | unempl | |
---|---|---|---|---|
5 | 0 | 0 | 0.0 | 0.0 |
4 | 2015 | MD | 4.1 | 6.1 |
3 | 2014 | MD | 4.0 | 6.0 |
2 | 2014 | VA | 5.2 | 6.0 |
1 | 2013 | VA | 5.1 | NaN |
0 | 2012 | VA | 5.0 | NaN |
Interpolate ordered data like a time series:
ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])

ser_5.reindex(range(5), method='ffill')

0    foo
1    foo
2    bar
3    bar
4    baz
dtype: object

ser_5.reindex(range(5), method='bfill')

0    foo
1    bar
2    bar
3    baz
4    baz
dtype: object
Reindex columns:
df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | 5.0 | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
Reindex rows and columns while filling rows:
df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])
state | pop | unempl | year | |
---|---|---|---|---|
5 | 0 | 0.0 | 0.0 | 0 |
4 | MD | 4.1 | 6.1 | 2015 |
3 | MD | 4.0 | 6.0 | 2014 |
2 | VA | 5.2 | 6.0 | 2014 |
1 | VA | 5.1 | NaN | 2013 |
0 | VA | 5.0 | NaN | 2012 |
Reindex using ix:
df_6 = df_3.ix[range(0, 7), ['state', 'pop', 'unempl', 'year']]
df_6
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | 5.0 | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN |
Drop rows from a Series or DataFrame:
df_7 = df_6.drop([0, 1])
df_7
state | pop | unempl | year | |
---|---|---|---|---|
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN |
Drop columns from a DataFrame:
df_7 = df_7.drop('unempl', axis=1)
df_7
state | pop | year | |
---|---|---|---|
2 | VA | 5.2 | 2014 |
3 | MD | 4.0 | 2014 |
4 | MD | 4.1 | 2015 |
5 | NaN | NaN | NaN |
6 | NaN | NaN | NaN |
Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.
ser_2
a    1
b    1
c    2
d   -3
e   -5
dtype: int64
Select a value from a Series:
ser_2[0] == ser_2['a']
True
Select a slice from a Series:
ser_2[1:4]

b    1
c    2
d   -3
dtype: int64
Select specific values from a Series:
ser_2[['b', 'c', 'd']]

b    1
c    2
d   -3
dtype: int64
Select from a Series based on a filter:
ser_2[ser_2 > 0]

a    1
b    1
c    2
dtype: int64
Select a slice from a Series with labels (note the end point is inclusive):
ser_2['a':'b']
a    1
b    1
dtype: int64
Assign to a Series slice (note the end point is inclusive):
ser_2['a':'b'] = 0
ser_2

a    0
b    0
c    2
d   -3
e   -5
dtype: int64
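The endpoint difference between positional and label-based slices can be checked side by side. A small sketch using a fresh Series (since ser_2 was just modified above; the variable name is illustrative):

```python
from pandas import Series

s = Series([10, 20, 30], index=['a', 'b', 'c'])
print(len(s[0:2]))      # positional slice: end point excluded
print(len(s['a':'c']))  # label slice: end point included
```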
Pandas supports indexing into a DataFrame.
df_6
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | 5.0 | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN |
Select specified columns from a DataFrame:
df_6[['pop', 'unempl']]
pop | unempl | |
---|---|---|
0 | 5.0 | NaN |
1 | 5.1 | NaN |
2 | 5.2 | 6.0 |
3 | 4.0 | 6.0 |
4 | 4.1 | 6.1 |
5 | NaN | NaN |
6 | NaN | NaN |
Select a slice from a DataFrame:
df_6[:2]
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | 5.0 | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
Select from a DataFrame based on a filter:
df_6[df_6['pop'] > 5]
state | pop | unempl | year | |
---|---|---|---|---|
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6 | 2014 |
Perform a scalar comparison on a DataFrame:
df_6 > 5
state | pop | unempl | year | |
---|---|---|---|---|
0 | True | False | False | True |
1 | True | True | False | True |
2 | True | True | True | True |
3 | True | False | True | True |
4 | True | False | True | True |
5 | True | False | False | False |
6 | True | False | False | False |
Perform a scalar comparison on a DataFrame, retain the values that pass the filter:
df_6[df_6 > 5]
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | NaN | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | NaN | 6.0 | 2014 |
4 | MD | NaN | 6.1 | 2015 |
5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN |
Select a slice of rows from a DataFrame (note the end point is inclusive):
df_6.ix[2:3]
state | pop | unempl | year | |
---|---|---|---|---|
2 | VA | 5.2 | 6 | 2014 |
3 | MD | 4.0 | 6 | 2014 |
Select a slice of rows from a specific column of a DataFrame:
df_6.ix[0:2, 'pop']

0    5.0
1    5.1
2    5.2
Name: pop, dtype: float64
Select rows based on a comparison against a specific column:

df_6.ix[df_6.unempl > 5.0]
state | pop | unempl | year | |
---|---|---|---|---|
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:
np.random.seed(0)
ser_6 = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
ser_6

a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64
np.random.seed(1)
ser_7 = Series(np.random.randn(5), index=['a', 'c', 'e', 'f', 'g'])
ser_7

a    1.624345
c   -0.611756
e   -0.528172
f   -1.072969
g    0.865408
dtype: float64
ser_6 + ser_7

a    3.388398
b         NaN
c    0.366982
d         NaN
e    1.339386
f         NaN
g         NaN
dtype: float64
Set a fill value instead of NaN for indices that do not overlap:
ser_6.add(ser_7, fill_value=0)

a    3.388398
b    0.400157
c    0.366982
d    2.240893
e    1.339386
f   -1.072969
g    0.865408
dtype: float64
Adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:
np.random.seed(0)
df_8 = DataFrame(np.random.rand(9).reshape((3, 3)), columns=['a', 'b', 'c'])
df_8
a | b | c | |
---|---|---|---|
0 | 0.548814 | 0.715189 | 0.602763 |
1 | 0.544883 | 0.423655 | 0.645894 |
2 | 0.437587 | 0.891773 | 0.963663 |
np.random.seed(1)
df_9 = DataFrame(np.random.rand(9).reshape((3, 3)), columns=['b', 'c', 'd'])
df_9
b | c | d | |
---|---|---|---|
0 | 0.417022 | 0.720324 | 0.000114 |
1 | 0.302333 | 0.146756 | 0.092339 |
2 | 0.186260 | 0.345561 | 0.396767 |
df_8 + df_9
a | b | c | d | |
---|---|---|---|---|
0 | NaN | 1.132211 | 1.323088 | NaN |
1 | NaN | 0.725987 | 0.792650 | NaN |
2 | NaN | 1.078033 | 1.309223 | NaN |
Set a fill value instead of NaN for indices that do not overlap:
df_10 = df_8.add(df_9, fill_value=0)
df_10
a | b | c | d | |
---|---|---|---|---|
0 | 0.548814 | 1.132211 | 1.323088 | 0.000114 |
1 | 0.544883 | 0.725987 | 0.792650 | 0.092339 |
2 | 0.437587 | 1.078033 | 1.309223 | 0.396767 |
Like NumPy, pandas supports arithmetic operations between DataFrames and Series.
Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
ser_8 = df_10.ix[0]
df_11 = df_10 - ser_8
df_11
a | b | c | d | |
---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | -0.003930 | -0.406224 | -0.530438 | 0.092224 |
2 | -0.111226 | -0.054178 | -0.013864 | 0.396653 |
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:
ser_9 = Series(range(3), index=['a', 'd', 'e'])
ser_9

a    0
d    1
e    2
dtype: int64
df_11 - ser_9
a | b | c | d | e | |
---|---|---|---|---|---|
0 | 0.000000 | NaN | NaN | -1.000000 | NaN |
1 | -0.003930 | NaN | NaN | -0.907776 | NaN |
2 | -0.111226 | NaN | NaN | -0.603347 | NaN |
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
df_10
a | b | c | d | |
---|---|---|---|---|
0 | 0.548814 | 1.132211 | 1.323088 | 0.000114 |
1 | 0.544883 | 0.725987 | 0.792650 | 0.092339 |
2 | 0.437587 | 1.078033 | 1.309223 | 0.396767 |
ser_10 = Series([100, 200, 300])
ser_10

0    100
1    200
2    300
dtype: int64
df_10.sub(ser_10, axis=0)
a | b | c | d | |
---|---|---|---|---|
0 | -99.451186 | -98.867789 | -98.676912 | -99.999886 |
1 | -199.455117 | -199.274013 | -199.207350 | -199.907661 |
2 | -299.562413 | -298.921967 | -298.690777 | -299.603233 |
NumPy ufuncs (element-wise array methods) operate on pandas objects:
df_11 = np.abs(df_11)
df_11
a | b | c | d | |
---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.003930 | 0.406224 | 0.530438 | 0.092224 |
2 | 0.111226 | 0.054178 | 0.013864 | 0.396653 |
Apply a function on 1D arrays to each column:
func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)

a    0.111226
b    0.406224
c    0.530438
d    0.396653
dtype: float64
Apply a function on 1D arrays to each row:
df_11.apply(func_1, axis=1)

0    0.000000
1    0.526508
2    0.382789
dtype: float64
Apply a function and return a DataFrame:
func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)
a | b | c | d | |
---|---|---|---|---|
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 0.111226 | 0.406224 | 0.530438 | 0.396653 |
Apply an element-wise Python function to a DataFrame:
func_3 = lambda x: '%.2f' % x
df_11.applymap(func_3)
a | b | c | d | |
---|---|---|---|---|
0 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 0.00 | 0.41 | 0.53 | 0.09 |
2 | 0.11 | 0.05 | 0.01 | 0.40 |
Apply an element-wise Python function to a Series:
df_11['a'].map(func_3)

0    0.00
1    0.00
2    0.11
Name: a, dtype: object
ser_4
fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
Sort a Series by its index:
ser_4.sort_index()
br    200
bz    300
fo    100
qx    NaN
Name: foobarbazqux, dtype: float64
Sort a Series by its values:
ser_4.sort_values()
fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12
c | a | b | d | |
---|---|---|---|---|
three | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
two | 8 | 9 | 10 | 11 |
Sort a DataFrame by its index:
df_12.sort_index()
c | a | b | d | |
---|---|---|---|---|
one | 4 | 5 | 6 | 7 |
three | 0 | 1 | 2 | 3 |
two | 8 | 9 | 10 | 11 |
Sort a DataFrame by columns in descending order:
df_12.sort_index(axis=1, ascending=False)
d | c | b | a | |
---|---|---|---|---|
three | 3 | 0 | 2 | 1 |
one | 7 | 4 | 6 | 5 |
two | 11 | 8 | 10 | 9 |
Sort a DataFrame's values by column:
df_12.sort_values(by=['d', 'c'])
c | a | b | d | |
---|---|---|---|---|
three | 0 | 1 | 2 | 3 |
one | 4 | 5 | 6 | 7 |
two | 8 | 9 | 10 | 11 |
Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:
ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11

1   -5
5    0
4    2
3    4
6    4
0    7
2    7
7    7
dtype: int64
ser_11.rank()
1    1.0
5    2.0
4    3.0
3    4.5
6    4.5
0    7.0
2    7.0
7    7.0
dtype: float64
Rank a Series according to the order in which values appear in the data:
ser_11.rank(method='first')
1    1
5    2
4    3
3    4
6    5
0    6
2    7
7    8
dtype: float64
Rank a Series in descending order, using the maximum rank for the group:
ser_11.rank(ascending=False,method='max')
1    8
5    7
4    6
3    5
6    5
0    3
2    3
7    3
dtype: float64
DataFrames can rank over rows or columns.
df_13 = DataFrame({'foo': [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar': [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz': [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
bar | baz | foo | |
---|---|---|---|
0 | -5 | -1 | 7 |
1 | 4 | 2 | -5 |
2 | 2 | 3 | 7 |
3 | 0 | 0 | 4 |
4 | 4 | 5 | 2 |
5 | 7 | 9 | 0 |
6 | 7 | 9 | 4 |
7 | 8 | 5 | 7 |
Rank a DataFrame over rows:
df_13.rank()
bar | baz | foo | |
---|---|---|---|
0 | 1.0 | 1.0 | 7.0 |
1 | 4.5 | 3.0 | 1.0 |
2 | 3.0 | 4.0 | 7.0 |
3 | 2.0 | 2.0 | 4.5 |
4 | 4.5 | 5.5 | 3.0 |
5 | 6.5 | 7.5 | 2.0 |
6 | 6.5 | 7.5 | 4.5 |
7 | 8.0 | 5.5 | 7.0 |
Rank a DataFrame over columns:
df_13.rank(axis=1)
bar | baz | foo | |
---|---|---|---|
0 | 1.0 | 2.0 | 3 |
1 | 3.0 | 2.0 | 1 |
2 | 1.0 | 2.0 | 3 |
3 | 1.5 | 1.5 | 3 |
4 | 2.0 | 3.0 | 1 |
5 | 2.0 | 3.0 | 1 |
6 | 2.0 | 3.0 | 1 |
7 | 3.0 | 1.0 | 2 |
Labels do not have to be unique in Pandas:
ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12

foo    0
foo    1
bar    2
bar    3
baz    4
dtype: int64
ser_12.index.is_unique
False
Select Series elements:
ser_12['foo']
foo    0
foo    1
dtype: int64
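With duplicate labels, selecting a repeated label returns a Series of all matches, while a unique label returns a scalar. A small sketch (the variable name is illustrative):

```python
from pandas import Series

ser = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
print(ser['foo'])  # duplicated label -> Series of all matches
print(ser['baz'])  # unique label -> scalar value
```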
Select DataFrame elements:
df_14 = DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
0 | 1 | 2 | 3 | |
---|---|---|---|---|
foo | -2.363469 | 1.135345 | -1.017014 | 0.637362 |
foo | -0.859907 | 1.772608 | -1.110363 | 0.181214 |
bar | 0.564345 | -0.566510 | 0.729976 | 0.372994 |
bar | 0.533811 | -0.091973 | 1.913820 | 0.330797 |
baz | 1.141943 | -1.129595 | -0.850052 | 0.960820 |
df_14.ix['bar']
0 | 1 | 2 | 3 | |
---|---|---|---|---|
bar | 0.564345 | -0.566510 | 0.729976 | 0.372994 |
bar | 0.533811 | -0.091973 | 1.913820 | 0.330797 |
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.
df_6
state | pop | unempl | year | |
---|---|---|---|---|
0 | VA | 5.0 | NaN | 2012 |
1 | VA | 5.1 | NaN | 2013 |
2 | VA | 5.2 | 6.0 | 2014 |
3 | MD | 4.0 | 6.0 | 2014 |
4 | MD | 4.1 | 6.1 | 2015 |
5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN |
df_6.sum()
pop          23.4
unempl       18.1
year      10068.0
dtype: float64
Sum over the rows:
df_6.sum(axis=1)
0    2017.0
1    2018.1
2    2025.2
3    2024.0
4    2025.2
5       0.0
6       0.0
dtype: float64
Account for NaNs:
df_6.sum(axis=1, skipna=False)

0       NaN
1       NaN
2    2025.2
3    2024.0
4    2025.2
5       NaN
6       NaN
dtype: float64
from pandas import Series, DataFrame
import pandas as pd
Setup a DataFrame:
data_1 = {'state': ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year': [2012, 2013, 2014, 2014, 2015],
          'population': [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1
population | state | year | |
---|---|---|---|
0 | 5.0 | VA | 2012 |
1 | 5.1 | VA | 2013 |
2 | 5.2 | VA | 2014 |
3 | 4.0 | MD | 2014 |
4 | 4.1 | MD | 2015 |
Replace all occurrences of a string with another string, in place (no copy):
df_1.replace('VA', 'VIRGINIA', inplace=True)
df_1
population | state | year | |
---|---|---|---|
0 | 5.0 | VIRGINIA | 2012 |
1 | 5.1 | VIRGINIA | 2013 |
2 | 5.2 | VIRGINIA | 2014 |
3 | 4.0 | MD | 2014 |
4 | 4.1 | MD | 2015 |
In a specified column, replace all occurrences of a string with another string, in place (no copy):
df_1.replace({'state': {'MD': 'MARYLAND'}}, inplace=True)
df_1
population | state | year | |
---|---|---|---|
0 | 5.0 | VIRGINIA | 2012 |
1 | 5.1 | VIRGINIA | 2013 |
2 | 5.2 | VIRGINIA | 2014 |
3 | 4.0 | MARYLAND | 2014 |
4 | 4.1 | MARYLAND | 2015 |
Drop the 'population' column and return a copy of the DataFrame:
df_2 = df_1.drop('population', axis=1)
df_2
state | year | |
---|---|---|
0 | VIRGINIA | 2012 |
1 | VIRGINIA | 2013 |
2 | VIRGINIA | 2014 |
3 | MARYLAND | 2014 |
4 | MARYLAND | 2015 |
Concatenate two DataFrames:
data_2 = {'state': ['NY', 'NY', 'NY', 'FL', 'FL'],
          'year': [2012, 2013, 2014, 2014, 2015],
          'population': [6.0, 6.1, 6.2, 3.0, 3.1]}
df_3 = DataFrame(data_2)
df_3
population | state | year | |
---|---|---|---|
0 | 6.0 | NY | 2012 |
1 | 6.1 | NY | 2013 |
2 | 6.2 | NY | 2014 |
3 | 3.0 | FL | 2014 |
4 | 3.1 | FL | 2015 |
df_4 = pd.concat([df_1, df_3])
df_4
population | state | year | |
---|---|---|---|
0 | 5.0 | VIRGINIA | 2012 |
1 | 5.1 | VIRGINIA | 2013 |
2 | 5.2 | VIRGINIA | 2014 |
3 | 4.0 | MARYLAND | 2014 |
4 | 4.1 | MARYLAND | 2015 |
0 | 6.0 | NY | 2012 |
1 | 6.1 | NY | 2013 |
2 | 6.2 | NY | 2014 |
3 | 3.0 | FL | 2014 |
4 | 3.1 | FL | 2015 |
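Note that the concatenated frame above keeps the original row labels, so the index contains duplicates; passing ignore_index=True renumbers the rows. A sketch with small illustrative frames:

```python
import pandas as pd
from pandas import DataFrame

df_a = DataFrame({'x': [1, 2]})
df_b = DataFrame({'x': [3, 4]})
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.index.tolist())  # rows renumbered 0..3
```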
from pandas import Series, DataFrame
import pandas as pd
Read data from a CSV file into a DataFrame (use sep='\t' for TSV):
df_1 = pd.read_csv("../data/ozone.csv")
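read_csv also accepts any file-like object, which makes for a self-contained example without a file on disk. A sketch using io.StringIO and illustrative column names:

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2): two data rows, two columns
```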
Get a summary of the DataFrame:
df_1.describe()
Ozone | Solar.R | Wind | Temp | Month | Day | |
---|---|---|---|---|---|---|
count | 116.000000 | 146.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 |
mean | 42.129310 | 185.931507 | 9.957516 | 77.882353 | 6.993464 | 15.803922 |
std | 32.987885 | 90.058422 | 3.523001 | 9.465270 | 1.416522 | 8.864520 |
min | 1.000000 | 7.000000 | 1.700000 | 56.000000 | 5.000000 | 1.000000 |
25% | 18.000000 | 115.750000 | 7.400000 | 72.000000 | 6.000000 | 8.000000 |
50% | 31.500000 | 205.000000 | 9.700000 | 79.000000 | 7.000000 | 16.000000 |
75% | 63.250000 | 258.750000 | 11.500000 | 85.000000 | 8.000000 | 23.000000 |
max | 168.000000 | 334.000000 | 20.700000 | 97.000000 | 9.000000 | 31.000000 |
List the first five rows of the DataFrame:
df_1.head()
Ozone | Solar.R | Wind | Temp | Month | Day | |
---|---|---|---|---|---|---|
0 | 41 | 190 | 7.4 | 67 | 5 | 1 |
1 | 36 | 118 | 8.0 | 72 | 5 | 2 |
2 | 12 | 149 | 12.6 | 74 | 5 | 3 |
3 | 18 | 313 | 11.5 | 62 | 5 | 4 |
4 | NaN | NaN | 14.3 | 56 | 5 | 5 |
Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:
df_1.to_csv('../data/ozone_copy.csv', encoding='utf-8', index=False, header=False)
View the data directory:
!ls -l ../data/
total 1016
-rw-r--r--   1 donnemartin  staff  437903 Jul  7  2015 churn.csv
-rwxr-xr-x   1 donnemartin  staff   72050 Jul  7  2015 confusion_matrix.png
-rw-r--r--   1 donnemartin  staff    2902 Jul  7  2015 ozone.csv
-rw-r--r--   1 donnemartin  staff    3324 Apr  1 07:18 ozone_copy.csv
drwxr-xr-x  10 donnemartin  staff     340 Jul  7  2015 titanic