Comparison with R / R libraries#

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:

  • Functionality / flexibility: what can/cannot be done with each tool

  • Performance: how fast are operations. Hard numbers/benchmarks are preferable

  • Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)

This page is also here to offer a bit of a translation guide for users of these R packages.

Quick reference#

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, filtering, sampling#

R

pandas

dim(df)

df.shape

head(df)

df.head()

slice(df,1:10)

df.iloc[:9]

filter(df,col1==1,col2==1)

df.query('col1==1&col2==1')

df[df$col1==1&df$col2==1,]

df[(df.col1==1)&(df.col2==1)]

select(df,col1,col2)

df[['col1','col2']]

select(df,col1:col3)

df.loc[:,'col1':'col3']

select(df,-(col1:col3))

df.drop(cols_to_drop,axis=1) but see [1]

distinct(select(df,col1))

df[['col1']].drop_duplicates()

distinct(select(df,col1,col2))

df[['col1','col2']].drop_duplicates()

sample_n(df,10)

df.sample(n=10)

sample_frac(df,0.01)

df.sample(frac=0.01)

Sorting#

R

pandas

arrange(df,col1,col2)

df.sort_values(['col1','col2'])

arrange(df,desc(col1))

df.sort_values('col1',ascending=False)

Transforming#

R

pandas

select(df,col_one=col1)

df.rename(columns={'col1':'col_one'})['col_one']

rename(df,col_one=col1)

df.rename(columns={'col1':'col_one'})

mutate(df,c=a-b)

df.assign(c=df['a']-df['b'])

Grouping and summarizing#

R

pandas

summary(df)

df.describe()

gdf<-group_by(df,col1)

gdf=df.groupby('col1')

summarise(gdf,avg=mean(col1,na.rm=TRUE))

df.groupby('col1').agg({'col1':'mean'})

summarise(gdf,total=sum(col1))

df.groupby('col1').sum()

Base R#

Slicing with R’s c#

R makes it easy to access data.frame columns by name

df<-data.frame(a=rnorm(5),b=rnorm(5),c=rnorm(5),d=rnorm(5),e=rnorm(5))df[,c("a","c","e")]

or by integer location

df<-data.frame(matrix(rnorm(1000),ncol=100))df[,c(1:10,25:30,40,50:100)]

Selecting multiple columns by name in pandas is straightforward

In [1]: df=pd.DataFrame(np.random.randn(10,3),columns=list("abc"))In [2]: df[["a","c"]]Out[2]:  a c0 0.469112 -1.5090591 -1.135632 -0.1732152 0.119209 -0.8618493 -2.104569 1.0718044 0.721555 -1.0395755 0.271860 0.5670206 0.276232 -0.6736907 0.113648 0.5249888 0.404705 -1.7150029 -1.039268 -1.157892In [3]: df.loc[:,["a","c"]]Out[3]:  a c0 0.469112 -1.5090591 -1.135632 -0.1732152 0.119209 -0.8618493 -2.104569 1.0718044 0.721555 -1.0395755 0.271860 0.5670206 0.276232 -0.6736907 0.113648 0.5249888 0.404705 -1.7150029 -1.039268 -1.157892

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

In [4]: named=list("abcdefg")In [5]: n=30In [6]: columns=named+np.arange(len(named),n).tolist()In [7]: df=pd.DataFrame(np.random.randn(n,n),columns=columns)In [8]: df.iloc[:,np.r_[:10,24:30]]Out[8]:  a b c ... 27 28 290 -1.344312 0.844885 1.075770 ... 0.813850 0.132003 -0.8273171 -0.076467 -1.187678 1.130127 ... 0.149748 -0.732339 0.6877382 0.176444 0.403310 -0.154951 ... -0.493662 0.600178 0.2742303 0.132885 -0.023688 2.410179 ... 0.109121 1.126203 -0.9773494 1.474071 -0.064034 -1.282782 ... -0.858447 0.306996 -0.028665.. ... ... ... ... ... ... ...25 1.492125 -0.068190 0.681456 ... 0.428572 0.880609 0.48764526 0.725238 0.624607 -0.141185 ... 1.008500 1.424017 0.71711027 1.262419 1.950057 0.301038 ... 1.007824 2.826008 1.45838328 -1.585746 -0.899734 0.921494 ... 0.577223 -1.088417 0.32668729 -0.986248 0.169729 -1.158091 ... -2.013086 -1.602549 0.333109[30 rows x 16 columns]

aggregate#

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

df<-data.frame(v1=c(1,3,5,7,8,3,5,NA,4,5,7,9),v2=c(11,33,55,77,88,33,55,NA,44,55,77,99),by1=c("red","blue",1,2,NA,"big",1,2,"red",1,NA,12),by2=c("wet","dry",99,95,NA,"damp",95,99,"red",99,NA,NA))aggregate(x=df[,c("v1","v2")],by=list(mydf2$by1,mydf2$by2),FUN=mean)

The groupby() method is similar to base R aggregate function.

In [9]: df=pd.DataFrame( ...: { ...: "v1":[1,3,5,7,8,3,5,np.nan,4,5,7,9], ...: "v2":[11,33,55,77,88,33,55,np.nan,44,55,77,99], ...: "by1":["red","blue",1,2,np.nan,"big",1,2,"red",1,np.nan,12], ...: "by2":[ ...: "wet", ...: "dry", ...: 99, ...: 95, ...: np.nan, ...: "damp", ...: 95, ...: 99, ...: "red", ...: 99, ...: np.nan, ...: np.nan, ...: ], ...: } ...: ) ...: In [10]: g=df.groupby(["by1","by2"])In [11]: g[["v1","v2"]].mean()Out[11]:  v1 v2by1 by2 1 95 5.0 55.0 99 5.0 55.02 95 7.0 77.0 99 NaN NaNbig damp 3.0 33.0blue dry 3.0 33.0red red 4.0 44.0 wet 1.0 11.0

For more details and examples see the groupby documentation.

match / %in%#

A common way to select data in R is using %in% which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:

s<-0:4s%in%c(2,4)

The isin() method is similar to R %in% operator:

In [12]: s=pd.Series(np.arange(5),dtype=np.float32)In [13]: s.isin([2,4])Out[13]: 0 False1 False2 True3 False4 Truedtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

s<-0:4match(s,c(2,4))

For more details and examples see the reshaping documentation.

tapply#

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:

baseball<-data.frame(team=gl(5,5,labels=paste("Team",LETTERS[1:5])),player=sample(letters,25),batting.average=runif(25,.200,.400))tapply(baseball$batting.average,baseball.example$team,max)

In pandas we may use pivot_table() method to handle this:

In [14]: importrandomIn [15]: importstringIn [16]: baseball=pd.DataFrame( ....: { ....: "team":["team %d"%(x+1)forxinrange(5)]*5, ....: "player":random.sample(list(string.ascii_lowercase),25), ....: "batting avg":np.random.uniform(0.200,0.400,25), ....: } ....: ) ....: In [17]: baseball.pivot_table(values="batting avg",columns="team",aggfunc="max")Out[17]: team team 1 team 2 team 3 team 4 team 5batting avg 0.352134 0.295327 0.397191 0.394457 0.396194

For more details and examples see the reshaping documentation.

subset#

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:

df<-data.frame(a=rnorm(10),b=rnorm(10))subset(df,a<=b)df[df$a<=df$b,]# note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice as well as standard boolean indexing:

In [18]: df=pd.DataFrame({"a":np.random.randn(10),"b":np.random.randn(10)})In [19]: df.query("a <= b")Out[19]:  a b1 0.174950 0.5528872 -0.023167 0.1480843 -0.495291 -0.3002184 -0.860736 0.1973785 -1.134146 1.7207807 -0.290098 0.0835158 0.238636 0.946550In [20]: df[df["a"]<=df["b"]]Out[20]:  a b1 0.174950 0.5528872 -0.023167 0.1480843 -0.495291 -0.3002184 -0.860736 0.1973785 -1.134146 1.7207807 -0.290098 0.0835158 0.238636 0.946550In [21]: df.loc[df["a"]<=df["b"]]Out[21]:  a b1 0.174950 0.5528872 -0.023167 0.1480843 -0.495291 -0.3002184 -0.860736 0.1973785 -1.134146 1.7207807 -0.290098 0.0835158 0.238636 0.946550

For more details and examples see the query documentation.

with#

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df<-data.frame(a=rnorm(10),b=rnorm(10))with(df,a+b)df$a+df$b# same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [22]: df=pd.DataFrame({"a":np.random.randn(10),"b":np.random.randn(10)})In [23]: df.eval("a + b")Out[23]: 0 -0.0914301 -2.4838902 -0.2527283 -0.6264444 -0.2617405 2.1495036 -0.3322147 0.7993318 -2.3772459 2.104677dtype: float64In [24]: df["a"]+df["b"]# same as the previous expressionOut[24]: 0 -0.0914301 -2.4838902 -0.2527283 -0.6264444 -0.2617405 2.1495036 -0.3322147 0.7993318 -2.3772459 2.104677dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.

plyr#

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R, a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.

R

Python

array

list

lists

dictionary or list of objects

data.frame

dataframe

ddply#

An expression using a data.frame called df in R where you want to summarize x by month:

require(plyr)df<-data.frame(x=runif(120,1,168),y=runif(120,7,334),z=runif(120,1.7,20.7),month=rep(c(5,6,7,8),30),week=sample(1:4,120,TRUE))ddply(df,.(month,week),summarize,mean=round(mean(x),2),sd=round(sd(x),2))

In pandas the equivalent expression, using the groupby() method, would be:

In [25]: df=pd.DataFrame( ....: { ....: "x":np.random.uniform(1.0,168.0,120), ....: "y":np.random.uniform(7.0,334.0,120), ....: "z":np.random.uniform(1.7,20.7,120), ....: "month":[5,6,7,8]*30, ....: "week":np.random.randint(1,4,120), ....: } ....: ) ....: In [26]: grouped=df.groupby(["month","week"])In [27]: grouped["x"].agg(["mean","std"])Out[27]:  mean stdmonth week 5 1 63.653367 40.601965 2 78.126605 53.342400 3 92.091886 57.6301106 1 81.747070 54.339218 2 70.971205 54.687287 3 100.968344 54.0100817 1 61.576332 38.844274 2 61.733510 48.209013 3 71.688795 37.5956388 1 62.741922 34.618153 2 91.774627 49.790202 3 73.936856 60.773900

For more details and examples see the groupby documentation.

reshape / reshape2#

meltarray#

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

a<-array(c(1:23,NA),c(2,3,4))data.frame(melt(a))

In Python, since a is a list, you can simply use list comprehension.

In [28]: a=np.array(list(range(1,24))+[np.NAN]).reshape(2,3,4)In [29]: pd.DataFrame([tuple(list(x)+[val])forx,valinnp.ndenumerate(a)])Out[29]:  0 1 2 30 0 0 0 1.01 0 0 1 2.02 0 0 2 3.03 0 0 3 4.04 0 1 0 5.0.. .. .. .. ...19 1 1 3 20.020 1 2 0 21.021 1 2 1 22.022 1 2 2 23.023 1 2 3 NaN[24 rows x 4 columns]

meltlist#

An expression using a list called a in R where you want to melt it into a data.frame:

a<-as.list(c(1:4,NA))data.frame(melt(a))

In Python, this list would be a list of tuples, so DataFrame() method would convert it to a dataframe as required.

In [30]: a=list(enumerate(list(range(1,5))+[np.NAN]))In [31]: pd.DataFrame(a)Out[31]:  0 10 0 1.01 1 2.02 2 3.03 3 4.04 4 NaN

For more details and examples see the Into to Data Structures documentation.

meltdf#

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

cheese<-data.frame(first=c('John','Mary'),last=c('Doe','Bo'),height=c(5.5,6.0),weight=c(130,150))melt(cheese,id=c("first","last"))

In Python, the melt() method is the R equivalent:

In [32]: cheese=pd.DataFrame( ....: { ....: "first":["John","Mary"], ....: "last":["Doe","Bo"], ....: "height":[5.5,6.0], ....: "weight":[130,150], ....: } ....: ) ....: In [33]: pd.melt(cheese,id_vars=["first","last"])Out[33]:  first last variable value0 John Doe height 5.51 Mary Bo height 6.02 John Doe weight 130.03 Mary Bo weight 150.0In [34]: cheese.set_index(["first","last"]).stack(future_stack=True)# alternative wayOut[34]: first last John Doe height 5.5 weight 130.0Mary Bo height 6.0 weight 150.0dtype: float64

For more details and examples see the reshaping documentation.

cast#

In R acast is an expression using a data.frame called df in R to cast into a higher dimensional array:

df<-data.frame(x=runif(12,1,168),y=runif(12,7,334),z=runif(12,1.7,20.7),month=rep(c(5,6,7),4),week=rep(c(1,2),6))mdf<-melt(df,id=c("month","week"))acast(mdf,week~month~variable,mean)

In Python the best way is to make use of pivot_table():

In [35]: df=pd.DataFrame( ....: { ....: "x":np.random.uniform(1.0,168.0,12), ....: "y":np.random.uniform(7.0,334.0,12), ....: "z":np.random.uniform(1.7,20.7,12), ....: "month":[5,6,7]*4, ....: "week":[1,2]*6, ....: } ....: ) ....: In [36]: mdf=pd.melt(df,id_vars=["month","week"])In [37]: pd.pivot_table( ....: mdf, ....: values="value", ....: index=["variable","week"], ....: columns=["month"], ....: aggfunc="mean", ....: ) ....: Out[37]: month 5 6 7variable week x 1 93.888747 98.762034 55.219673 2 94.391427 38.112932 83.942781y 1 94.306912 279.454811 227.840449 2 87.392662 193.028166 173.899260z 1 11.016009 10.079307 16.170549 2 8.476111 17.638509 19.003494

Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

df<-data.frame(Animal=c('Animal1','Animal2','Animal3','Animal2','Animal1','Animal2','Animal3'),FeedType=c('A','B','A','A','B','B','A'),Amount=c(10,7,4,2,5,6,2))dcast(df,Animal~FeedType,sum,fill=NaN)# Alternative method using base Rwith(df,tapply(Amount,list(Animal,FeedType),sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

In [38]: df=pd.DataFrame( ....: { ....: "Animal":[ ....: "Animal1", ....: "Animal2", ....: "Animal3", ....: "Animal2", ....: "Animal1", ....: "Animal2", ....: "Animal3", ....: ], ....: "FeedType":["A","B","A","A","B","B","A"], ....: "Amount":[10,7,4,2,5,6,2], ....: } ....: ) ....: In [39]: df.pivot_table(values="Amount",index="Animal",columns="FeedType",aggfunc="sum")Out[39]: FeedType A BAnimal Animal1 10.0 5.0Animal2 2.0 13.0Animal3 6.0 NaN

The second approach is to use the groupby() method:

In [40]: df.groupby(["Animal","FeedType"])["Amount"].sum()Out[40]: Animal FeedTypeAnimal1 A 10 B 5Animal2 A 2 B 13Animal3 A 6Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.

factor#

pandas has a data type for categorical data.

cut(c(1,2,3,4,5,6),3)factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

In [41]: pd.cut(pd.Series([1,2,3,4,5,6]),3)Out[41]: 0 (0.995, 2.667]1 (0.995, 2.667]2 (2.667, 4.333]3 (2.667, 4.333]4 (4.333, 6.0]5 (4.333, 6.0]dtype: categoryCategories (3, interval[float64, right]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]In [42]: pd.Series([1,2,3,2,2,3]).astype("category")Out[42]: 0 11 22 33 24 25 3dtype: categoryCategories (3, int64): [1, 2, 3]

For more details and examples see categorical introduction and the API documentation. There is also a documentation regarding the differences to R’s factor.