I have the following data stored in HDFS: each row has three columns, id, date, and item, meaning that the person with that id bought that item on that date. The dataset has billions of rows, all of them distinct, and I can query the table via Hive.
Now I want to read this table into memory and transform it into an RDD so that I can process it with Spark.
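This is roughly how I read the table so far. It is only a minimal sketch: I am assuming a Hive-enabled SparkSession and using the placeholder table name `purchases`, which is not the real name.

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured to talk to the Hive metastore;
# "purchases" is a placeholder for the actual table name.
spark = (SparkSession.builder
         .appName("purchases-to-matrices")
         .enableHiveSupport()
         .getOrCreate())

purchases_df = spark.sql("SELECT id, date, item FROM purchases")
purchases_rdd = purchases_df.rdd  # RDD of Row(id, date, item)
```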
Specifically, I want to reshape the table into something like a 3-dimensional ndarray in Python's NumPy: $X_{ijk} = 1$ if person $i$ bought item $k$ on date $j$, otherwise $X_{ijk} = 0$. In other words, I want an RDD in which each record is the date-by-item matrix of 0s and 1s for one person, as illustrated by the toy example below.
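To make the target layout concrete, here is a small NumPy example of the matrix I would like each record to hold for a single person. The dates, items, and purchases are made up for illustration only; rows index dates ($j$) and columns index items ($k$).

```python
import numpy as np

dates = ["2017-01-01", "2017-01-02"]   # j axis (example values)
items = ["apple", "milk", "bread"]     # k axis (example values)
bought = {("2017-01-01", "milk"), ("2017-01-02", "apple")}  # this person's rows

# X_i[j, k] == 1 iff this person bought item k on date j
X_i = np.zeros((len(dates), len(items)), dtype=np.int8)
for j, d in enumerate(dates):
    for k, it in enumerate(items):
        if (d, it) in bought:
            X_i[j, k] = 1

print(X_i)
# [[0 1 0]
#  [1 0 0]]
```

Each record of the RDD should be one such matrix (one per person).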
How can I achieve this in python?