---
jupyter:
  jupytext:
    notebook_metadata_filter: all
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.1'
      jupytext_version: 1.1.1
  kernelspec:
    display_name: Python 2
    language: python
    name: python2
  plotly:
    description: A primer on out-of-memory analytics of large datasets with Pandas,
      SQLite, and IPython notebooks.
    display_as: databases
    has_thumbnail: false
    language: python
    layout: base
    name: Big Data Analytics with Pandas and SQLite
    order: 4
    page_type: example_index
    permalink: python/big-data-analytics-with-pandas-and-sqlite/
    redirect_from: ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/
    thumbnail: /images/static-image
---

#### New to Plotly?

Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!

#### Version Check

Plotly's Python package is updated frequently. Run `pip install plotly --upgrade` to use the latest version.

```python
import plotly
plotly.__version__
```

#### Imports

This notebook explores a 3.9 GB CSV file containing NYC's 311 complaints since 2003. It's the most popular dataset in NYC's open data portal. This is a primer on out-of-memory data analysis with:

- pandas: a library with easy-to-use data structures and data-analysis tools. It also interfaces with out-of-memory databases like SQLite.
- IPython notebook: an interface for writing and sharing Python code, text, and plots.
- SQLite: a self-contained, serverless database that's easy to set up and query from Pandas.
- Plotly: a platform for publishing beautiful, interactive graphs from Python to the web.

The dataset is too large to load into a Pandas dataframe. So, instead we'll perform the aggregations out of memory with SQLite and load each (small) result directly into a dataframe with Pandas' SQL io tools. It's easy to stream a CSV into SQLite, and SQLite requires no setup. The SQL query language is fairly intuitive coming from a Pandas mindset.
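In miniature, the whole workflow is just two moves: stream the CSV into SQLite, then let SQLite do the heavy lifting and pull back only small results. A minimal sketch of that pattern, assuming a hypothetical small file `example.csv` and table name `data`:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///example.db')  # file-backed SQLite database

# 1. Stream the CSV into SQLite in chunks, so memory use stays bounded
for chunk in pd.read_csv('example.csv', chunksize=10000):
    chunk.to_sql('data', engine, if_exists='append')

# 2. Aggregate inside SQLite; only the small result enters memory
counts = pd.read_sql_query('SELECT COUNT(*) FROM data', engine)
```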

```python
import plotly.tools as tls
tls.embed('https://plot.ly/~chris/7365')
```

```python
import pandas as pd
from sqlalchemy import create_engine  # database connection
import datetime as dt
from IPython.display import display

import plotly.plotly as py  # interactive graphing
import plotly.graph_objs as go
```

#### Import the CSV data into SQLite

1. Load the CSV, chunk by chunk, into a DataFrame.
2. Process the data a bit, stripping out the uninteresting columns.
3. Append it to the SQLite database.

```python
pd.read_csv('311_100M.csv', nrows=2).head()
```

```python
!wc -l < 311_100M.csv  # Number of lines in dataset
```

```python
disk_engine = create_engine('sqlite:///311_8M.db')  # Initializes a database with filename 311_8M.db in the current directory
```

```python
start = dt.datetime.now()
chunksize = 20000
j = 0
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):

    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # Remove spaces from column names

    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # Convert to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])

    df.index += index_start

    # Keep only the interesting columns
    columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType',
               'Descriptor', 'TimeToCompletion', 'City']

    for c in df.columns:
        if c not in columns:
            df = df.drop(c, axis=1)

    j += 1
    print '{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize)

    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
```
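A quick sanity check after the load: count the rows inside SQLite and compare against the `wc -l` line count above. This is a hedged sketch; it only assumes the `data` table created by `to_sql`.

```python
# The count runs entirely inside SQLite; only a one-row result comes back
pd.read_sql_query('SELECT COUNT(*) FROM data', disk_engine)
```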
#### LIMIT the number of rows that are retrieved

```python
df = pd.read_sql_query('SELECT Agency, Descriptor FROM data LIMIT 3', disk_engine)
```

```python
df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '
                       'FROM data '
                       'LIMIT 10', disk_engine)
```
#### Filter rows with WHERE

```python
df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '
                       'FROM data '
                       'WHERE Agency = "NYPD" '
                       'LIMIT 10', disk_engine)
```
#### Filter multiple values in a column with WHERE and IN

```python
df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '
                       'FROM data '
                       'WHERE Agency IN ("NYPD", "DOB") '
                       'LIMIT 10', disk_engine)
```
#### Find the unique values in a column with DISTINCT

```python
df = pd.read_sql_query('SELECT DISTINCT City FROM data', disk_engine)
```
#### Query value counts with COUNT(*) and GROUP BY

```python
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency', disk_engine)
```
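For readers coming from the Pandas side: this GROUP BY/COUNT(*) pattern is the SQL analogue of `value_counts()`. A hedged comparison, assuming a hypothetical dataframe `small_df` that fits in memory:

```python
# In-memory equivalent of the GROUP BY above
num_complaints = small_df['Agency'].value_counts()
```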
#### Order the results with ORDER BY and -

Housing and Development Dept receives the most complaints.

```python
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)

py.iplot([go.Bar(x=df.Agency, y=df.num_complaints)], filename='311/most common complaints by agency')
```
#### Heat / Hot Water is the most common complaint

```python
df = pd.read_sql_query('SELECT ComplaintType, COUNT(*) as `num_complaints`, Agency '
                       'FROM data '
                       'GROUP BY `ComplaintType` '
                       'ORDER BY -num_complaints', disk_engine)

most_common_complaints = df  # used later

py.iplot({
    'data': [go.Bar(x=df['ComplaintType'], y=df.num_complaints)],
    'layout': {
        'margin': {'b': 150},  # Make the bottom margin a bit bigger to handle the long text
        'xaxis': {'tickangle': 40}  # Angle the labels a bit
    }}, filename='311/most common complaints by complaint type')
```

This graph is interactive: click and drag horizontally to zoom, shift-click to pan, and double-click to autoscale.

#### What's the most common complaint in each city?

First, let's see how many cities are recorded in the dataset:

```python
len(pd.read_sql_query('SELECT DISTINCT City FROM data', disk_engine))
```

Yikes - let's just plot the 10 most complained-about cities:

```python
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10', disk_engine)
```

Flushing and FLUSHING, Jamaica and JAMAICA... the city names are case sensitive.

#### Perform case-insensitive queries with GROUP BY and COLLATE NOCASE

```python
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'COLLATE NOCASE '
                       'ORDER BY -num_complaints '
                       'LIMIT 11', disk_engine)

cities = list(df.City)
cities.remove(None)  # drop the null City entry (hence LIMIT 11 above)

traces = []  # the series in the graph - one trace for each city

for city in cities:
    df = pd.read_sql_query('SELECT ComplaintType, COUNT(*) as `num_complaints` '
                           'FROM data '
                           'WHERE City = "{}" COLLATE NOCASE '
                           'GROUP BY `ComplaintType` '
                           'ORDER BY -num_complaints'.format(city), disk_engine)

    traces.append(go.Bar(x=df['ComplaintType'], y=df.num_complaints, name=city.capitalize()))

py.iplot({'data': traces,
          'layout': go.Layout(barmode='stack',
                              xaxis={'tickangle': 40},
                              margin={'b': 150})},
         filename='311/complaints by city stacked')
```

You can also click on the legend entries to hide/show the traces. Click-and-drag to zoom in and shift-drag to pan.

Now let's normalize these counts. This is super easy now that the data has been reduced into a dataframe.

```python
for trace in traces:
    trace['y'] = 100. * trace['y'] / sum(trace['y'])

py.iplot({'data': traces,
          'layout': go.Layout(
              barmode='group',
              xaxis={'tickangle': 40, 'autorange': False, 'range': [-0.5, 16]},
              yaxis={'title': 'Percent of Complaints by City'},
              margin={'b': 150},
              title='Relative Number of 311 Complaints by City')},
         filename='311/relative complaints by city', validate=False)
```
- New York is loud
- Staten Island is moldy, wet, and vacant
- Flushing's muni meters are broken
- Trash collection is great in the Bronx
- Woodside doesn't like its graffiti

Click and drag to pan across the graph and see more of the complaints.

#### Part 2: SQLite time series with Pandas

#### Filter SQLite rows with timestamp strings: `YYYY-MM-DD hh:mm:ss`

```python
df = pd.read_sql_query('SELECT ComplaintType, CreatedDate, City '
                       'FROM data '
                       'WHERE CreatedDate < "2014-11-16 23:47:00" '
                       'AND CreatedDate > "2014-11-16 23:45:00"', disk_engine)
```
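String comparison works here because ISO-8601 timestamps sort lexicographically in the same order as chronologically. As a hedged alternative, the same window can be written with standard SQL's BETWEEN (note that BETWEEN includes both endpoints, unlike the strict < and > above):

```python
# Same two-minute window, written with BETWEEN (endpoints inclusive)
df = pd.read_sql_query('SELECT ComplaintType, CreatedDate, City '
                       'FROM data '
                       'WHERE CreatedDate BETWEEN '
                       '"2014-11-16 23:45:00" AND "2014-11-16 23:47:00"', disk_engine)
```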
#### Pull out the hour unit from timestamps with strftime

```python
df = pd.read_sql_query('SELECT CreatedDate, '
                       'strftime(\'%H\', CreatedDate) as hour, '
                       'ComplaintType '
                       'FROM data '
                       'LIMIT 5', disk_engine)
```
#### Count the number of complaints (rows) per hour with strftime, GROUP BY, and count(*)

```python
df = pd.read_sql_query('SELECT CreatedDate, '
                       'strftime(\'%H\', CreatedDate) as hour, '
                       'count(*) as `Complaints per Hour` '
                       'FROM data '
                       'GROUP BY hour', disk_engine)

py.iplot({
    'data': [go.Bar(x=df['hour'], y=df['Complaints per Hour'])],
    'layout': go.Layout(xaxis={'title': 'Hour in Day'},
                        yaxis={'title': 'Number of Complaints'})},
    filename='311/complaints per hour')
```
#### Filter noise complaints by hour

```python
df = pd.read_sql_query('SELECT CreatedDate, '
                       'strftime(\'%H\', CreatedDate) as `hour`, '
                       'count(*) as `Complaints per Hour` '
                       'FROM data '
                       'WHERE ComplaintType IN ("Noise", '
                       '"Noise - Street/Sidewalk", '
                       '"Noise - Commercial", '
                       '"Noise - Vehicle", '
                       '"Noise - Park", '
                       '"Noise - House of Worship", '
                       '"Noise - Helicopter", '
                       '"Collection Truck Noise") '
                       'GROUP BY hour', disk_engine)

py.iplot({
    'data': [go.Bar(x=df['hour'], y=df['Complaints per Hour'])],
    'layout': go.Layout(xaxis={'title': 'Hour in Day'},
                        yaxis={'title': 'Number of Complaints'},
                        title='Number of Noise Complaints in NYC by Hour in Day')},
    filename='311/noise complaints per hour')
```
#### Segregate complaints by hour

```python
complaint_traces = {}  # Each series in the graph will represent a complaint
complaint_traces['Other'] = {}

for hour in range(1, 24):
    hour_str = '0' + str(hour) if hour < 10 else str(hour)
    df = pd.read_sql_query('SELECT CreatedDate, '
                           'ComplaintType, '
                           'strftime(\'%H\', CreatedDate) as `hour`, '
                           'COUNT(*) as num_complaints '
                           'FROM data '
                           'WHERE hour = "{}" '
                           'GROUP BY ComplaintType '
                           'ORDER BY -num_complaints'.format(hour_str), disk_engine)

    complaint_traces['Other'][hour] = sum(df.num_complaints)

    # Grab the 7 most common complaints for that hour
    for i in range(7):
        complaint = df.get_value(i, 'ComplaintType')
        count = df.get_value(i, 'num_complaints')
        complaint_traces['Other'][hour] -= count
        if complaint in complaint_traces:
            complaint_traces[complaint][hour] = count
        else:
            complaint_traces[complaint] = {hour: count}

traces = []
for complaint in complaint_traces:
    traces.append({
        'x': range(25),
        'y': [complaint_traces[complaint].get(i, None) for i in range(25)],
        'name': complaint,
        'type': 'bar'
    })

py.iplot({
    'data': traces,
    'layout': {
        'barmode': 'stack',
        'xaxis': {'title': 'Hour in Day'},
        'yaxis': {'title': 'Number of Complaints'},
        'title': 'The 7 Most Common 311 Complaints by Hour in a Day'
    }}, filename='311/most common complaints by hour')
```
#### Aggregated time series

First, create a new column with timestamps rounded down to the previous 15-minute interval.

Then, GROUP BY that interval and COUNT(*).
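The rounding is plain integer arithmetic on Unix timestamps: dividing by the interval length truncates, and multiplying back recovers the start of the bucket. A quick sketch with a hypothetical timestamp:

```python
# (ts // seconds) * seconds truncates a Unix timestamp down to the start of
# its 15-minute bucket -- the same arithmetic the SQL below performs with
# strftime('%s', ...), since SQLite's integer division also truncates.
seconds = 15 * 60                        # 900 seconds per bucket
ts = 1416182825                          # hypothetical Unix timestamp
bucket_start = (ts // seconds) * seconds
print(bucket_start)                      # -> 1416182400
```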

```python
minutes = 15
seconds = minutes * 60

df = pd.read_sql_query('SELECT CreatedDate, '
                       'datetime(('
                       'strftime(\'%s\', CreatedDate) / {seconds}) * {seconds}, \'unixepoch\') interval '
                       'FROM data '
                       'LIMIT 10'.format(seconds=seconds), disk_engine)
```

```python
minutes = 15
seconds = minutes * 60

df = pd.read_sql_query('SELECT datetime(('
                       'strftime(\'%s\', CreatedDate) / {seconds}) * {seconds}, \'unixepoch\') interval, '
                       'COUNT(*) as "Complaints / interval" '
                       'FROM data '
                       'GROUP BY interval '
                       'ORDER BY interval '
                       'LIMIT 500'.format(seconds=seconds), disk_engine)

py.iplot(
    {
        'data': [{
            'x': df.interval,
            'y': df['Complaints / interval'],
            'type': 'bar'
        }],
        'layout': {
            'title': 'Number of 311 Complaints per 15 Minutes'
        }
    },
    filename='311/complaints per 15 minutes')
```
```python
hours = 24
minutes = hours * 60
seconds = minutes * 60

df = pd.read_sql_query('SELECT datetime(('
                       'strftime(\'%s\', CreatedDate) / {seconds}) * {seconds}, \'unixepoch\') interval, '
                       'COUNT(*) as "Complaints / interval" '
                       'FROM data '
                       'GROUP BY interval '
                       'ORDER BY interval'.format(seconds=seconds), disk_engine)

py.iplot(
    {
        'data': [{
            'x': df.interval,
            'y': df['Complaints / interval'],
            'type': 'bar'
        }],
        'layout': {
            'title': 'Number of 311 Complaints per Day'
        }
    },
    filename='311/complaints per day')
```

#### References

```python
from IPython.display import display, HTML

display(HTML('<link href="//fonts.googleapis.com/css?family=Open+Sans:600,400,300,200|Inconsolata|Ubuntu+Mono:400,700" rel="stylesheet" type="text/css" />'))
display(HTML('<link rel="stylesheet" type="text/css" href="http://help.plot.ly/documentation/all_static/css/ipython-notebook-custom.css">'))

! pip install git+https://github.com/plotly/publisher.git --upgrade
import publisher
publisher.publish(
    'sqlite.ipynb', 'python/big-data-analytics-with-pandas-and-sqlite/', 'Big Data Analytics with Pandas and SQLite',
    'A primer on out-of-memory analytics of large datasets with Pandas, SQLite, and IPython notebooks.',
    title='Big Data Workflow with Pandas and SQLite | Plotly',
    has_thumbnail='false',
    redirect_from='ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/',
    language='python',
    page_type='example_index',
    display_as='databases',
    order=4,
    ipynb='~notebook_demo/1')
```