
I retrieved Bloomberg data using the Excel API. In the typical fashion, the first row contains tickers in every fourth column, and the second row has the labels Date, PX_LAST, [Empty Column], Date, PX_LAST, etc. The following rows have dates and last price.

```
EHFI38 Index BBGID,          ,  , EHFI139 Index BBGID,          , ...
Date              , PX_LAST  ,  , Date                , PX_LAST  , ...
1999-12-31        , 100.0000 ,  , 1999-12-31          , 100.0000 , ...
2000-01-31        , 100.1518 ,  , 2000-01-31          , 98.6526  , ...
...
```

It seems that the proper data structure would be a DataFrame with dates as the index, and tickers as the column names.

```
 , Date      , EHFI38 Index BBGID, EHFI139 Index BBGID, EHFI139 Index BBGID, EHFI84 Index BBGID, ...
0, 1999-12-31, 100.0000          , 100.0000           , 100.0000           , 100.0000          , ...
1, 2000-01-31, 100.1518          , 98.6526            , 98.6526            , 104.7575          , ...
...
```

I wrote this code, which seems to work when I step through it, but I'm sure I'm not doing it well. I'd like to learn how to do it better.

```python
# IMPORT
import datetime

import numpy as np
import pandas as pd

# READ IN CSV FILE
# EHFI38 Index BBGID,          ,  , EHFI139 Index BBGID,          , ...
# Date              , PX_LAST  ,  , Date                , PX_LAST  , ...
# 1999-12-31        , 100.0000 ,  , 1999-12-31          , 100.0000 , ...
# 2000-01-31        , 100.1518 ,  , 2000-01-31          , 98.6526  , ...
# ...
px = pd.read_csv('Book1.csv', sep=',', parse_dates=True)

# REMOVE EMPTY COLUMNS
px = px.dropna(axis=1, how='all')

# CONVERT TO ARRAYS
M = np.array(px)
C = np.array(px.columns)

# FIX UNNAMED COLUMNS IN C
for i in np.arange(len(C) // 2) * 2:
    C[i + 1] = C[i]

# CONVERT EXCEL DATES FUNCTION (THANKS JOHN MACHIN)
def xl2pydate(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (datetime.datetime(1899, 12, 30)
            + datetime.timedelta(days=xldate + 1462 * datemode))

# CONVERT DATES THE UGLY WAY
# LOOP THROUGH 1, 2, ..., last row
for i in np.arange(len(M) - 1) + 1:
    # LOOP THROUGH 0, 2, ..., last column - 1
    for j in np.arange(len(M.T) // 2) * 2:
        # CONVERT DATE & STORE
        if isinstance(M[i, j], str) and M[i, j].isdigit():
            M[i, j] = xl2pydate(int(M[i, j]), 0)
        else:
            M[i, j] = np.nan

# RECOMBINE IN A DATAFRAME
df = pd.DataFrame(M[1:, :], columns=[C, M[0, :]])

# MERGE DATES
#  , Date      , EHFI38 Index BBGID, EHFI139 Index BBGID, ...
# 0, 1999-12-31, 100.0000          , 100.0000           , ...
# 1, 2000-01-31, 100.1518          , 98.6526            , ...
# ...
# LOOP 0, 2, ..., len - 1
for i in np.arange(len(df.T) // 2) * 2:
    # GET DATE, PX_LAST FOR A SINGLE TICKER
    b = df[df.columns[i:(i + 2)]]
    # CHANGE COLUMN NAMES TO Date, [TICKER]
    b.columns = [df.columns[i][1], df.columns[i][0]]
    # COMBINE
    if i == 0:
        a = b
    else:
        a = pd.merge(a.dropna(), b.dropna(), on='Date', how='outer')
```

1 Answer


You can probably do most of what you want in native pandas. It has functions for Excel file IO (`pd.read_excel`) that will probably take care of much of the date munging.
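For instance, something along these lines should pick up the two header rows as a column MultiIndex and parse the cells without any manual Excel-serial conversion. This is only a sketch: `Book1.xlsx` and the toy rows below are stand-ins for the real export, and the write step exists only to make the example self-contained.

```python
import pandas as pd

# A toy workbook in the Bloomberg layout (row 0: tickers, row 1: labels).
# In practice the exported file already exists; this write is illustrative.
rows = [
    ['EHFI38 Index BBGID', None,      None, 'EHFI139 Index BBGID', None],
    ['Date',               'PX_LAST', None, 'Date',                'PX_LAST'],
    ['1999-12-31',         100.0000,  None, '1999-12-31',          100.0000],
    ['2000-01-31',         100.1518,  None, '2000-01-31',          98.6526],
]
pd.DataFrame(rows).to_excel('Book1.xlsx', header=False, index=False)

# header=[0, 1] turns the ticker row and the Date/PX_LAST row into a
# two-level column index; fully empty spacer columns can then be dropped.
px = pd.read_excel('Book1.xlsx', header=[0, 1])
px = px.dropna(axis=1, how='all')
```

This removes the whole `xl2pydate` detour, since the reader never sees raw Excel date serials.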

If you want to carry on with the intermediate .csv file, the following should help.

Because you have an empty column and 3 commas between EHFI38 Index BBGID and EHFI139 Index BBGID, your data are in a slightly strange format.

```python
import pandas as pd
from io import StringIO

data = """\
EHFI38 Index BBGID, , , EHFI139 Index BBGID,
Date , PX_LAST , , Date , PX_LAST
1999-12-31 , 100.0000 , , 1999-12-31 , 100.0000
2000-01-31 , 100.1518 , , 2000-01-31 , 98.6526
"""
df = pd.read_csv(StringIO(data), header=[0, 1], parse_dates=True)
df
```

There should be a way to sort out the strange indexing but I cannot easily work it out. Try searching for pandas multi-index and hierarchical data.
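One possible way to collapse it into the dates-by-tickers shape is to walk the columns in (Date, PX_LAST) pairs and let DataFrame alignment on the date index do the joining. This is a sketch against the toy data above, not the real file: the three-column stride and the `strip`/`to_numeric` cleanup are assumptions about how the export pads its fields.

```python
import pandas as pd
from io import StringIO

data = """\
EHFI38 Index BBGID, , , EHFI139 Index BBGID,
Date , PX_LAST , , Date , PX_LAST
1999-12-31 , 100.0000 , , 1999-12-31 , 100.0000
2000-01-31 , 100.1518 , , 2000-01-31 , 98.6526
"""
df = pd.read_csv(StringIO(data), header=[0, 1], skipinitialspace=True)

# For each ticker, build a Series of prices indexed by date, then let
# DataFrame construction align the tickers side by side on the dates.
pieces = {}
for i in range(0, df.shape[1], 3):   # stride 3: Date, PX_LAST, empty spacer
    ticker = df.columns[i][0].strip()
    dates = pd.to_datetime(df.iloc[:, i].str.strip())
    prices = pd.to_numeric(df.iloc[:, i + 1], errors='coerce')
    pieces[ticker] = pd.Series(prices.values, index=dates)
tidy = pd.DataFrame(pieces)
```

This replaces the pairwise `pd.merge` loop in the question with a single alignment step, which also copes naturally with tickers whose date ranges differ (missing months simply become NaN).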
