Jupyter notebooks finance demo


The goal of this demo is to show how Jupyter/IPython notebooks can be used in conjunction with different data sources (e.g. Quandl) and useful Python libraries (e.g. pandas) to do financial analysis. Along the way it gradually introduces new and useful functions for the newer Python user.

Since oil/equity correlation has been all the talk these days (this demo was written in January 2016), let's take a look at it!

In [1]:
# PythonAnywhere comes pre-installed with Quandl, so you just need to import it
import Quandl

# first, go to quandl.com and search for the ticker symbol that you want
# let's say we want to look at (continuous) front month crude vs e-mini S&Ps

cl = Quandl.get('CHRIS/CME_CL1')
es = Quandl.get('CHRIS/CME_ES1')
In [2]:
# Quandl.get() returns a pandas DataFrame, so you can use all the pandas goodies
# For example, tail() shows the most recent rows, just like the Unix tail binary!
es.tail()
Out[2]:
Open High Low Last Change Settle Volume Open Interest
Date
2016-03-02 1976.25 1984.75 1966.25 1982.25 5.5 1983.5 1814091 3008297
2016-03-03 1981.75 1992.50 1974.75 1991.50 7.0 1990.5 1541249 3008594
2016-03-04 1991.00 2007.50 1984.00 1994.75 4.5 1995.0 2232860 3018684
2016-03-07 1994.50 2004.50 1984.50 1999.75 4.0 1999.0 1623905 3012243
2016-03-08 1999.00 2000.25 1976.00 1982.50 18.0 1981.0 1928239 3005808
In [3]:
# you can also get statistics
es.describe()
Out[3]:
Open High Low Last Change Settle Volume Open Interest
count 4723.000000 4736.000000 4738.000000 4738.000000 512.000000 4738.000000 4738.000000 4738.000000
mean 1313.871427 1325.847867 1305.396581 1316.367885 13.185547 1316.353480 1076569.155129 1451989.689743
std 310.180613 312.158070 311.821112 312.201487 12.239507 312.167612 956268.606849 1159894.330244
min 674.750000 694.750000 665.750000 676.000000 0.250000 676.000000 0.000000 0.000000
25% 1109.250000 1117.000000 1102.500000 1109.750000 4.000000 1109.750000 182620.250000 193737.500000
50% 1268.500000 1277.875000 1259.500000 1269.500000 9.500000 1269.500000 857774.500000 1351206.000000
75% 1437.000000 1449.500000 1429.000000 1438.687500 19.562500 1438.687500 1728705.250000 2678325.750000
max 2129.250000 2134.000000 2122.750000 2128.750000 100.250000 2128.000000 6285917.000000 3594453.000000

But wait!

What do we have here? Did you notice that the count is different for the different columns?

Let's take a look at what the missing values are:

In [4]:
# select the rows where Open has missing data points
es[es['Open'].isnull()].head()
Out[4]:
Open High Low Last Change Settle Volume Open Interest
Date
2015-11-17 NaN 2063.50 2041.50 2049.75 1.00 2049.0 1610071 2803541
2015-11-27 NaN 2098.25 2081.50 2090.50 2.00 2090.0 653079 2761335
2015-12-01 NaN 2101.50 2083.50 2099.25 20.25 2100.0 1479676 2764688
2015-12-02 NaN 2105.00 2075.00 2083.50 18.50 2081.5 1709808 2759024
2015-12-09 NaN 2079.75 2034.25 2045.25 16.75 2042.0 2660114 2624311

Hmmm. Time to spend money and buy good data?
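
Or we could patch the holes ourselves. pandas has the standard remedies built in; here is a minimal sketch of both (we won't actually use either, for reasons below):

# forward-fill: carry the previous day's value into the gap
es_filled = es.fillna(method='ffill')
# or simply drop the rows where Open is missing
es_dropped = es.dropna(subset=['Open'])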

Eh. We really only need the daily close here anyway (i.e. the Settle column). Let's zoom in on that.

In [5]:
es_close = es.Settle  # WHAT IS THIS SORCERY? Attribute access!
es_close.head()
Out[5]:
Date
1997-09-09    934
1997-09-10    915
1997-09-11    908
1997-09-12    924
1997-09-15    922
Name: Settle, dtype: float64
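
One caveat on that attribute-access sorcery: it only works when the column name is a valid Python identifier. A column like 'Open Interest' (note the space) still needs bracket access:

es['Open Interest'].head()  # es.Open Interest would be a SyntaxError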
In [6]:
print(type(es))
print(type(es_close))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Oh, OK. A column of a DataFrame is a Series (also a pandas object).

Note that it is still linked to the DataFrame (i.e. changing the Series will change the DataFrame as well).
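
A minimal sketch to convince yourself of that linkage, using toy values rather than our market data (this is the behaviour of pandas versions of this era, before copy-on-write):

import pandas as pd

toy = pd.DataFrame({'Settle': [1.0, 2.0, 3.0]})
col = toy.Settle           # same underlying data, not a copy
col.iloc[0] = 99.0         # mutate through the Series...
print(toy.Settle.iloc[0])  # ...and the DataFrame sees it: 99.0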

In [7]:
# Okay- time to quickly check the crude time series as well
cl.describe()
Out[7]:
Open High Low Last Change Settle Volume Open Interest
count 8273.000000 8275.000000 8275.000000 8275.000000 518.000000 8275.000000 8275.000000 8274.000000
mean 41.777084 42.333422 41.186375 41.777544 1.025753 41.777361 112520.504411 125494.688301
std 29.568643 29.946338 29.134537 29.559547 0.871786 29.559366 123411.769025 110104.481053
min 10.000000 11.020000 9.750000 10.420000 0.010000 10.420000 0.000000 0.000000
25% 19.600000 19.790000 19.400000 19.610000 0.400000 19.610000 30618.000000 46379.500000
50% 28.300000 28.650000 27.970000 28.320000 0.810000 28.320000 58659.000000 87553.000000
75% 61.290000 62.150000 60.485000 61.365000 1.470000 61.365000 166439.000000 176041.500000
max 145.190000 147.270000 143.220000 145.290000 7.540000 145.290000 824242.000000 608831.000000
In [8]:
# Hmm. That's a lot more rows. Does the crude time series start earlier than the e-mini's?
cl.head()
Out[8]:
Open High Low Last Change Settle Volume Open Interest
Date
1983-03-30 29.01 29.56 29.01 29.40 NaN 29.40 949 470
1983-03-31 29.40 29.60 29.25 29.29 NaN 29.29 521 523
1983-04-04 29.30 29.70 29.29 29.44 NaN 29.44 156 583
1983-04-05 29.50 29.80 29.50 29.71 NaN 29.71 175 623
1983-04-06 29.90 29.92 29.65 29.90 NaN 29.90 392 640
In [9]:
earliest_es_date = es.index[0]

# at first glance, you could just do
cl[earliest_es_date:].head()
Out[9]:
Open High Low Last Change Settle Volume Open Interest
Date
1997-09-09 19.43 19.61 19.37 19.42 NaN 19.42 32299 88070
1997-09-10 19.57 19.57 19.35 19.42 NaN 19.42 41858 86872
1997-09-11 19.49 19.72 19.30 19.37 NaN 19.37 52342 80434
1997-09-12 19.42 19.47 19.27 19.32 NaN 19.32 28540 80440
1997-09-15 19.29 19.38 19.23 19.27 NaN 19.27 31610 76590
In [10]:
# but just in case there is no exactly matching date, we can take the nearest position instead
# (searchsorted returns where the date would slot into the sorted index,
#  i.e. the position of the first row at or after it)
closest_row = cl.index.searchsorted(earliest_es_date)
cl_close = cl.iloc[closest_row:].Settle
cl_close.head()
Out[10]:
Date
1997-09-09    19.42
1997-09-10    19.42
1997-09-11    19.37
1997-09-12    19.32
1997-09-15    19.27
Name: Settle, dtype: float64
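
An alternative sketch for the same nearest-date idea: Index.asof gives you the most recent date at or before the one you ask for (whereas searchsorted finds the first one at or after):

nearest_date = cl.index.asof(earliest_es_date)
cl.loc[nearest_date:, 'Settle'].head()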
In [11]:
# ok lets just plot this guy
import matplotlib
import matplotlib.pyplot as plt
# use new pretty plots
matplotlib.style.use('ggplot')
# get the notebook to show graphs inline
# (%pylab also pulls numpy and matplotlib names into the namespace;
#  %matplotlib inline is the lighter-weight alternative)
%pylab inline

es_close.plot()
Populating the interactive namespace from numpy and matplotlib
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1ef10cf28>

That was satisfying: our all too familiar S&P chart. Let's try to plot both the S&P and oil in the same graph.

In [12]:
plt.figure()
es_close.plot()
cl_close.plot()
plt.yscale('log')

Meh. Okay. LET'S ACTUALLY DO SOME MATH!

(... ahem. stats)

In [13]:
es['Settle'].corr(cl['Settle'])
Out[13]:
0.34046181767351186

Okay... MOAR GRAPHS, I hear you say.

In [14]:
import pandas as pd
pd.rolling_corr(es_close, cl_close, window=252).dropna()
# why 252? because that's the number of trading days in a year
Out[14]:
Date
2014-11-21   -0.369646
2014-11-24   -0.386641
2014-11-25   -0.404232
2014-11-26   -0.421947
2014-11-28   -0.441608
2014-12-01   -0.457250
2014-12-02   -0.473815
2014-12-03   -0.488969
2014-12-04   -0.502630
2014-12-05   -0.516329
2014-12-08   -0.526947
2014-12-09   -0.536849
2014-12-10   -0.541803
2014-12-11   -0.548337
2014-12-12   -0.549632
2014-12-15   -0.549979
2014-12-16   -0.546815
2014-12-17   -0.551038
2014-12-18   -0.560348
2014-12-19   -0.569132
2014-12-22   -0.577833
2014-12-23   -0.586649
2014-12-24   -0.594898
2014-12-26   -0.603125
2014-12-29   -0.610925
2014-12-30   -0.617783
2014-12-31   -0.622196
2015-01-02   -0.626952
2015-01-05   -0.628422
2015-01-06   -0.627138
                ...   
2016-01-26    0.613887
2016-01-27    0.620853
2016-01-28    0.626633
2016-01-29    0.630797
2016-02-01    0.636777
2016-02-02    0.644177
2016-02-03    0.650006
2016-02-04    0.655308
2016-02-05    0.661402
2016-02-08    0.668422
2016-02-09    0.676663
2016-02-10    0.684008
2016-02-11    0.692094
2016-02-12    0.697270
2016-02-16    0.701433
2016-02-17    0.703887
2016-02-18    0.706600
2016-02-19    0.709578
2016-02-22    0.711551
2016-02-23    0.714437
2016-02-24    0.717001
2016-02-25    0.718281
2016-02-26    0.720613
2016-02-29    0.722511
2016-03-01    0.723214
2016-03-02    0.723336
2016-03-03    0.722921
2016-03-04    0.722644
2016-03-07    0.722566
2016-03-08    0.722830
Name: Settle, dtype: float64
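
A quick aside: pd.rolling_corr was the idiom when this was written. pandas 0.18+ deprecates it in favor of the rolling() method, which computes the same numbers:

es_close.rolling(window=252).corr(cl_close).dropna()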

That's weird. You'd expect the first year of data to drop out (the rolling window needs 252 observations before it produces a value), so the series should start around September 1998. Instead it starts in late 2014...

In [15]:
print(len(cl_close))
print(len(es_close))
4646
4738
In [16]:
merged = pd.concat({'es': es_close, 'cl': cl_close}, axis=1)
# maybe this is the culprit?
merged[merged['cl'].isnull()].head()
Out[16]:
cl es
Date
1997-11-27 NaN 959.50
1997-11-28 NaN 955.00
1998-01-19 NaN 972.25
1998-02-16 NaN 1019.00
1998-05-25 NaN 1116.50
In [17]:
merged.dropna(how='any', inplace=True)
# BAD DATA BEGONE!
merged[merged['cl'].isnull()]
Out[17]:
cl es
Date
In [18]:
pd.rolling_corr(merged.es, merged.cl, window=252).dropna().plot()
plt.axhline(0, color='k')
Out[18]:
<matplotlib.lines.Line2D at 0x7fe1e2e5fef0>

Brilliant! But this is still quite inconclusive in terms of equity/crude correlation. Why? Well, we are forgetting about one HUGE HUGE factor affecting correlation here.

In [19]:
# D'oh
import numpy as np
print('Autocorrelation for a random series is {:.3f}'.format(
    pd.Series(np.random.randn(100000)).autocorr())
)
print('But, autocorrelation for S&P is {:.3f}'.format(es_close.autocorr()))
Autocorrelation for a random series is -0.003
But, autocorrelation for S&P is 0.999

So that's why we should look at %-change instead of $-close or $-change: price levels are so strongly autocorrelated (they trend) that two trending series can show a high correlation whether or not they actually have anything to do with each other. Daily returns are much closer to stationary.
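
To see just how badly trending series can fool you, here is a minimal self-contained sketch: two independent random walks will routinely show a large correlation in levels, yet near-zero correlation in their differences.

import numpy as np
import pandas as pd

walk_a = pd.Series(np.random.randn(5000).cumsum())  # a random walk
walk_b = pd.Series(np.random.randn(5000).cumsum())  # an independent random walk
print(walk_a.corr(walk_b))                # often far from 0, despite independence
print(walk_a.diff().corr(walk_b.diff()))  # ~0, as independence would suggest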

In [20]:
daily_returns = merged.pct_change()
rolling_correlation = pd.rolling_corr(daily_returns.es, daily_returns.cl, window=252).dropna()
rolling_correlation.plot()
plt.axhline(0, color='k')
plt.title('Rolling 1 yr correlation between Oil and S&P')
Out[20]:
<matplotlib.text.Text at 0x7fe1e2c89ba8>

Great. Now this is much more interesting. It is quite clear that the period of higher oil/equity correlation came after 2009. Qualitatively, we know this was the case (if you worked in finance back then): previously, extremely high oil prices (over $100/bbl) were seen as a drag on the economy. Nowadays, extremely low oil prices are seen as an indication of weakness in global demand, with oil, equities, credit, etc. all selling off hand in hand when there is risk-off sentiment.

Let's plot some pretty graphs to show what we know qualitatively, and make sure our memory is correct.

In [21]:
# vertically split into two subplots, and align the x-axis
from matplotlib.ticker import MaxNLocator  # make this explicit instead of relying on %pylab
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
fig.suptitle('Checking our intuition about correlation', fontsize=14, fontweight='bold')
# make space for the title
fig.subplots_adjust(top=0.85)

rolling_correlation.plot(ax=ax1)
ax1.set_title('Rolling correlation of WTI returns vs S&P returns')
ax1.axhline(0, color='k')
ax1.tick_params(
    which='both',  # both major and minor ticks
    bottom='off', top='off', right='off',
    labelbottom='off'  # labels along the bottom edge are off
)

cl_close.plot(ax=ax2)
ax2.set_title('Price of front month WTI crude')
ax2.tick_params(which='both', top='off', right='off')
ax2.tick_params(which='minor', bottom='off')
ax2.yaxis.set_major_locator(MaxNLocator(5))  # cap the number of major y-axis ticks

Alright, fine. So we can distinctly see the regime change starting from the European debt crisis, after oil came back down from its $150/bbl highs. Traders no longer saw high oil prices as a drag on the economy, and instead focused their attention on global demand as we entered a period of slow growth.

Also, for all the recent talk about equity/oil correlation, we actually saw higher correlations in the 2011-2013 period.
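
You don't have to squint at the graph to verify that; a quick check against the series we already computed:

print(rolling_correlation.loc['2011':'2013'].max())  # peak of the earlier regime
print(rolling_correlation.iloc[-1])                  # where we are today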

So this is an interesting observation. But as data scientists, we must test this hypothesis! If this recent spike in equity/crude correlation is really driven by risk-off sentiment, we should also see much stronger cross-asset correlation in other risk assets. Stay tuned for the next part of this series!
