Today's upgrade - postgres price drop, mysql scaling improvements

Nothing earth-shattering to report today, but some good news:

Postgres is cheaper

It's been over a year since we first tentatively launched our postgres service, and we've found that we're able to optimise the service so that it scales better than we thought, so we're pleased to pass on the cost savings to you.

Postgres is now $7/month instead of $15. That price will apply to all new plans and upgrades, and we'll also start applying the new price to existing users from their next bill. So that's more moolah in your pockets, dear users; don't spend it all at once ;)

MySQL infrastructure changes

These changes won't really be very visible from the user point of view, so this isn't very interesting to you, beloved readers, per se, but it took us loads of time and effort so we have to say something to make it all feel worthwhile and satisfy our own egos. Anyways, we made some changes to the way we shard users amongst MySQL servers in our clusters, which means it's now much easier for us to add extra MySQL capacity whenever we want to.

For the curious, did you know that (depending on your OS and config), filesystem limits on the number of hard links in a single directory might limit you to a maximum of 32,000 databases on a single mysql instance? Not that we ever came anywhere near that, but still, good to know. #tipsforpaasproviders.
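A rough sketch of how you might check your headroom against that limit, assuming each database lives in its own subdirectory of the MySQL data directory and taking 32,000 as the cap. The function name and the default limit are illustrative choices, not anything provided by MySQL, and the demo runs against a throwaway directory rather than a real datadir:

```python
import os
import tempfile

def database_dir_headroom(datadir, link_limit=32000):
    """Return (subdirectory_count, remaining) for a data directory.

    link_limit is an assumed per-directory cap; check your own filesystem.
    """
    subdirs = sum(
        1 for entry in os.scandir(datadir)
        if entry.is_dir(follow_symlinks=False)
    )
    return subdirs, link_limit - subdirs

# Demo: a temporary directory stands in for a MySQL data directory.
with tempfile.TemporaryDirectory() as datadir:
    for name in ("alice_default", "bob_blog", "carol_shop"):
        os.mkdir(os.path.join(datadir, name))
    count, remaining = database_dir_headroom(datadir)

print(count, remaining)
```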

Python.org console now Python 3.5

Our live consoles on the python.org front page are now Python 3.5 instead of 3.4. We've also made them "regular" Python consoles instead of IPython (which was always a slightly weird decision, even though IPython is all awesome and everything, but a regular Python console is what new users are most likely to see, and ours do have tab-completion switched on you know?)

Onwards and upwards folks! In our next iteration we hope to be able to release a first beta of an API for PythonAnywhere. Watch this space :)


Scaling a startup from side project to 20 million hits/month - an interview with railwayapi.com creator Kaustubh

We recently wished farewell to a customer who had been with us for about 18 months, during which time he saw some incredible growth in what was originally just a side project. We spoke to him about how he found the experience of scaling on PythonAnywhere, and why he decided to move on.

railwayapi.com stats
Project started: October 2014
Requests: 20 million / month
Active users: 1000+

What's your background? How long have you been programming?

I am currently pursuing a Bachelor's in Computer Science & Engineering. I have been programming since my school days, but RailwayAPI was the first substantial project I did.

Can you describe what railwayapi.com does? What first gave you the idea to build a site like this?

It all started with an idea to build an app which let Train travelers in India find the best available route between two stations. It's very difficult to get confirmed bookings in India and so I thought it would be great if there was an app which helps people break their journey up, using multiple trains to reach their destination in minimum time.

While working on that idea I realized that I needed train data to make such a thing possible. There wasn't a reliable API for Indian railways and I realized that several developers would be facing the same problem. And hence RailwayAPI was born.

It is a collection of APIs which let developers access all kinds of Railway data like Seat Availability, Train Route, Live Train status etc in easy-to-use JSON formats.

Why did you choose Python and Flask? Are you happy with your choice?

Python was the language I already knew well and there were several great libraries available for it so the decision was very easy to make for me.

Since the exposed part of the API was just going to be URL views which capture GET requests and return the corresponding JSON, Flask was the most suitable with its minimality which is enough for what I was doing.

What made you choose PythonAnywhere?

This was my first webapp and initially I had very little experience in setting up such an environment. After some research I stumbled upon PythonAnywhere which was extremely easy to set up, with a nice and clean UI. It also lets you instantly scale up your app by just sliding the number of workers you are going to need. After that I needn't go anywhere else!

Your site quickly became one of our busiest sites -- when did you realise it was getting big? How was the experience of scaling up on PythonAnywhere?

I realized that it was getting big when one of the users complained about load balancer errors they were getting. But it wasn't an issue, as I quickly scaled up the number of workers and the site was back to normal functioning in seconds.

What kind of traffic did you have last month, for example?

The site got about 19.6 million requests last month! And for the month before that it was about 16 million requests.

You've now decided to move on -- why is that, where did you move to, and was it easy to make the transition?

Yes, I have moved to a VPS now at Digital Ocean. It wasn't an easy decision to make for me and it was done only after considerable thought. I loved PythonAnywhere and I continue to love it but I needed more flexible configurations so a VPS was required at the current stage.

It was quite difficult to move to a VPS because I had grown accustomed to the ease of use at PythonAnywhere, where everything is set up for you and you can just focus on writing your code instead of writing configurations.

In general, what do you think are the pros and cons of PythonAnywhere?

I think I have already mentioned a lot of pros about PythonAnywhere. But in short, if you just want to focus on building your app rather than setting up an environment, which is how it ideally should be, then there are very few places like PythonAnywhere out there. Also, developers should know that setting up an environment is not a one-off process; it requires constant monitoring and changes so that any modification in the code doesn't break the configuration and vice versa. It's quite a time-consuming process, but PythonAnywhere takes care of all of that.

I would have liked it if PA supported asynchronous workers. In fact this was the main reason for moving out because my application was Network I/O bound and async workers were better suited for such a task, and that wasn't available here, at least for now.
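A minimal sketch of why async workers suit network-bound work, using only the standard library. The endpoint names and delays are made up for illustration and this is not RailwayAPI's code; `asyncio.sleep` stands in for a slow upstream call:

```python
import asyncio
import time

async def fetch_upstream(name, delay):
    # asyncio.sleep stands in for waiting on a slow network response.
    await asyncio.sleep(delay)
    return name

async def handle_requests():
    # All three simulated requests wait on the network at the same time.
    return await asyncio.gather(
        fetch_upstream("seat-availability", 0.2),
        fetch_upstream("train-route", 0.2),
        fetch_upstream("live-status", 0.2),
    )

start = time.monotonic()
results = asyncio.run(handle_requests())
elapsed = time.monotonic() - start
print(results, round(elapsed, 1))
```

Handled one after another, the three waits would add up to roughly 0.6 seconds; overlapped in a single worker they take about as long as the slowest one.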

Also it would be great if Redis and Web Sockets were also supported at some point.

It sounds like the site was a big success. What's next for railwayapi.com?

The goal is to give developers access to Indian Railways data without hassle. There are some other ideas I am working on like making available Train and Seat Availability prediction/analytics through API.

Do you have any advice for other aspiring web developers?

I am still learning a lot myself!

Although I could say from my experience that the most important thing I learned while developing the API was that engineering your app to be scalable is one of the most challenging tasks, which is often overlooked by people in the beginning.

I had to re-code several of the modules several times because they were 'hacky' code, and as the pressure on the site grew they started to break. So it would be good if developers also think about how their app would respond to such scenarios in the future.

Thanks again Kaustubh, and best of luck with the future of the project!


Anatomy of a bug

We recently fixed a problem in our website hosting code that was causing weird errors under very specific and rare circumstances. The problem had been there for several years, and -- while we knew that odd stuff happened every now and then -- we'd never been able to reproduce it reliably enough to debug it. But a lucky coincidence of circumstances, when two people tripped over the bug in quick succession, clarified the issues and let us work out the solution. It's an interesting tale, so we thought we'd share it.

The problem

Every now and then -- say, once every six months -- someone would report that they had a Flask website which would inexplicably start returning CSRF errors for every request. The apps in question generally had CSRF protection disabled. The error messages were very generic -- there was no indication of where in the stack they came from. And the problem would go away if the web app was reloaded.

There was no obvious pattern that we could see, apart from the fact that it was always Flask apps. No bugs had been reported against Flask's CSRF protection plugins.

We couldn't for the life of us work out what was causing the problem. Our normal request-processing stack (two layers of nginx for loadbalancing and basic HTTP(S), then uWSGI) doesn't do any CSRF-related stuff, and it sends all headers and requests through unchanged. But no-one else was seeing this problem on other hosts, so it seemed likely that there was something odd going on with our service specifically.

The lucky coincidence

Two things happened: we recently upgraded the version of Django that we use to host the PythonAnywhere website itself to 1.9 (from an embarrassingly old version that I won't mention). And then, two people reported the weird CSRF problems in quick succession. One was using Flask, but the other was using Django. So maybe the problem wasn't Flask-specific.

Even more interestingly, they were both getting exactly the same error message -- and it was slightly different to the one that had been coming back before. Furthermore it was one that, on Googling, we could only find in newer versions of the Django codebase. This sounded suspiciously like it was coming from somewhere in our infrastructure -- it had changed when we upgraded, and it was a Django message even for the Flask app. But why would our code -- the Django code that makes up the PythonAnywhere website -- be called when a request was being made to one of our customers' sites?

Website wakeup

PythonAnywhere's web hosting platform is made up of a large number of web servers. Each web server is capable of serving any of the tens of thousands of websites we host. But, of course, the code for only a subset of those sites is running on any given server at any given time. We load-balance across the web servers, sending the traffic for each hosted site to a specific server. If we add another web server when things are busy, it's a simple job to tell the loadbalancer to spread the traffic out differently, so that the new server takes over handling some of the sites.

If one of our webservers gets a request for a website that isn't already running on it, the request gets redirected internally to a view on a PythonAnywhere web app, which checks to make sure that the site is one that we're actually meant to be hosting (rather than some random domain that someone happens to have pointed at one of our IP addresses). If it is, it starts the site up and redirects the incoming request back round again internally to the original URL. Once that's happened, the requester gets the response from the freshly-started website.

This means that when we add a new server, which will be running no websites initially, it will start them up one by one as it receives the first request for each from the loadbalancer.
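The wake-up flow described above can be sketched in a few lines of plain Python. This is a toy model only: the class, the set of hosted sites and the status codes are hypothetical stand-ins for what the real nginx, uWSGI and Django stack does:

```python
# Hypothetical stand-in for the list of sites the platform really hosts.
HOSTED_SITES = {"alice.pythonanywhere.com", "www.example-shop.com"}

class WebServer:
    def __init__(self):
        # Sites whose code is currently loaded on this server.
        self.running = set()

    def wake_up(self, host):
        """Start a site if we host it; return True if it is now running."""
        if host not in HOSTED_SITES:
            return False  # some random domain pointed at our IP address
        self.running.add(host)
        return True

    def handle(self, host, path):
        if host not in self.running:
            # Internal redirect to the wake-up view, then back to the
            # original URL once the site has been started.
            if not self.wake_up(host):
                return 404, "not hosted here"
        return 200, "response from {} for {}".format(host, path)

server = WebServer()  # a freshly added server runs no sites
first = server.handle("alice.pythonanywhere.com", "/")
second = server.handle("alice.pythonanywhere.com", "/about")
stranger = server.handle("unknown.example.org", "/")
print(first, second, stranger, sorted(server.running))
```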

@csrf_exempt

Once we'd realised that the error that was coming back for our customers' sites was from Django, and had changed when we upgraded Django, it was clear that this "wake up" view in our own code was the only possible culprit. It's the only bit of our code that would ever be run in response to a request for a customer site.

We took a look at the view -- and, it turned out, it wasn't marked as CSRF exempt. Why does that matter?

We use CSRF protection heavily across our site (to mitigate the risk of a malicious person putting up a website that posted a "delete my web app" request to PythonAnywhere when someone visited it), and CSRF protection is the default -- each view must explicitly opt out.

Now, "wake up" requests to this view would come in without any CSRF tokens or other related data. CSRF is site-specific, and the wake-up requests were not really requests for our site at all -- they were requests meant for another site that had been temporarily passed over to us so that we could start up the code to handle them.

So it was definitely a bug that the view wasn't marked CSRF exempt.

Importantly, though, CSRF protection is only relevant to state-changing requests such as POSTs. GET requests don't need it, as they are (so long as you're doing things properly) not state-changing operations.

So we'd worked out part of the problem -- if a website wasn't running on a given server, and a request for that site was sent to the server, then only a GET request would start the site correctly. A POST request would get a spurious CSRF error, and the site would not be started up.

But our customers who had reported the problem were saying that when their websites got into the CSRF-error state, they'd stay in that state regardless of how much traffic they got. And also, what was the Flask connection? While we now had one person seeing the problem with Django, every other report had been against Flask.

Finally, one of our customers who was having the problem told us something that made it all clear. He explained that his site was collecting data from other computers elsewhere on the Internet, which were only ever POSTing to it. He had discovered that when it got into the state when it was always responding with CSRF errors, he didn't need to reload it to fix it. He just needed to log in to the admin page.

Suddenly it all became clear. The lack of the @csrf_exempt decorator on our "wake up the website" view meant that it would always respond to a POST request with a CSRF error. But if it received a GET request (for example, a hit from a browser on the admin login page, or a manual reload from the "Web" tab on PythonAnywhere), it would work properly and start the website. Once that had happened, then all further POST requests would work.
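Here is a toy model of that behaviour in plain Python, with no Django involved. The `dispatch` function and the two wake-up views are invented for illustration; the only detail borrowed from Django is that its real `csrf_exempt` decorator (in `django.views.decorators.csrf`) also works by setting a `csrf_exempt` attribute:

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}

def csrf_exempt(view):
    # Mark the view so the CSRF check below skips it.
    view.csrf_exempt = True
    return view

def dispatch(view, method, has_csrf_token=False):
    """Toy CSRF check: unsafe methods need a token unless the view is exempt."""
    exempt = getattr(view, "csrf_exempt", False)
    if method not in SAFE_METHODS and not exempt and not has_csrf_token:
        return 403  # the view never runs, so the site is never started
    return view()

running = set()

def wake_up_unfixed():
    running.add("site")
    return 200

@csrf_exempt
def wake_up_fixed():
    running.add("site")
    return 200

post_before_fix = dispatch(wake_up_unfixed, "POST")
still_asleep = "site" not in running
get_before_fix = dispatch(wake_up_unfixed, "GET")
running.clear()
post_after_fix = dispatch(wake_up_fixed, "POST")
print(post_before_fix, still_asleep, get_before_fix, post_after_fix)
```

The token-less POST is rejected and leaves the site asleep, the GET wakes it, and with the decorator in place the POST wakes it too.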

Types of websites

This taught us something interesting about the kind of sites people build. There are two main kinds of sites on PythonAnywhere (and we think on the Internet at large) -- human-readable and computer-readable.

Human-readable sites get a lot of GET requests as people hit them and browse around looking at pictures of cats. They get the occasional POST as someone comments on a cat picture, or sends a message to the cat or its owner, but most hits are GETs.

Computer-readable sites are basically web APIs. They also get quite a few GETs. But sometimes, they're designed to get mostly POSTs. The data-collection site that our customer was building, for example. Or sites that exist entirely to receive webhook messages.

Most of the sites on PythonAnywhere are human-readable, but there are a fair number of computer-readable ones. Of the computer-readable ones, most get a decent number of GET requests. A tiny proportion are APIs that get predominantly POSTs. So while every site on PythonAnywhere would have the problem where it couldn't be woken up by a POST request, only a very small subset of them would be seriously affected -- the ones that got almost nothing but POSTs.

The problem was only affecting that tiny proportion of sites. But the fact that their sites were only handling POSTs wasn't something that was obviously relevant information to the customers who were reporting the problem, so they didn't mention it. And because we didn't know what the cause was, we never knew that it was something we should ask about.

And the Flask connection? Well, if you're building a website that handles lots of POST requests, and you don't need to worry about user management, administration, and user interface design, which framework are you most likely to pick? Flask is almost certainly the most popular framework for people building simple web APIs. So it wasn't quite a red herring that the problem seemed originally to be Flask-specific -- there was a very strong correlation there -- but that correlation really did mislead us into thinking it was some weird interaction with Flask specifically.

The fix

Once we knew what the problem was, it was trivial to fix -- just adding a @csrf_exempt on the correct view -- and we had it patched within an hour, even including testing.

But it was a long and confusing year puzzling over the issue! Hopefully the explanation gives some flavour of the sleuthing process.


The PythonAnywhere newsletter, April 2016

Spring is here, we're filled with good intentions, and here is another newsletter, almost exactly a month after the previous one, which is 800% better than our previous interval.

So other than good intentions, clock changes, and eating too much chocolate, what's been going on? Plenty of stuff it turns out, and it's all for you, dear users:

PythonAnywhere now supports Python 3.5

Python 3.5 has been out of beta since last summer, and in the end we figured that if we wanted to preserve any self-respect whatsoever, it was time to make it available on PythonAnywhere.

You can now use Python 3.5 in web apps, consoles and scheduled tasks. And IPython Notebooks too if you're a paying user!

Python 3.5 in and of itself isn't that exciting -- there's a bunch of syntactic sugar for asyncio (which we don't support for web apps however), there's a matrix multiplication operator, @, which might be useful for a niche audience, and a few nice bugfixes and extensions in pathlib and subprocess and elsewhere:

Here are the official Python 3.5 release notes.
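For a quick taste of the two syntax additions mentioned above, here is a small standard-library sketch. The `Matrix` class is a made-up minimal example (in practice you would more likely use the operator with a numerical library), and the coroutine just doubles a number:

```python
import asyncio

class Matrix:
    """Tiny 2-D matrix, just enough to show the @ operator."""
    def __init__(self, rows):
        self.rows = rows

    def __matmul__(self, other):
        # Called for "self @ other".
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])

product = Matrix([[1, 2], [3, 4]]) @ Matrix([[5, 6], [7, 8]])
print(product.rows)  # [[19, 22], [43, 50]]

async def double(x):
    # "async def" and "await" are the new coroutine syntax.
    await asyncio.sleep(0)
    return x * 2

loop = asyncio.new_event_loop()
answer = loop.run_until_complete(double(21))
loop.close()
print(answer)  # 42
```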

But one of the nice side-effects was that we got to install a fresh stack of packages, so the default version of Django is 1.9 in Python 3.5, and it also has the latest version of requests, and so on. More info on the "batteries included" page.

But you should probably still use a virtualenv for your web apps!

The inside scoop on our forums

Useful tips

New modules

Although you can install Python packages on PythonAnywhere yourself, we like to make sure that we have plenty of batteries included. Here's what we've added since the last newsletter:

Python 3.5

It's new, so we've added all of the packages that we previously supported for Python 3.4 and 3.3. We've installed the most recent versions we could get, though, so many of them are more up-to-date. Django 1.9.3 FTW!

Python 2.7, 3.3 and 3.4

  • pyodbc and its lower-level dependencies, so you should be able to connect to Microsoft SQL Servers elsewhere on the Internet.
  • pypdftk and its dependencies -- now we have three separate PDF libraries!
  • pint
  • uncertainties
  • flask-openid
  • And finally, we've upgraded twilio so that it works properly from free accounts.

Python 3.3 and 3.4 specific

  • mysqlclient (so now Django should work out of the box with Python 3)
  • basemap

New whitelisted sites

Paying PythonAnywhere customers get unrestricted Internet access, but if you're a free PythonAnywhere user, you may have hit problems when writing code that tries to access sites elsewhere on the Internet. We have to restrict you to sites on a whitelist to stop hackers from creating dummy accounts to hide their identities when breaking into other people's websites.

But we really do encourage you to suggest new sites that should be on the whitelist. Our rule is, if it's got an official public API, which means that the site's owners are encouraging automated access to their server, then we'll whitelist it.

Here are some sites we've added since our last newsletter:

  • *.wikidata.org -- like you'd expect, Wikipedia's database.
  • api.api.ai -- speech-to-text
  • *.golang.org and *.googlesource.com so that GoLang developers can run stuff on PythonAnywhere
  • cloud.memsource.com -- a translation platform
  • api.locu.com -- a site to push business listings to a variety of directories
  • cloud.feedly.com -- manage RSS feeds
  • string-db.org -- a protein interaction database
  • api.football-data.org -- exactly what you think it is, unless you're in the US -- it's about soccer.
  • api.fixer.io -- API for the FX rates published by the European Central Bank
  • api.mca.sh -- a Norwegian banking site.
  • eliteprospects.com -- hockey stats
  • qq.com -- various endpoints for one of China's biggest sites
  • overpass-api.de -- German transport data
  • app.box.com and api.box.com -- Dropbox for the enterprise
  • *.mlab.com -- MongoLab's new name
  • mikomos.com -- an online database of places to meet for dates.
  • api.stormpath.com -- an identity API
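To see how wildcard entries like the ones above behave, here is an illustrative matcher built on the standard library's `fnmatch`. It is only a sketch of the idea and not a description of how the real whitelist is enforced; the function name is made up and the patterns are a handful copied from the list:

```python
from fnmatch import fnmatchcase

# A few entries copied from the list above.
WHITELIST = ["*.wikidata.org", "api.fixer.io", "*.mlab.com", "api.box.com"]

def is_whitelisted(host, patterns=WHITELIST):
    # Hostnames are case-insensitive, so normalise before matching.
    host = host.lower().rstrip(".")
    return any(fnmatchcase(host, pattern) for pattern in patterns)

print(is_whitelisted("www.wikidata.org"))  # True
print(is_whitelisted("API.fixer.io"))      # True
print(is_whitelisted("example.com"))       # False
```

Note that with this kind of matching a pattern like `*.wikidata.org` covers subdomains but not the bare `wikidata.org`.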

And that's it

Thanks for reading our newsletter! Tune in the same time next month for more news from PythonAnywhere.


System upgrade, 2016-04-12: Python 3.5

We upgraded PythonAnywhere today. The big story for this release is that we now support Python 3.5.1 everywhere :-) We've put it through extensive testing, but of course it's possible that glitches remain -- please do let us know in the forums or by email if you find any.

There were a few other minor changes -- basically, a bunch of system package installs and upgrades:

  • mysqlclient for Python 3.x (so now Django should work out of the box with Python 3)
  • pyodbc and its lower-level dependencies, so you should be able to connect to Microsoft SQL Servers elsewhere on the Internet.
  • pdftk
  • basemap for Python 3.x.
  • pint
  • uncertainties
  • flask-openid
  • And finally, we've upgraded Twilio so that it works properly from free accounts.

The PythonAnywhere newsletter, March 2016

Well, it's been nine months since our last newsletter and we've got a lot to tell you... Let's get started.

Cool new stuff part 1: Jupyter/IPython notebooks

Since the end of last year, all paid PythonAnywhere accounts have supported Jupyter/IPython notebooks. If you go to the "Files" tab, you can run existing notebooks, or create new ones. If you do anything involving data analysis, or exploratory interactive coding, they're a must-see.

We're still working out how to provide some kind of access for free users (without breaking the bank with our own server costs) so stay tuned...

Cool new stuff part 2: Education

Do you teach programming to a class of students? Do you have a coach who's helping you learn to code? Or do you just have a bunch of PythonAnywhere accounts and would like to access them all from one login?

With our new education feature, when you're logged into PythonAnywhere you can go to the "Account" tab and then to the "Teacher" section, and enter the username of another account. The person who's logged in to that other account can then "switch modes" so that they can use the site as if they are you. They can see your files, look at your consoles, and so on.

So -- if you're a teacher, next time you're running a class, ask your students to nominate you as their teacher, and you'll be able to help them without having to shoulder-surf. Super-useful for remote classes. (If you want us to bulk-create a bunch of accounts for your students beforehand, just get in touch on support@pythonanywhere.com.)

If you're a student, just nominate your coach as your teacher, and they can help you with your coding questions quickly and easily.

And if you have a bunch of PythonAnywhere accounts that you want to access while logged in as a "superuser" account, just log in to each of the other accounts in turn and nominate your main account as the teacher. If you're a web developer using PythonAnywhere to host your customers' sites -- in separate accounts to make billing easier -- then you never need to wonder what the login details for each of the customers are.

There's more information here.

More cool new stuff! Part 3, custom consoles

Do you often need to start a console running a particular script? Maybe there's a metrics script you run once a day to find out how much money your wildly-successful website has made over the last 24 hours :-) Or maybe you just want to download some data to update your site.

On the "Consoles" tab, you can add a custom console to do whatever you want. Click on the little "+" icon next to "Custom", and you can enter a name (for you to recognise it by) and a bash command, or a path to a script. Click the checkmark, and you'll have a new custom script on your "Consoles" tab. From now on, every time you want to launch your script, it's just a click away.

Give it a go -- you'll be surprised how many helpful scripts you wind up adding.

More about custom consoles in this blog post.

Yet more cool new stuff! Part 4, web app hit counting

How busy is your website? If you have a free account, you can now go to the "Web" tab and see how many hits you've had in the last hour, day or month -- and comparable numbers for last month. And if you're a paying customer, you get pretty live charts you can zoom into and analyse in depth :-)

Pretty pictures here.

Even more cool new stuff! Part 5, a better editor

We've made our in-browser editor (the one you get if you click on a .py file in the "Files" tab) much better. Many thanks to everyone for the suggestions! There are two really noticeable changes:

  • The console that shows the results of your code when you click the "Save and run" button is no longer in a popup tab -- it's right there in the editor. No more problems with popup blockers, or having to click back and forth between tabs.
  • "Save as" -- it's kind of silly that our editor didn't have this. Now it does :-)

Stuff that isn't really very cool but you probably need to know!

  • Since day one, we've provided a MySQL database for everyone. The hostname we suggested for accessing it was simply mysql.server. (We have stuff in place so that address can point to different places for different people.) That wasn't working too well, unfortunately -- basically, the stuff to make the same address go to different servers for different people made it kind of slow -- so we've changed the address. If you go to the "Databases" tab and look at the top, in the "Connecting" section, you'll see a "Database host address", which is the one you should use now. That address isn't accessible from outside PythonAnywhere (for security reasons) but it will work inside.

    We are removing support for the mysql.server address in the very near future (probably about a month), so update your config accordingly.

  • Website CNAMEs. If you have a paid account and are using a custom domain, we used to tell you to point a CNAME at yourusername.pythonanywhere.com. This really confused lots of people, because you can also have a website at yourusername.pythonanywhere.com, and that might be showing a completely different site to the one on your custom domain. We've revamped that, and now you can specify a different numeric CNAME value that doesn't host a site at all. We recommend you change over -- though, again, we'll continue to support the old-style CNAME for the time being.

From the forums and our blog

New modules

Although you can install Python packages on PythonAnywhere yourself, we like to make sure that we have plenty of batteries included. Here's what we've added since the last newsletter:

Python 2.7

plotly (1.9.3)
Upgraded tweepy to 3.5.0 (so that it works with Twitter's latest API)

Python 3.3 and 3.4

We added around 150 packages so that Python 3 packages match (as much as possible) the packages that are available for Python 2.

Here are some highlights, but you can visit the complete list for more details.

GitPython (1.0.1)
google-api-python-client (1.4.1)
pycurl (7.19.5.1)
pyflakes (0.9.2)
pyspotify (2.0.0)
tweepy (3.3.0)
xlrd (0.9.3)
xlwt (1.0.0)

New whitelisted sites

If you're a free PythonAnywhere user, you may have hit problems when writing code that tries to access sites elsewhere on the Internet. We have to restrict you to sites on a whitelist -- ones with an official public API -- to stop hackers from using us as a one-stop-shop to hide their identities when doing nefarious things. But we keep adding stuff to the whitelist; since our last newsletter, we've added over 100 new sites, but here are some that you may recognise:

api.coursera.org
api.dropboxapi.com
*.federalreserve.gov
api.pushover.net
api.stripe.com
data.sparkfun.com
soundcloud.com

And that's it!

Thanks for reading, and tune in next month for another exciting newsletter from PythonAnywhere.


Webapps and scheduled task expiries

tl;dr: for free accounts, web apps and scheduled tasks will stop running after a while if you don't log in. We'll email you a warning before this happens. Here's why:

Loads of people create free Python websites on PythonAnywhere and this is a really cool thing. Some of these websites are active ones where people are hosting their personal stuff, doing academic things, etc., and they want to keep them running. This is awesome! We want people to do that and we're happy to host this stuff for free.

Other people set up a web app just to try out a new web framework, and may not intend to keep it running forever. That's fine too! We're glad to help people learn.

The problem for us is that we can't tell which is which.

Search engines such as Google and Baidu, and other web crawlers continuously index and hit all of the free websites, so they all look active, even the ones that nobody is using. We also didn't have any mechanism to tell which scheduled tasks were still important to their owners.

We don't want to reduce the service that we offer for free. We think it's an important way to give back to the community (and of course lots of free users start paying us after a while -- the record is a free user who started paying us after three years! -- so we have an incentive there too ;-)

On the other hand, PythonAnywhere is also very focused on offering a long term, sustainable service. In order to be a long term solution for you, we need a way to avoid accumulating dead code that will just grow and grow with no way of reducing it.

So, we've added a way for people to say "I'm still interested in keeping this web app/task running". Every three months (for web apps) or four weeks (for scheduled tasks) we'll email you to check. If you want to keep it up and running, there's a link in the email to click that will make sure it's all good for another three months or four weeks as appropriate. If you don't, you can just ignore the email.

If you miss the expiry email, all is not lost -- we won't delete anything. All your files, web app setup, tasks and task logs will still be kept, but they just won't be actively running. You just need to log in and click a button to re-enable them, and they will be extended by three months/four weeks again.
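As a worked example of the two renewal periods, here is a small date calculation. Treating "three months" as calendar months, and the function names themselves, are assumptions made for illustration rather than a description of the actual implementation:

```python
import calendar
from datetime import date, timedelta

def add_months(day, months):
    """Move a date forward by whole months, clamping to the month's last day."""
    index = day.month - 1 + months
    year, month = day.year + index // 12, index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))

def next_expiry(kind, confirmed_on):
    # Periods from the post: three months for web apps, four weeks for tasks.
    if kind == "webapp":
        return add_months(confirmed_on, 3)
    if kind == "task":
        return confirmed_on + timedelta(weeks=4)
    raise ValueError("unknown kind: {!r}".format(kind))

print(next_expiry("webapp", date(2016, 3, 15)))  # 2016-06-15
print(next_expiry("task", date(2016, 3, 15)))    # 2016-04-12
```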

We hope this works out OK, and helps us avoid running stuff that nobody wants anymore, which adds unnecessary congestion to the servers and ultimately increases the costs we have to charge our paying customers. Hopefully, it is a good balance between the two goals of being long term sustainable and making sure that we can continue to host stuff that people actually want for free.

Reach out to us if you have any thoughts!


Deprecation warning: "mysql.server" hostname being retired

Relax everyone! We're not switching off our mysql service, just switching off one of the old names it was available under. You'll still be able to access it, and more reliably, under the new name, with no downtime. Details follow...

Action is required if you set up your mysql instance over a year ago

The old mysql.server proxy service is being shut down

We originally set up a local mysql-proxy instance on each server, which would forward traffic to our actual database servers. It was available locally under the hostname mysql.server, which we inject into people's hosts files. We decided this service wasn't as reliable as we'd like, and have been using custom DNS routing instead.

For the last 12 months or so we've been telling people to use the new hostnames, so you only need to worry if you set up your mysql connection a long time ago (maximum respect to our OG users by the way!)

Use the new myusername.mysql.pythonanywhere-services.com address from your Databases tab

Head on over to your Databases tab (available from the dashboard) and you'll find the new hostname you should be using. Then, scrobble away in your settings.py, or DAL.py, or wherever it is that you store your mysql settings, bounce your web app or application, kick the tyres, and you're good to go.
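For a Django app, the change might look something like the sketch below. The hostname pattern is the one shown on the Databases tab; the "myusername" account, the database name and the password are placeholders, and the little helper function is purely illustrative, not something PythonAnywhere provides.

```python
# Illustrative only: "myusername", the database name and the password are
# placeholders -- copy the real values from your Databases tab.

def new_mysql_host(username):
    """Build the per-user hostname that replaces the old "mysql.server"."""
    return "{}.mysql.pythonanywhere-services.com".format(username)


# A Django-style settings.py fragment using the new hostname
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myusername$default",           # assumed database name
        "USER": "myusername",
        "PASSWORD": "your-mysql-password",      # placeholder
        "HOST": new_mysql_host("myusername"),   # was: "mysql.server"
    }
}

print(DATABASES["default"]["HOST"])
```

Other frameworks keep the hostname in a different place, but the only thing that needs to change is the host value.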

You have 15 seconds to comply

Get it done soon! We expect to retire the mysql.server service at our next-but-one deployment, which could be as little as two weeks away. Why not do it now? Just drop us a line via support@pythonanywhere.com if you need any help.


Jupyter notebooks finance demo

ipython-demo

The goal of this demo is to show how ipython notebooks can be used in conjunction with different datasources (eg: Quandl) and useful python libraries (eg: pandas) to do financial analysis. It will try to slowly introduce new and useful functions for the new python user.

Since oil-equity corr has been all the talk these days (this demo was written in Jan 2016), let's take a look at it!

In [1]:
# PythonAnywhere comes pre-installed with Quandl, so you just need to import it
import Quandl

# first, go to quandl.com and search for the ticker symbol that you want
# let's say we want to look at (continuous) front month crude vs e-mini S&Ps

cl = Quandl.get('CHRIS/CME_CL1')
es = Quandl.get('CHRIS/CME_ES1')

In [2]:
# Quandl.get() returns a pandas dataframe, so you can use all the pandas goodies
# For example, you can use tail to look at the most recent data, just like the unix tail binary!
es.tail()

Out[2]:
Open High Low Last Change Settle Volume Open Interest
Date
2016-03-02 1976.25 1984.75 1966.25 1982.25 5.5 1983.5 1814091 3008297
2016-03-03 1981.75 1992.50 1974.75 1991.50 7.0 1990.5 1541249 3008594
2016-03-04 1991.00 2007.50 1984.00 1994.75 4.5 1995.0 2232860 3018684
2016-03-07 1994.50 2004.50 1984.50 1999.75 4.0 1999.0 1623905 3012243
2016-03-08 1999.00 2000.25 1976.00 1982.50 18.0 1981.0 1928239 3005808

In [3]:
# you can also get statistics
es.describe()

Out[3]:
Open High Low Last Change Settle Volume Open Interest
count 4723.000000 4736.000000 4738.000000 4738.000000 512.000000 4738.000000 4738.000000 4738.000000
mean 1313.871427 1325.847867 1305.396581 1316.367885 13.185547 1316.353480 1076569.155129 1451989.689743
std 310.180613 312.158070 311.821112 312.201487 12.239507 312.167612 956268.606849 1159894.330244
min 674.750000 694.750000 665.750000 676.000000 0.250000 676.000000 0.000000 0.000000
25% 1109.250000 1117.000000 1102.500000 1109.750000 4.000000 1109.750000 182620.250000 193737.500000
50% 1268.500000 1277.875000 1259.500000 1269.500000 9.500000 1269.500000 857774.500000 1351206.000000
75% 1437.000000 1449.500000 1429.000000 1438.687500 19.562500 1438.687500 1728705.250000 2678325.750000
max 2129.250000 2134.000000 2122.750000 2128.750000 100.250000 2128.000000 6285917.000000 3594453.000000

But wait!

What do we have here? Did you notice that the count is different for the different columns?

Let's take a look at what the missing values are:

In [4]:
# select the rows where Open has missing data points
es[es['Open'].isnull()].head()

Out[4]:
Open High Low Last Change Settle Volume Open Interest
Date
2015-11-17 NaN 2063.50 2041.50 2049.75 1.00 2049.0 1610071 2803541
2015-11-27 NaN 2098.25 2081.50 2090.50 2.00 2090.0 653079 2761335
2015-12-01 NaN 2101.50 2083.50 2099.25 20.25 2100.0 1479676 2764688
2015-12-02 NaN 2105.00 2075.00 2083.50 18.50 2081.5 1709808 2759024
2015-12-09 NaN 2079.75 2034.25 2045.25 16.75 2042.0 2660114 2624311
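
If you'd rather have a number per column than eyeball the rows, something like this sketch does the job. It runs on a tiny made-up DataFrame (the values and dates are invented, not Quandl data), so treat it as an illustration of the calls rather than of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the futures data (values are made up)
toy = pd.DataFrame(
    {
        "Open": [10.0, np.nan, 10.4, np.nan],
        "Settle": [10.1, 10.3, 10.2, 10.6],
    },
    index=pd.to_datetime(
        ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07"]
    ),
)

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# The "count" row from describe() is the number of non-missing values
print(toy.count())
```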

Hmmm. Time to spend money and buy good data?

Eh. We really only need the daily close here anyways (ie. the settle column). Let's zoom in on that.

In [5]:
es_close = es.Settle  # WHAT IS THIS SORCERY? Attribute access!
es_close.head()

Out[5]:
Date
1997-09-09    934
1997-09-10    915
1997-09-11    908
1997-09-12    924
1997-09-15    922
Name: Settle, dtype: float64

In [6]:
print(type(es))
print(type(es_close))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Oh ok. A column of a DataFrame is a Series (also a pandas object).

Note that it can still be linked to the DataFrame (i.e. changing the Series may change the DataFrame as well)
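
Whether a change to the Series really shows up in the DataFrame depends on the pandas version and on how the column was selected, so it is safer not to rely on it either way. A minimal sketch, on made-up numbers, of taking an explicit copy when you want something you can modify freely:

```python
import pandas as pd

# Made-up prices, purely for illustration
toy = pd.DataFrame({"Settle": [934.0, 915.0, 908.0]})

# An explicit copy is never linked to the parent DataFrame
settle_copy = toy["Settle"].copy()
settle_copy.iloc[0] = 0.0

# The DataFrame still holds the original value
print(toy["Settle"].iloc[0])
print(settle_copy.iloc[0])
```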

In [7]:
# Okay- time to quickly check the crude time series as well
cl.describe()

Out[7]:
Open High Low Last Change Settle Volume Open Interest
count 8273.000000 8275.000000 8275.000000 8275.000000 518.000000 8275.000000 8275.000000 8274.000000
mean 41.777084 42.333422 41.186375 41.777544 1.025753 41.777361 112520.504411 125494.688301
std 29.568643 29.946338 29.134537 29.559547 0.871786 29.559366 123411.769025 110104.481053
min 10.000000 11.020000 9.750000 10.420000 0.010000 10.420000 0.000000 0.000000
25% 19.600000 19.790000 19.400000 19.610000 0.400000 19.610000 30618.000000 46379.500000
50% 28.300000 28.650000 27.970000 28.320000 0.810000 28.320000 58659.000000 87553.000000
75% 61.290000 62.150000 60.485000 61.365000 1.470000 61.365000 166439.000000 176041.500000
max 145.190000 147.270000 143.220000 145.290000 7.540000 145.290000 824242.000000 608831.000000

In [8]:
# Hmm. That's a lot more counts. Does the crude time series start earlier than e-mini's?
cl.head()

Out[8]:
Open High Low Last Change Settle Volume Open Interest
Date
1983-03-30 29.01 29.56 29.01 29.40 NaN 29.40 949 470
1983-03-31 29.40 29.60 29.25 29.29 NaN 29.29 521 523
1983-04-04 29.30 29.70 29.29 29.44 NaN 29.44 156 583
1983-04-05 29.50 29.80 29.50 29.71 NaN 29.71 175 623
1983-04-06 29.90 29.92 29.65 29.90 NaN 29.90 392 640

In [9]:
earliest_es_date = es.index[0]

# at first glance, you could just do
cl[earliest_es_date:].head()

Out[9]:
Open High Low Last Change Settle Volume Open Interest
Date
1997-09-09 19.43 19.61 19.37 19.42 NaN 19.42 32299 88070
1997-09-10 19.57 19.57 19.35 19.42 NaN 19.42 41858 86872
1997-09-11 19.49 19.72 19.30 19.37 NaN 19.37 52342 80434
1997-09-12 19.42 19.47 19.27 19.32 NaN 19.32 28540 80440
1997-09-15 19.29 19.38 19.23 19.27 NaN 19.27 31610 76590

In [10]:
# but just in case there is no matching precise date, we can also take the closest date:
closest_row = cl.index.searchsorted(earliest_es_date)
cl_close = cl.iloc[closest_row:].Settle
cl_close.head()

Out[10]:
Date
1997-09-09    19.42
1997-09-10    19.42
1997-09-11    19.37
1997-09-12    19.32
1997-09-15    19.27
Name: Settle, dtype: float64
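
To see what searchsorted does when the exact date really is missing, here is a small sketch on a made-up calendar with a deliberate gap. "Closest" here means the first date on or after the one you asked for.

```python
import pandas as pd

# Made-up trading calendar with a gap: 1997-09-09 is missing
dates = pd.to_datetime(
    ["1997-09-05", "1997-09-08", "1997-09-10", "1997-09-11"]
)
toy = pd.Series([19.6, 19.5, 19.4, 19.3], index=dates)

target = pd.Timestamp("1997-09-09")

# searchsorted returns the position where target would be inserted to keep
# the index sorted, i.e. the first row whose date is >= target
pos = toy.index.searchsorted(target)
print(pos)
print(toy.iloc[pos:].index[0])
```

This only makes sense on a sorted index, which a date-indexed time series like this one normally is.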

In [11]:
# ok let's just plot this guy
import matplotlib
import matplotlib.pyplot as plt
# use new pretty plots
matplotlib.style.use('ggplot')
# get ipython notebook to show graphs
%pylab inline

es_close.plot()

Populating the interactive namespace from numpy and matplotlib
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1ef10cf28>

That was satisfying -- our all too familiar S&P chart. Let's try to plot both S&P and oil in the same graph.

In [12]:
plt.figure()
es_close.plot()
cl_close.plot()
plt.yscale('log')

Meh. Okay. LET'S ACTUALLY DO SOME MATH!

(... ahem. stats)

In [13]:
es['Settle'].corr(cl['Settle'])

Out[13]:
0.34046181767351186
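
Worth knowing: the two Settle series have different lengths, and corr copes with that because it lines the series up by index label before computing anything. A quick sketch on two made-up series that only partly overlap:

```python
import numpy as np
import pandas as pd

# Two made-up series whose indexes only partly overlap
a = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0], index=[0, 1, 2, 3, 4])
b = pd.Series([2.0, 1.0, 5.0, 4.0, 8.0], index=[2, 3, 4, 5, 6])

# Series.corr aligns the two series by index label first,
# so only labels 2, 3 and 4 contribute
by_label = a.corr(b)

# The same number computed by hand on the overlapping labels
overlap = [2, 3, 4]
by_hand = np.corrcoef(a.loc[overlap].values, b.loc[overlap].values)[0, 1]

print(by_label, by_hand)
```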

okay... MOAR GRAPHS! I hear you say

In [14]:
import pandas as pd
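pd.rolling_corr(es_close, cl_close, window=252).dropna()
# why 252? because that's the number of trading days in a year
# (pd.rolling_corr is the older spelling; later pandas releases use
# es_close.rolling(window=252).corr(cl_close) instead)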

Out[14]:
Date
2014-11-21   -0.369646
2014-11-24   -0.386641
2014-11-25   -0.404232
2014-11-26   -0.421947
2014-11-28   -0.441608
2014-12-01   -0.457250
2014-12-02   -0.473815
2014-12-03   -0.488969
2014-12-04   -0.502630
2014-12-05   -0.516329
2014-12-08   -0.526947
2014-12-09   -0.536849
2014-12-10   -0.541803
2014-12-11   -0.548337
2014-12-12   -0.549632
2014-12-15   -0.549979
2014-12-16   -0.546815
2014-12-17   -0.551038
2014-12-18   -0.560348
2014-12-19   -0.569132
2014-12-22   -0.577833
2014-12-23   -0.586649
2014-12-24   -0.594898
2014-12-26   -0.603125
2014-12-29   -0.610925
2014-12-30   -0.617783
2014-12-31   -0.622196
2015-01-02   -0.626952
2015-01-05   -0.628422
2015-01-06   -0.627138
                ...   
2016-01-26    0.613887
2016-01-27    0.620853
2016-01-28    0.626633
2016-01-29    0.630797
2016-02-01    0.636777
2016-02-02    0.644177
2016-02-03    0.650006
2016-02-04    0.655308
2016-02-05    0.661402
2016-02-08    0.668422
2016-02-09    0.676663
2016-02-10    0.684008
2016-02-11    0.692094
2016-02-12    0.697270
2016-02-16    0.701433
2016-02-17    0.703887
2016-02-18    0.706600
2016-02-19    0.709578
2016-02-22    0.711551
2016-02-23    0.714437
2016-02-24    0.717001
2016-02-25    0.718281
2016-02-26    0.720613
2016-02-29    0.722511
2016-03-01    0.723214
2016-03-02    0.723336
2016-03-03    0.722921
2016-03-04    0.722644
2016-03-07    0.722566
2016-03-08    0.722830
Name: Settle, dtype: float64

That's weird. You'd expect the first year to drop out (because the rolling correlation window starts after the first year), but it should have started after Sept 1998. Instead it is starting in 2014...

In [15]:
print(len(cl_close))
print(len(es_close))

4646
4738

In [16]:
merged = pd.concat({'es': es_close, 'cl': cl_close}, axis=1)
# maybe this is the culprit?
merged[merged['cl'].isnull()].head()

Out[16]:
cl es
Date
1997-11-27 NaN 959.50
1997-11-28 NaN 955.00
1998-01-19 NaN 972.25
1998-02-16 NaN 1019.00
1998-05-25 NaN 1116.50

In [17]:
merged.dropna(how='any', inplace=True)
# BAD DATA BEGONE!
merged[merged['cl'].isnull()]

Out[17]:
cl es
Date
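
How much damage can a few missing values do to a rolling calculation? Here is a toy sketch on made-up numbers with a window of 3 rather than 252. It uses the method-style Series.rolling(...).corr(...), which took over from pd.rolling_corr in later pandas releases.

```python
import numpy as np
import pandas as pd

# Made-up data, just to show the mechanics
a = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 10.0])
b = pd.Series([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 10.0, 9.0])

window = 3

# Rolling correlation with no gaps in either series
clean = a.rolling(window=window).corr(b)

# Knock out a single observation in one of the series
b_gappy = b.copy()
b_gappy.iloc[4] = np.nan
gappy = a.rolling(window=window).corr(b_gappy)

# Every window that touches the missing value comes back as NaN
print(clean.notnull().sum())   # windows with a result, no gaps
print(gappy.notnull().sum())   # windows with a result, one gap
```

One missing point costs a whole window's worth of results, so with a 252-day window a scattering of missing days can blank out long stretches of the output.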

In [18]:
pd.rolling_corr(merged.es, merged.cl, window=252).dropna().plot()
plt.axhline(0, color='k')

Out[18]:
<matplotlib.lines.Line2D at 0x7fe1e2e5fef0>

Brilliant! But this is still quite inconclusive in terms of equity/crude corr. Why? Well we are forgetting about one HUGE HUGE factor affecting correlation here.

In [19]:
# D'oh
import numpy as np
print('Autocorrelation for a random series is {:.3f}'.format(
    pd.Series(np.random.randn(100000)).autocorr())
)
print('But, autocorrelation for S&P is {:.3f}'.format(es_close.autocorr()))

Autocorrelation for a random series is -0.003
But, autocorrelation for S&P is 0.999

So that's why we should look at %-change instead of $-close or $-change...
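
A toy example makes the point. The sketch below builds a simulated price series from independent random returns (nothing to do with the real S&P data; the seed, volatility and length are arbitrary choices) and compares the autocorrelation of the price level with that of its daily % changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated daily log-returns: independent draws, so no memory by construction
log_returns = 0.01 * rng.standard_normal(5000)

# A price level built by compounding those returns (a random walk in logs)
price = pd.Series(100.0 * np.exp(np.cumsum(log_returns)))

level_autocorr = price.autocorr()
return_autocorr = price.pct_change().autocorr()

print('Autocorrelation of the price level: {:.3f}'.format(level_autocorr))
print('Autocorrelation of daily % changes: {:.3f}'.format(return_autocorr))
```

The level is very strongly autocorrelated even though the returns behind it are pure noise, which is why correlating two price levels can be so misleading.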

In [20]:
daily_returns = merged.pct_change()
rolling_correlation = pd.rolling_corr(daily_returns.es, daily_returns.cl, window=252).dropna()
rolling_correlation.plot()
plt.axhline(0, color='k')
plt.title('Rolling 1 yr correlation between Oil and S&P')

Out[20]:
<matplotlib.text.Text at 0x7fe1e2c89ba8>

Great. Now this is much more interesting. It is quite clear that the period of higher correlation between oil prices and equities came after 2009. Qualitatively, we know (if you worked in finance back then) that this was the case: previously, extremely high oil prices (over $100/bbl) were seen as a drag on the economy. Nowadays, extremely low oil prices are seen as an indication of weakness in global demand, with oil prices, equity, credit etc. all selling off hand in hand when there is risk-off sentiment.

Let's plot some pretty graphs to show what we know qualitatively, and make sure our memory was correct.

In [21]:
# vertically split into two subplots, and align x-axis
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
fig.suptitle('Checking our intuition about correlation', fontsize=14, fontweight='bold')
# make space for the title
fig.subplots_adjust(top=0.85)

from matplotlib.ticker import MaxNLocator

rolling_correlation.plot(ax=ax1)
ax1.set_title('Rolling correlation of WTI returns vs S&P returns')
ax1.axhline(0, color='k')
ax1.tick_params(
    which='both',  # both major and minor ticks
    bottom=False,
    top=False,
    right=False,
    labelbottom=False  # labels along the bottom edge are off
)

cl_close.plot(ax=ax2)
ax2.set_title('Price of front month WTI crude')
ax2.tick_params(which='both', top=False, right=False)
ax2.tick_params(which='minor', bottom=False)
ax2.yaxis.set_major_locator(MaxNLocator(5))  # how many ticks

Alright, fine. So we can distinctly see the regime change starting from the European debt crisis, when oil came back down from $150/bbl. Traders no longer saw high oil prices as a drag on the economy, and instead focused their attention on global demand as we entered a period of slow growth.

Also, despite all the recent talk about equity/oil correlation, we actually saw higher correlations in the 2011-2013 period.

So this is an interesting observation. But as data scientists, we must test this hypothesis! If this recent spike in equity/crude corr is really driven by risk-off sentiment, let's see if there is also much stronger cross-asset correlation in other risk assets. Stay tuned for the next part of this series!


Quickstart: TensorFlow-Examples on PythonAnywhere

Aymeric Damien's "TensorFlow Examples" repository popped up on Hacker News today, and I decided to take a look. TensorFlow is an Open Source library for Machine Intelligence, built by Google, and Aymeric's examples are not only pretty neat, but they also have IPython notebook versions.

Here's how I got it all running on a PythonAnywhere account, from a bash console:

$ pip install --user --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.7.1-cp27-none-linux_x86_64.whl
$ git clone git@github.com:aymericdamien/TensorFlow-Examples.git
$ cd TensorFlow-Examples/examples/1\ -\ Introduction/
$ python helloworld.py

That printed out Hello, TensorFlow!, so stuff has clearly installed properly. Let's train and run a neural net.

$ cd ../3\ -\ Neural\ Networks/
$ python multilayer_perceptron.py

That downloads some test data (a standard set of images of digits), and trains the net to recognise them.

Now, as I'm a paying PythonAnywhere customer, I can run IPython Notebooks. So, on the "Files" tab, I navigated to the TensorFlow-Examples subdirectory of my home directory, then went into the notebooks subdirectory, then down into 3 - Neural Networks. I clicked on the multilayer_perceptron.ipynb file, and got a notebook. It told me that it couldn't find a kernel called "IPython (Python 2.7)", but gave me a list of alternatives -- I just picked "Python 2.7" and clicked OK.

Next, I tried to run the notebook ("Cell" menu, "Run all" option). It failed, saying that it couldn't import input_data. That was easy to fix -- it looks like that module (which is the one that downloads the training dataset) is in the repository's examples subdirectory, but not in the notebooks one. Back to the bash console in a different tab:

$ cp ~/TensorFlow-Examples/examples/3\ -\ Neural\ Networks/input_data.py ~/TensorFlow-Examples/notebooks/3\ -\ Neural\ Networks/

...then back to the notebook, and run all again -- and it starts training my network again :-)

Now, the next step -- to try to understand what all this stuff actually does, and how it works. I suspect that will be the difficult part.


