Numerai Tutorial Notebooks

One project I enjoyed working on recently was writing new tutorial notebooks for the Numerai data science tournament.

In my day-to-day job as CTO, I don’t get to write much code that makes it into production, so it was a joy to work on a small codebase by myself and push to master without having to go through code review 😇.

Background

Numerai is a quantitative hedge fund powered by a data science tournament. One key driver of hedge fund performance is the growth of tournament users who create the machine learning models.

When new users join the tournament, they are shown these 3 tutorials to help them get started quickly.

image

Setup

I really liked the way I set up the workflow for this project. It was definitely one of the smoothest notebook development and sharing experiences I have had.

The Notebooks

Build your first model (Hello Numerai)

This is the hello world for Numerai. I walk through the absolute basics of the Numerai dataset and how to train your first model using our example code based on LightGBM. Along the way, I demonstrate how to interact with our API to download the data, upload submissions, and explain how scoring works.
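To give a feel for the scoring idea, here is a minimal sketch, not the tournament's exact formula. It assumes that rows are grouped into eras and that a score is the correlation between ranked predictions and the target within each era. The function names `rank_pct` and `per_era_corr` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np


def rank_pct(x):
    """Rank values into (0, 1]; ties are broken by position, which is fine for a sketch."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable")
    return (ranks + 1) / len(x)


def per_era_corr(eras, preds, targets):
    """Return a dict mapping each era to the correlation of ranked preds with targets."""
    scores = {}
    for era in np.unique(eras):
        mask = eras == era
        ranked = rank_pct(preds[mask])
        scores[str(era)] = float(np.corrcoef(ranked, targets[mask])[0, 1])
    return scores


# Toy data (made up): two eras with five rows each.
eras = np.array(["era1"] * 5 + ["era2"] * 5)
targets = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0])
preds = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.1, 0.2, 0.3, 0.4, 0.5])

scores = per_era_corr(eras, preds, targets)
mean_score = float(np.mean(list(scores.values())))
print(scores, mean_score)
```

Scoring per era rather than over the whole dataset means a model is judged on how consistently it ranks rows within each time period, so one unusually good period cannot hide several bad ones.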

One of the main challenges in creating this notebook was deciding what to put in and what to leave out. How much information does a new user actually need to get started? I ended up glossing over a lot of things for the sake of streamlining the overall experience. But I think it strikes the right balance between giving new users a little taste of everything and getting them to the “magic moment” ASAP.

Learn about risk (Feature Neutralization)

One key thing not mentioned in Hello Numerai is how to actually improve the performance of your model beyond the example model provided. Now there is actually a ton of information in our forums on this topic, but it is incredibly difficult to sift through all of it. My approach here was to pick one topic to go deep on, and use it as an example for how to run experiments in general.

I ended up focusing this chapter on Feature Neutralization, a concept that is very much idiosyncratic to Numerai. The key idea is that because the stock market is non-stationary, exposure to features poses a risk which can hurt performance, and feature neutralization is a way to control this risk.
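As a rough illustration of the idea, the sketch below removes the part of a prediction vector that can be explained linearly by the features, using a least-squares projection. This is one common way to describe neutralization and is not necessarily the exact code in the notebook. The function name `neutralize`, the `proportion` parameter, the added constant column, and the synthetic data are all assumptions made for this example. In practice one would typically apply this within each era separately.

```python
import numpy as np


def neutralize(preds, features, proportion=1.0):
    """Subtract `proportion` of the linear feature exposure from `preds`."""
    # Assumption: include a constant column so the mean is removed as well.
    design = np.column_stack([features, np.ones(len(preds))])
    # Least-squares fit of the predictions on the features.
    beta, *_ = np.linalg.lstsq(design, preds, rcond=None)
    exposure = design @ beta
    return preds - proportion * exposure


# Synthetic data (made up): predictions that lean heavily on feature 0.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 5))
signal = rng.normal(size=500)
preds = signal + 0.8 * features[:, 0]

neutral = neutralize(preds, features, proportion=1.0)

corr_before = float(np.corrcoef(preds, features[:, 0])[0, 1])
corr_after = float(np.corrcoef(neutral, features[:, 0])[0, 1])
print(f"exposure to feature 0 before: {corr_before:.3f}, after: {corr_after:.3f}")
```

With `proportion=1.0` the neutralized predictions are linearly uncorrelated with every feature. A value between 0 and 1 trades some of that risk reduction for keeping more of the original signal.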

Create an ensemble (Target Ensemble)

For this final notebook, I wanted to pick a subject to explore that gave me an excuse to dive deeper into the dataset and build a more complex model.

I ended up picking the idea of building an ensemble model using multiple targets, since this was a concept that only a few advanced users knew about. The key thing that I explain here is that apart from the main “target” that models are scored against, we actually have many “auxiliary” targets in the dataset that can be used to build ensembles.
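The general shape of a target ensemble can be sketched as follows: fit one model per target, rank each model's predictions, and average the ranks. This is a sketch under stated assumptions, not the notebook's code. A plain least-squares fit stands in for LightGBM to keep the example dependency-free, the target names `target_aux_a` and `target_aux_b` are hypothetical, and the data is synthetic.

```python
import numpy as np


def rank_pct(x):
    """Rank values into (0, 1] so predictions from different models are comparable."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)


def fit_linear(X, y):
    """Least-squares stand-in for a real model such as LightGBM."""
    coef, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return coef


# Synthetic data (made up): three noisy targets that share one latent signal.
rng = np.random.default_rng(42)
n_rows, n_features = 1000, 10
X = rng.normal(size=(n_rows, n_features))
true_w = rng.normal(size=n_features)
latent = X @ true_w
targets = {
    "target": latent + rng.normal(scale=3.0, size=n_rows),
    "target_aux_a": latent + rng.normal(scale=3.0, size=n_rows),  # hypothetical name
    "target_aux_b": latent + rng.normal(scale=3.0, size=n_rows),  # hypothetical name
}

# One model per target.
models = {name: fit_linear(X, y) for name, y in targets.items()}

# Rank each model's predictions on new rows, then average the ranks.
X_live = rng.normal(size=(200, n_features))
per_target = {name: rank_pct(X_live @ coef) for name, coef in models.items()}
ensemble = np.mean(list(per_target.values()), axis=0)

corr_with_latent = float(np.corrcoef(ensemble, X_live @ true_w)[0, 1])
print(f"ensemble correlation with latent signal: {corr_with_latent:.3f}")
```

Ranking before averaging keeps any single model from dominating the ensemble just because its raw outputs happen to have a larger scale.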

Bonus: Model Uploads

One related project worth mentioning here is Model Uploads, which is a new feature that my team developed in conjunction with these new notebooks.

In order for users to compete in the tournament, they must set up and deploy their models into a production environment to generate predictions on live data every day. This is because the tournament needs to integrate with our daily trading pipeline.

image

This is a huge pain for new users (who are generally data scientists and mathematicians, not infrastructure engineers) and many of them churn at this point without getting into the meat of the core data science problem.

image

Model Uploads is essentially a free and simple-to-use model hosting platform that we provide to abstract away the complexities of setting up a production environment, deploying models, and integrating with our data and submission APIs.

image

With Model Uploads, users can now deploy their model straight from a Jupyter notebook and manage their models with this simple management console UI on our website.

image

You can read more about Model Uploads in this Medium post I wrote:

It was only by getting all of this infrastructure complexity out of the way that we could make the new user onboarding experience streamlined and focused on the core data science problem, which is the more interesting and engaging part of Numerai.