How to use a Random Forest classifier in Python using Scikit-Learn

Random Forest is a powerful machine learning algorithm that can be used as a regressor or as a classifier. It’s a meta estimator, meaning it fits a specified number of decision trees on the dataset and aggregates their predictions.

We’re going to use the Scikit-Learn package in Python, a very useful library which contains a lot of machine learning algorithms and related tools.
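If you don’t have it installed yet, Scikit-Learn (along with NumPy and pandas, which we’ll also use) can typically be installed with pip install scikit-learn numpy pandas.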

Data preparation

To see how Random Forest can be applied, we’re going to try to predict the S&P 500 futures (E-Mini); you can get the data for free on Quandl. Here is what it looks like:

Date Open High Low Last Change Settle Volume Previous Day Open Interest
2016-12-30 2246.25 2252.75 2228.0 2233.5 8.75 2236.25 1252004.0 2752438.0
2016-12-29 2245.5 2250.0 2239.5 2246.25 0.25 2245.0 883279.0 2758174.0
2016-12-28 2261.25 2267.5 2243.5 2244.75 15.75 2245.25 976944.0 2744092.0

The column Change needs to be removed: it contains missing data, and the same information can be retrieved directly by subtracting the previous day’s settle from today’s settle.

Since we’re building a classifier, we need to create a class for each row: 1 if the future went up that day, -1 if it went down or stayed the same.

import numpy as np
import pandas as pd

def computeClassification(actual):
    if actual > 0:
        return 1
    else:
        return -1

data = pd.read_csv('EMini.csv', index_col='Date', parse_dates=True)

# Remove the Change column: it has missing data and is redundant
# with the daily return computed below
data = data.drop(columns=['Change'])

# Compute the daily returns (rows are in descending date order,
# so shift(-1) points to the previous trading day)
data['Return'] = (data['Settle'] / data['Settle'].shift(-1) - 1) * 100

# Delete the last row, which contains NaN
data = data.drop(data.tail(1).index)

# Turn the last column (Y) into classes: -1 = down, 1 = up
data.iloc[:, -1] = data.iloc[:, -1].apply(computeClassification)
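As a quick sanity check, you can look at how balanced the two classes are; a heavily skewed distribution would make the accuracy score below harder to interpret:

# Count how many up (1) and down (-1) days we have
print(data['Return'].value_counts())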

Now that we have a complete dataset with a value to predict, the last column “Return” which is either -1 or 1, let’s create the train and test datasets.

testData = data[-(len(data)//2):] # 2nd half
trainData = data[:-(len(data)//2)] # 1st half

# X is the list of features (every column except the last one, Return)
data_X_train = trainData.iloc[:, :-1]
# Y is the value to be predicted
data_Y_train = trainData.iloc[:, -1]

# Same thing for the test dataset
data_X_test = testData.iloc[:, :-1]
data_Y_test = testData.iloc[:, -1]

Using the algorithm

Once we have everything ready, we can fit the Random Forest classifier on our train dataset:

from sklearn import ensemble

# I picked 100 arbitrarily; we'll see in another post how to find the optimal number of estimators
clf = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(data_X_train, data_Y_train)

predictions = clf.predict(data_X_test)

predictions is an array containing the predicted values (-1 or 1) for the features in data_X_test.
You can check the prediction accuracy with the accuracy_score function, which compares the predicted values to the expected ones.

from sklearn.metrics import accuracy_score

print "Score: "+str(accuracy_score(data_Y_test, y_predictions))

What’s next?

Now you can, for example, create a trading strategy that goes long the future if the predicted value is 1 and goes short if it’s -1. This can easily be backtested using a backtest engine such as Zipline in Python.
Based on your backtest results, you could then add or remove features; maybe volatility or a 5-day moving average can improve the prediction accuracy? A sketch of those two extra columns follows below.
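As a starting point, here is a minimal sketch of how those two extra features could be computed with pandas. It assumes the code runs before the Return column is overwritten with the -1/1 labels, and the column names Mavg5 and Vol5 are just illustrative:

# Rows are in descending date order; sort ascending first so the
# rolling windows only look at past days, not future ones
data = data.sort_index()

# 5-day moving average of the settle price (hypothetical feature)
data['Mavg5'] = data['Settle'].rolling(5, min_periods=1).mean()

# 5-day rolling volatility of the daily returns (hypothetical feature)
data['Vol5'] = data['Return'].rolling(5, min_periods=1).std()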

Using feature selection to improve a machine learning strategy

For this tutorial, we’re going to assume we have the same basic structure as in the previous article about Random Forest. The idea is to do some feature engineering to generate a bunch of features; some of them may be useless and hurt the machine learning algorithm’s prediction score, and that’s where feature selection comes into action.

Feature engineering

This is not an attempt at perfect feature engineering; we just want to generate a good number of features and pick the most relevant ones afterwards. Depending on the dataset you have, you can create more interesting features like the day, the hour, whether it’s the weekend or not, and so on, as sketched below.
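For instance, assuming the DataFrame has a DatetimeIndex, such calendar features are one-liners in pandas (the column names here are my own):

# Calendar features derived from a DatetimeIndex (illustrative names)
data['dayOfWeek'] = data.index.dayofweek
data['hour'] = data.index.hour
data['isWeekend'] = (data.index.dayofweek >= 5).astype(int)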
Let’s assume we only have one column, ‘Mid’, which is the mid price between the bid and the ask. We can generate moving averages for various window sizes, 5 to 50 for example; the code is quite simple using pandas:

for i in range(5, 55, 5): # windows 5, 10, ..., 50
    data["mavgMid"+str(i)] = data["Mid"].rolling(i, min_periods=1).mean()

This way we get new columns: mavgMid5, mavgMid10 and so on.
We can do the same for the moving standard deviation, which can also be a useful input for a machine learning algorithm; the code is almost identical:

for i in range(5, 55, 5): # windows 5, 10, ..., 50
    data["stdMid"+str(i)] = data["Mid"].rolling(i, min_periods=1).std()

We can continue with various rolling indicators; see the full list in the pandas documentation. I personally like the rolling correlation because in the crypto-currency world, correlation is very volatile and contains a lot of information, especially for inter-exchange arbitrage opportunities. In this case you need to add another column with prices from another source.
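For example, assuming a hypothetical column MidOther holding prices from another exchange, a rolling correlation feature could look like this:

# 30-period rolling correlation between our mid price and another
# exchange's price; 'MidOther' and the window size are illustrative
data['corrMid30'] = data['Mid'].rolling(30, min_periods=1).corr(data['MidOther'])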

Here is an example of a full function:

def featureEngineering(data):
    # Moving averages, windows 5 to 50
    for i in range(5, 55, 5):
        data["mavgMid"+str(i)] = data["Mid"].rolling(i, min_periods=1).mean()

    # Rolling standard deviations, same windows
    for i in range(5, 55, 5):
        data["stdMid"+str(i)] = data["Mid"].rolling(i, min_periods=1).std()

    # Remove the first 50 rows, where the longest rolling window
    # is not fully populated yet (50 is our max window)
    data = data.drop(data.head(50).index)

    return data
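Applying it is then a one-liner:

data = featureEngineering(data)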

Feature selection

After the feature engineering step we should have 20 features (plus the Signal column). I ran the algorithm with the same parameters as in the previous article, but on XMR-BTC minute data over a week using the Crypto Compare API (tutorial to come soon), and I got a decent score of 0.53.

That’s not a bad score, but maybe some of our 20 features are messing with the Random Forest’s ability to predict.

We’re going to use the SelectKBest selector from Scikit-Learn, which is quite efficient for a simple strategy. We need to add an import first:

from sklearn.feature_selection import SelectKBest, f_classif

SelectKBest() takes two main parameters: a scoring function, here f_classif since we’re doing classification, and k, the number of features you want to keep:

selector = SelectKBest(f_classif, k=10)
# Fit on the train set only; re-fitting on the test set would leak its labels
data_X_train = selector.fit_transform(data_X_train, data_Y_train)
data_X_test = selector.transform(data_X_test)

Now data_X_train and data_X_test contain 10 features each, selected on the training set using the f_classif score.
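If you’re curious which columns survived, the fitted selector exposes a boolean mask of the retained features. Here is a small sketch, where featureNames is a hypothetical list of the original column names saved before fit_transform (which returns a bare array):

# featureNames is assumed to hold the original column names,
# captured before the DataFrames were turned into arrays
mask = selector.get_support()
kept = [name for name, keep in zip(featureNames, mask) if keep]
print("Kept features: " + str(kept))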

Finally, the score I got with my XMR-BTC dataset is 0.60; an improvement of 7 percentage points is pretty nice for such a basic feature selection. I picked k=10 arbitrarily, but you can loop over different values to determine the best number of features to keep. Be careful of overfitting, though!
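Here is a minimal sketch of such a loop, reusing the imports from earlier and assuming data_X_train and data_X_test still hold the full 20-feature matrices; note that scoring each k on the test set is itself a mild form of overfitting, so a separate validation split would be safer:

# Try several values of k and compare the resulting test accuracy
for k in range(2, 21, 2):
    selector = SelectKBest(f_classif, k=k)
    X_train_k = selector.fit_transform(data_X_train, data_Y_train)
    X_test_k = selector.transform(data_X_test)
    clf = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X_train_k, data_Y_train)
    print(k, accuracy_score(data_Y_test, clf.predict(X_test_k)))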