How to use a Random Forest classifier in Python using Scikit-Learn

Random Forest is a powerful machine learning algorithm that can be used as a regressor or as a classifier. It’s a meta estimator, meaning it fits a specified number of decision trees and combines their predictions.
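
To make the "meta estimator" idea concrete, here is a minimal sketch on synthetic data (generated with make_classification, so not market data). The variable names are only for illustration. It shows that a fitted forest holds one decision tree per estimator, and that its class probabilities are the average of the trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic toy data, only for illustration (not market data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# The fitted forest exposes its individual decision trees
n_trees = len(forest.estimators_)

# Average the class probabilities predicted by each tree
tree_probas = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# The forest's own probabilities should match that average
forest_probas = forest.predict_proba(X)
same = np.allclose(tree_probas, forest_probas)
print(n_trees, same)
```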

We’re going to use Scikit-Learn, a very useful Python library which contains a lot of machine learning algorithms and related tools.

Data preparation

To see how Random Forest can be applied, we’re going to try to predict the S&P 500 futures (E-Mini). You can get the data for free on Quandl. Here is what it looks like:

Date        Open     High     Low     Last     Change  Settle   Volume     Previous Day Open Interest
2016-12-30  2246.25  2252.75  2228.0  2233.5   8.75    2236.25  1252004.0  2752438.0
2016-12-29  2245.5   2250.0   2239.5  2246.25  0.25    2245.0   883279.0   2758174.0
2016-12-28  2261.25  2267.5   2243.5  2244.75  15.75   2245.25  976944.0   2744092.0

The Change column needs to be removed since it has missing data, and this information can be recovered directly by subtracting the D-1 close from the D close.
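
As a quick check of that subtraction, here is a small sketch using only the Settle values of the three sample rows shown above. It assumes Settle is the closing price to use and that rows are sorted newest first, as in the sample.

```python
import pandas as pd

# Settle values of the three sample rows shown above (newest first)
settle = pd.Series(
    [2236.25, 2245.0, 2245.25],
    index=pd.to_datetime(["2016-12-30", "2016-12-29", "2016-12-28"]),
    name="Settle",
)

# Rows are newest first, so shift(-1) brings the previous day's settle
# onto the current row; the oldest row has no previous day and gets NaN
change = settle - settle.shift(-1)
print(change)
```

For the first two rows this gives -8.75 and -0.25. The Change column in the sample shows 8.75 and 0.25, so it appears to store the magnitude without the sign.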

Since it’s a classifier, we need to assign a class to each row: 1 if the future went up that day, -1 if it went down or stayed the same.

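import numpy as np
import pandas as pd

def computeClassification(actual):
    # 1 = up, -1 = down or unchanged
    if actual > 0:
        return 1
    else:
        return -1

# Load the CSV with the Date column as index (DataFrame.from_csv no longer exists in recent pandas)
data = pd.read_csv('EMini.csv', sep=',', index_col=0, parse_dates=True)

# Remove the Change column, which contains missing data
data = data.drop(columns=['Change'])

# Compute the daily returns in percent
# (rows are sorted newest first, so shift(-1) is the previous day)
data['Return'] = (data['Settle'] / data['Settle'].shift(-1) - 1) * 100

# Delete the last line which contains NaN
data = data.drop(data.tail(1).index)

# Compute the last column (Y) -1 = down, 1 = up
data['Return'] = data['Return'].apply(computeClassification)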
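
The labelling step can also be written without a helper function. The sketch below uses np.where on a few made-up return values (hypothetical numbers, not taken from the dataset) and applies the same rule: 1 for a strictly positive return, -1 otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns in percent, made up for illustration
returns = pd.Series([0.42, -0.39, 0.0, 1.10])

# 1 when the return is strictly positive, -1 otherwise (down or unchanged)
labels = pd.Series(np.where(returns > 0, 1, -1), index=returns.index)
print(labels.tolist())  # [1, -1, -1, 1]
```

Note that a NaN return would also be labelled -1 here, so the row containing NaN still needs to be dropped first.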

Now that we have a complete dataset with a value to predict, the last column “Return”, which is either -1 or 1, let’s create the train and test datasets.

# Integer division: slice bounds must be integers in Python 3
half = len(data) // 2
testData = data[-half:]   # 2nd half
trainData = data[:-half]  # 1st half

# X is the list of features (every column except the last one, Return)
data_X_train = trainData.iloc[:, :-1]
# Y is the value to be predicted (the last column)
data_Y_train = trainData.iloc[:, -1]

# Same thing for the test dataset
data_X_test = testData.iloc[:, :-1]
data_Y_test = testData.iloc[:, -1]
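
As an alternative to slicing by hand, Scikit-Learn's train_test_split can produce the same kind of ordered split when shuffle=False. This is only a sketch on a hypothetical miniature dataset; the column values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset: two features and a label column
frame = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "Settle": [10.5, 10.8, 12.4, 13.1, 13.9, 15.2],
    "Return": [1, -1, 1, 1, -1, 1],
})
X = frame.drop(columns=["Return"])
y = frame["Return"]

# shuffle=False keeps the row order: the first half is used for training
# and the second half for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)
print(len(X_train), len(X_test))
```

Keeping the row order matters for time series, since shuffling would mix rows from different periods between the two sets.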

Using the algorithm

Once we have everything ready we can start fitting the Random Forest classifier against our train dataset:

from sklearn import ensemble

# I picked 100 arbitrarily; we'll see in another post how to find the optimal value for the number of estimators
clf = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(data_X_train, data_Y_train)

predictions = clf.predict(data_X_test)

predictions is an array containing the predicted values (-1 or 1) for the features in data_X_test.
You can measure the prediction accuracy with the function accuracy_score, which compares the predicted values with the expected ones.

from sklearn.metrics import accuracy_score

print("Score: " + str(accuracy_score(data_Y_test, predictions)))
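
To see what this score means, here is a minimal sketch with made-up labels (hypothetical values, not model output). With its default settings, accuracy_score returns the fraction of predictions that match the expected values.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up for illustration
expected = np.array([1, -1, 1, 1])
predicted = np.array([1, 1, 1, -1])

# Fraction of positions where the prediction equals the expected value
score = accuracy_score(expected, predicted)
manual = np.mean(expected == predicted)
print(score, manual)
```

Two of the four predictions match, so both values are 0.5.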

What’s next?

For example, you can now create a trading strategy that goes long the future if the predicted value is 1 and goes short if it’s -1. This can be easily backtested using a backtest engine such as Zipline in Python.
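
Before reaching for a full backtest engine, the long/short rule can be sketched in a few lines of NumPy. The signals and returns below are hypothetical numbers, and the sketch ignores transaction costs, slippage and margin, so it is only a rough illustration rather than a real backtest.

```python
import numpy as np

# Hypothetical predictions (1 = long, -1 = short), made up for illustration
signals = np.array([1, -1, -1, 1])
# Hypothetical realised daily returns of the future, in percent
daily_returns = np.array([0.5, 0.2, -0.3, -0.1])

# A long position earns the return, a short position earns its opposite
strategy_returns = signals * daily_returns

# Compound the daily strategy returns into a total return (as a fraction)
total_return = np.prod(1 + strategy_returns / 100) - 1
print(strategy_returns, total_return)
```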
Based on your backtest results, you could add or remove features: maybe the volatility or the 5-day moving average can improve the prediction accuracy?
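
Here is one possible way to build those two extra features with pandas rolling windows. It is a sketch on made-up prices sorted oldest first; the column names MA5 and Vol5 are hypothetical, and "volatility" is taken here to mean the 5-day standard deviation of daily returns, which is only one of several possible definitions. The E-Mini data above is sorted newest first, so it would need data.sort_index() before applying the same rolling windows.

```python
import pandas as pd

# Hypothetical settle prices, oldest first, made up for illustration
settle = pd.Series(
    [100.0, 101.0, 102.0, 101.0, 103.0, 104.0, 102.0],
    index=pd.date_range("2017-01-02", periods=7, freq="D"),
    name="Settle",
)

features = pd.DataFrame({"Settle": settle})

# 5-day moving average of the settle price
features["MA5"] = settle.rolling(window=5).mean()

# 5-day volatility: standard deviation of the daily percentage changes
features["Vol5"] = settle.pct_change().rolling(window=5).std()

# The first rows lack a full window and contain NaN, so drop them
features = features.dropna()
print(features)
```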