
Using feature selection to improve a machine learning strategy

For this tutorial, we’re going to assume we have the same basic structure as in the previous article about Random Forests. The idea is to do some feature engineering to generate a bunch of features; some of them may be useless and drag down the machine learning algorithm’s prediction score, and that’s where feature selection comes into action.

Feature engineering

This is not an attempt at perfect feature engineering; we just want to generate a good number of features and pick the most relevant ones afterwards. Depending on the dataset you have, you can create more interesting features such as the day, the hour, whether it’s the weekend or not, etc.
Let’s assume we only have one column, ‘Mid’, which is the mid price between the bid and the ask. We can generate moving averages for various windows, 5 to 50 for example. The code is quite simple using pandas:

for i in range(5, 55, 5):
    data["mavgMid" + str(i)] = data["Mid"].rolling(window=i, min_periods=1).mean()

This way we get new columns: mavgMid5, mavgMid10 and so on.
We can also do the same for the moving standard deviation, which can be useful for a machine learning algorithm; it’s almost the same code as above:

for i in range(5, 55, 5):
    data["stdMid" + str(i)] = data["Mid"].rolling(window=i, min_periods=1).std()

We can continue with various rolling indicators; see the full list in the pandas documentation. I personally like the rolling correlation (.rolling().corr()) because in the crypto-currency world, correlation is very volatile and contains a lot of information, especially for inter-exchange arbitrage opportunities. In this case you need to add another column with prices from another source, as in the sketch below.
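
For illustration only, here is what that could look like, assuming a hypothetical ‘MidOther’ column holding mid prices from a second exchange, aligned on the same timestamps as ‘Mid’:

# 'MidOther' is a hypothetical column with mid prices from a second exchange,
# aligned on the same timestamps as 'Mid'.
for i in range(5, 55, 5):
    data["corrMid" + str(i)] = data["Mid"].rolling(window=i, min_periods=1).corr(data["MidOther"])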

Here is an example of a full function:

def featureEngineering(data):
    # Moving average over windows 5, 10, ..., 50
    for i in range(5, 55, 5):
        data["mavgMid" + str(i)] = data["Mid"].rolling(window=i, min_periods=1).mean()

    # Rolling standard deviation over the same windows
    for i in range(5, 55, 5):
        data["stdMid" + str(i)] = data["Mid"].rolling(window=i, min_periods=1).std()

    # Remove the first 50 rows since 50 is our max window
    data = data.drop(data.head(50).index)

    return data
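
As a quick sanity check, here is how the function can be tried on synthetic data; the Bid/Ask columns and the random walk below are made up purely for illustration:

import numpy as np
import pandas as pd

# Hypothetical example data: a random-walk mid price built from fake quotes.
np.random.seed(0)
bid = np.cumsum(np.random.randn(500)) + 100
ask = bid + 0.05
raw = pd.DataFrame({"Bid": bid, "Ask": ask})
raw["Mid"] = (raw["Bid"] + raw["Ask"]) / 2

data = featureEngineering(raw)
print(data.shape)    # 450 rows left after dropping the first 50
print(data.columns)  # Bid, Ask, Mid plus the mavgMid* and stdMid* columns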

Feature selection

After the feature engineering step we should have 20 features (+1 Signal feature). I ran the algorithm with the same parameters as in the previous article, but on XMR-BTC minute data over a week fetched with the CryptoCompare API (tutorial to come soon), and I got a decent score of 0.53.

That’s a good score, but maybe some of our 20 features are getting in the way of the Random Forest’s ability to predict.

We’re going to use the SelectKBest transformer from scikit-learn, which is quite efficient for a simple strategy. We need to add an import to the code first:

from sklearn.feature_selection import SelectKBest, f_classif

SelectKBest() takes two parameters at a minimum: a scoring function, here f_classif since we’re doing classification with the Random Forest Classifier, and k, the number of features you want to keep:

# Fit the selector on the training set only, then apply the same
# selection to the test set to avoid leakage and mismatched columns.
selector = SelectKBest(f_classif, k=10)
data_X_train = selector.fit_transform(data_X_train, data_Y_train)
data_X_test = selector.transform(data_X_test)

Now data_X_train and data_X_test contain the same 10 features, selected according to their f_classif scores on the training set.
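
If you want to check which columns actually survived the selection, the fitted selector exposes get_support(). Here is a small sketch, where feature_columns is a hypothetical variable holding the column names saved before fit_transform() turned the DataFrame into a NumPy array:

# feature_columns is assumed to have been saved earlier, e.g.
# feature_columns = data_X_train.columns, before the transform above.
mask = selector.get_support()  # boolean mask over the original features
print([col for col, keep in zip(feature_columns, mask) if keep])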

Finally, the score I got with my XMR-BTC dataset is 0.60; roughly 7 percentage points is a pretty nice improvement for such a basic feature selection. I picked 10 arbitrarily as the number of features to keep, but you can loop over different values of k to determine the best number of features, as in the sketch below. Just be careful of overfitting!
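
To make that search safer, you can score each candidate k with cross-validation on the training set only. This is a minimal sketch under a few assumptions: data_X_train and data_Y_train still contain all 20 engineered features (run it instead of the hard-coded k=10 transform above), the RandomForestClassifier parameters are placeholders rather than the ones from the previous article, and TimeSeriesSplit is used because the rows are ordered minute data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Try several values of k and keep the one with the best cross-validated
# score, computed on the training data only.
best_k, best_score = None, -1.0
for k in range(2, 21, 2):
    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=k)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    score = cross_val_score(pipe, data_X_train, data_Y_train,
                            cv=TimeSeriesSplit(n_splits=5)).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)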