In a previous post, we announced the partnership between Continuum Analytics and wise.io to bring fast and memory-efficient machine learning to data scientists and programmers. wiseRF, the implementation of the Random Forest algorithm from wise.io, is currently available on Anaconda Pro. As you’ll see, even this entry tier of wiseRF outperforms other implementations available in the wild.
In this follow-up post, we show how easy it is to use wiseRF within Python to learn a prediction model on data and generate predictions for future data. In a few lines of code, wiseRF can be deployed to ask deep questions about complex, noisy, and big data.
We also benchmark the performance of wiseRF on two different data sets, demonstrating that it enjoys an order-of-magnitude advantage in training speed over the random forest implementation in scikit-learn. This allows data scientists to build workflows that:
- search for and find the optimal prediction model in an order of magnitude less time,
- re-fit the model more frequently on streaming data to get the most up-to-date insight into their data, and
- train random forest models on extremely large data sets where other methods fail.
How to use wiseRF
Here, we demonstrate how to use wiseRF to train a classifier on R.A. Fisher’s famous Iris data set and to use that classifier to predict the label (type of Iris) for each new iris from the lengths and widths of the sepal and petal of each flower. The challenge in this problem is to discover the appropriate boundaries between the three different iris species in the 4-dimensional data. The Iris data is a tiny dataset so we use it here to show the basic functionality and the baseline improvements you can get with wiseRF over other codes.
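Before diving into the code, it helps to recall what a random forest does at prediction time: each decision tree in the ensemble votes for a class, and the forest returns the majority class. A minimal pure-Python sketch of that voting step (this is conceptual only, not the wiseRF API; the "trees" here are toy stand-in callables that threshold on petal length):

```python
from collections import Counter

def forest_predict(trees, x):
    # Each tree casts a vote for a class label; the forest
    # returns the class with the most votes.
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three toy "trees" that classify an iris by its petal length (x[2])
toy_trees = [
    lambda x: 'setosa' if x[2] < 2.5 else 'versicolor',
    lambda x: 'setosa' if x[2] < 2.0 else 'versicolor',
    lambda x: 'versicolor',
]

# Two of the three toy trees vote 'setosa' for this flower
print(forest_predict(toy_trees, [5.1, 3.5, 1.4, 0.2]))
```

Because the trees are trained independently on bootstrap samples, both training and prediction parallelize naturally across cores, which is what the n_jobs options below exploit.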
In this demo, we are using scikit-learn version 0.12.1 and wiseRF version 1.1. First we load the Iris data set from scikit-learn and split it into a random 90% training set and 10% testing set:
import random
import numpy as np
from sklearn.datasets import load_iris

# Load the data. Sklearn has some convenient methods for this.
data = load_iris()
inds = np.arange(len(data.data))

# Make a synthetic 90% training / 10% testing set
test_i = random.sample(xrange(len(inds)), int(0.1 * len(inds)))
train_i = np.delete(inds, test_i)
print "%d instances in training set, %d in test set" \
    % (len(train_i), len(test_i))

# The training and testing features (X) and classes (y)
X_train = data.data[train_i, :]
y_train = data.target[train_i]
X_test = data.data[test_i, :]
y_test = data.target[test_i]
Now, we can fit a wiseRF random forest model on the training set with a few simple lines of code:
from PyWiseRF import WiseRF

# Build a 10-tree classifier and fit it on the training set
rf = WiseRF(n_estimators=10)
rf.fit(X_train, y_train)
Average training time = 1.21 ms. In comparison, scikit-learn's random forest takes 6.24 ms on a single core.
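Average timings like these are straightforward to reproduce. A simple sketch of an averaging harness (our own helper, not part of wiseRF; the model argument is any object with a scikit-learn-style fit method, which includes WiseRF):

```python
import time

def average_fit_time(model, X, y, n_repeats=10):
    """Fit the model repeatedly and return the mean wall-clock time in seconds."""
    elapsed = []
    for _ in range(n_repeats):
        t0 = time.time()
        model.fit(X, y)
        elapsed.append(time.time() - t0)
    return sum(elapsed) / len(elapsed)
```

Repeating the fit and averaging smooths out one-off effects such as caches warming up or background processes stealing the CPU.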
Once we have fit the model, we can easily predict on the testing data and evaluate the predictive performance of the wiseRF classifier:
# Predict classes for the testing data
ypred_test = rf.predict(X_test)

# Evaluate the accuracy of the classifier on the testing data
print "Accuracy score: %0.2f" % rf.score(X_test, y_test)
Accuracy score = 1.00, meaning that 100% of the testing classifications are correct.
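The score method follows the scikit-learn convention: it returns the mean accuracy, i.e. the fraction of test instances whose predicted class matches the true class. Written out as a plain-Python equivalent (assuming equal-length label sequences):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the true labels
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

# 3 of 4 predictions correct -> 0.75
print(accuracy([0, 1, 2, 1], [0, 1, 2, 2]))
```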
To take advantage of multiple cores in training the wiseRF model, simply specify the n_jobs keyword when constructing the WiseRF classifier. Setting n_jobs = -1 uses all available cores.
# Fit a 1000-tree random forest on 1 core
rf = WiseRF(n_estimators=1000, n_jobs=1)
rf.fit(X_train, y_train)

# Fit a 1000-tree random forest on 4 cores
rf_multi = WiseRF(n_estimators=1000, n_jobs=4)
rf_multi.fit(X_train, y_train)
Compute time on a single core = 122.12 ms (555.41 ms in scikit-learn)
Compute time on 4 cores = 59.24 ms (1217.49 ms in scikit-learn with n_jobs = 4)
Benchmarks: wiseRF versus scikit-learn
We use a slightly larger data set to compare the performance of wiseRF to scikit-learn. The MNIST Handwritten Digits data set consists of 70,000 pixelated images of handwritten digits, from 0 through 9, each image measuring 28-by-28 pixels. The classification goal is to predict the true digit from the raw pixel values of an image. To perform the comparison, we use 63,000 images as training data and a random 7,000 as testing data.
import random
import numpy as np
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')

# Define training and testing sets
inds = np.arange(len(mnist.data))
test_i = random.sample(xrange(len(inds)), int(0.1 * len(inds)))
train_i = np.delete(inds, test_i)

X_train = mnist.data[train_i].astype(np.double)
y_train = mnist.target[train_i].astype(np.double)
X_test = mnist.data[test_i].astype(np.double)
y_test = mnist.target[test_i].astype(np.double)
We time the whole process of training the random forest on the 63k training digits and predicting (& returning an accuracy score) on the 7k testing digits. We do this both for scikit-learn and wiseRF, first for a single core:
# scikit-learn, single core, MNIST data
from time import time
from sklearn.ensemble import RandomForestClassifier

t1 = time()
rf = RandomForestClassifier(n_estimators=10, n_jobs=1)
rf.fit(X_train, y_train)
score = rf.score(X_test, y_test)
t2 = time()
dt = t2 - t1
print "Accuracy: %0.2f\t%0.2f s" % (score, dt)
scikit-learn: Accuracy = 95%, Total training & prediction time = 121.14 s
# wiseRF, single core, MNIST data
t1 = time()
rf = WiseRF(n_estimators=10, n_jobs=1)
rf.fit(X_train, y_train)
score = rf.score(X_test, y_test)
t2 = time()
dt = t2 - t1
print "Accuracy: %0.2f\t%0.2f s" % (score, dt)
wiseRF: Accuracy = 94%, Total training & prediction time = 16.89 s
On a single core, wiseRF enjoys a factor of 7 boost in speed over scikit-learn with a comparable accuracy.
On 4 cores, the performance metrics are:

scikit-learn: Accuracy = 95%, Total training & prediction time = 49.51 s
wiseRF: Accuracy = 94%, Total training & prediction time = 6.61 s

giving wiseRF a 7.5x advantage in speed over scikit-learn.
For the two problems shown above, wiseRF is at least 5x faster, and sometimes as much as 100x faster, than scikit-learn's random forest, with the improvement factor depending on the number of trees and the number of cores used for training. For the MNIST data, wiseRF on a single core outperforms scikit-learn on 4 cores in speed, by a factor of 3. Additionally, wiseRF shares the data set amongst all cores, so it has one quarter of the memory requirement of scikit-learn on a 4-core machine.
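As a sanity check, the speedup factors quoted in this post follow directly from the MNIST timings reported in the benchmark above:

```python
# MNIST total training & prediction times (seconds) reported above
sklearn_1core, wiserf_1core = 121.14, 16.89
sklearn_4core, wiserf_4core = 49.51, 6.61

print("1 core:  %.1fx" % (sklearn_1core / wiserf_1core))   # ~7.2x
print("4 cores: %.1fx" % (sklearn_4core / wiserf_4core))   # ~7.5x
print("wiseRF 1 core vs scikit-learn 4 cores: %.1fx"
      % (sklearn_4core / wiserf_1core))                    # ~2.9x
```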
In a future post, we will detail the memory efficiency of wiseRF and demonstrate that it can train on REALLY big data sets where other random forest implementations fail. This power to train on tremendously large data sets gives you the ability to unleash the knowledge in ALL of your data. To train on larger data sets, wise.io offers a version of WiseRF that is not limited in any way. With WiseRF Oak, you can build classifiers on millions of instances on your ultrabook and scale to 64+ core machines in the cloud. See our website or contact us to find out more.