Last month, wise.io co-founder Joey wrote a blog post on the principles of doing proper benchmarks of machine learning frameworks. Here, I start to put those principles into practice, presenting the first in a series of blog posts on ML benchmarking. For those short of time, you can jump to conclusions.
For this benchmark, I focus on comparing accuracy and speed of four Random Forest®1 classifier implementations, including the high-performance WiseRF™. In follow-up posts we will cover random forest regression and benchmark against other machine learning algorithms. We will also benchmark memory usage across different implementations — another very important, but often overlooked aspect of benchmarking, especially when it comes to “big data.”
We created a standardized benchmarking platform to compare the accuracy and speed of the following random forest implementations:
- WiseRF™ version 1.5.8; download an evaluation version here
- scikit-learn version 0.13.1, Python version 2.7.3
- H2O version ‘fourier 1’, OpenJDK 7
- R (randomForest) version 4.6-7, R version 2.15.3
All of the tools, with the exception of the randomForest R package, are multi-threaded and were parallelized across available cores. H2O has the option of running a distributed (multi-machine) implementation, but I considered herein the more common single-node workstation for this benchmarking exercise.
All the benchmarks were run on the same Amazon EC2 instance with the following specs:
- Ubuntu Server 12.04 LTS (64 bit)
- M2 High-Memory Quadruple Extra Large (m2.4xlarge)
- 26 ECUs, 8 vCPUs, 68.4 GB RAM
- Java max heap size was set to 60 GB for H2O.
- The OS was installed on an 8 GB hard drive, but the data, binaries and benchmarking code were hosted on a 200 GB mounted Amazon Elastic Block Store (EBS) volume.
- Since EC2 instances do not come with a swap area by default, an additional 70 GB EBS volume was attached and designated as swap (see below).
Before each run, we cleared the pagecache, dentries and inodes and wrote any data buffered in memory to disk using the following command:
sudo bash -c "sync; echo 3 > /proc/sys/vm/drop_caches"
No other processes were executed during the benchmarking.
Datasets: Big and Small
For our first example, we chose a familiar multi-class benchmarking dataset, the Forest Covertype dataset from the UCI Machine Learning Repository. This dataset contains 581,012 instances of 54 mixed-type features, which can be used to classify forest cover type as one of seven classes. The data file, in csv format, is 72 MB. We created a training set containing 80% randomly chosen instances (464,809) and a test set of the remaining 116,203 instances. Although the Covertype dataset is larger than most datasets used in a typical machine learning benchmarking exercise, it is still a toy example compared to many real-world machine learning problems. When it comes to benchmarking, it’s important to use a range of dataset types and sizes.
For our second example, we used the larger version of the popular MNIST handwritten digit recognition dataset, which is available here in the sparse libsvm format. There are 8.1 million instances, 784 features and ten classes. The size of the sparse data file is 12 GB and the dense (csv) version is 14 GB. A training set of 7 million instances was chosen randomly, leaving the remaining 1.1 million instances for a test set. We will see below that WiseRF™ was the only tool that successfully trained a random forest model from the big MNIST dataset on our benchmarking server, prior to adding an extra drive to be used as virtual memory. After the other tools failed, we decided to attach an extra 70 GB EBS volume to use as a swap area. With this server upgrade, H2O was able to train a model, but scikit-learn and R still ran out of memory.
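To give a concrete sense of the sparse libsvm format mentioned above, here is a minimal sketch of parsing it with scikit-learn's load_svmlight_file; the two-row sample below is a toy example of ours, not drawn from the actual MNIST file:

```python
from io import BytesIO
from sklearn.datasets import load_svmlight_file

# Two toy instances in libsvm sparse format: "label index:value index:value ..."
sample = b"3 1:0.5 10:1.0\n7 2:0.25 784:0.8\n"

# zero_based=False: libsvm feature indices conventionally start at 1
X, y = load_svmlight_file(BytesIO(sample), n_features=784, zero_based=False)
print(X.shape, y.tolist())  # sparse (2, 784) matrix, labels [3.0, 7.0]
```

Only the non-zero entries are stored, which is why the sparse MNIST file (12 GB) is smaller than its dense csv counterpart (14 GB).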
A random seed was used for reproducibility of the train/test split. A tarball of the benchmarking code, which includes the specific train/test splits, is linked below.
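The seeded split can be sketched as follows using the modern scikit-learn API (in version 0.13 this function lived in sklearn.cross_validation); the synthetic matrix here is a small stand-in for the real Covertype data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 581,012 x 54 Covertype matrix
rng = np.random.RandomState(42)
X = rng.rand(1000, 54)
y = rng.randint(1, 8, size=1000)  # seven cover-type classes

# A fixed random_state makes the 80/20 split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape[0], X_test.shape[0])  # 800 200
```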
Keep it Simple: Default parameters
As Joey mentioned in his previous blog post, it’s common for some users to use only the default parameters specified by the implementation. Thus, it’s important for software authors to do a good job choosing the default values. We started our benchmarking exercise by training a 50-tree forest for each tool using all of the default parameter values. The classification error on the Covertype test set for WiseRF™, scikit-learn, H2O and R was 3.154%, 3.088%, 4.041% and 3.463%, respectively. However, experienced users will evaluate at least a handful of parameter combinations in order to achieve higher accuracy.
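In scikit-learn terms, the default-parameter run looks roughly like this; synthetic data stands in for the actual Covertype split, so the error it prints is not meaningful:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the Covertype train/test split
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(500, 54), rng.randint(1, 8, 500)
X_test, y_test = rng.rand(200, 54), rng.randint(1, 8, 200)

# 50 trees; every other parameter is left at the library's defaults
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
error = 1.0 - clf.score(X_test, y_test)
print("test error: %.3f" % error)
```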
Explore the Grid: Tuning parameters
A parameter people often choose to explore in a random forest model is mtry, the number of variables (features) randomly sampled as candidates at each split of a decision tree. Another parameter we considered is the splitting criterion. WiseRF™, scikit-learn and H2O all offer both Gini impurity and information gain (entropy) as splitting criteria for decision tree classification; they refer to this parameter as criteria, criterion and statType, respectively. R’s randomForest package is based on Leo Breiman and Adele Cutler’s original Fortran code, which grows trees using the traditional CART algorithm. Thus, it does not provide the option of changing the splitting criterion and uses Gini by default.
Both WiseRF™ and R’s randomForest package support sampling with and without replacement to create the datasets from which individual trees are grown, and allow the user to specify the sampling rate. The H2O implementation uses sampling without replacement with a user-specified sampling rate, but does not currently support traditional bootstrapping (sampling with replacement). The scikit-learn implementation supports sampling with and without replacement through its binary bootstrap option, but does not provide an option to specify the sampling rate.
For this exercise, we trained models across a grid of various mtry values, the two splitting criteria mentioned above (when available), and a range of sampling strategies. All remaining model parameters were fixed at their default values. We found that on both datasets, information gain (entropy) was the better choice of splitting criterion. This demonstrates a disadvantage of the R implementation, which does not currently support an entropy-based criterion. For the Covertype dataset, in addition to the two splitting criteria, we considered mtry = 7, 9, 11, 13, 15, 17, 19 paired with sampling rates of 67, 80, 90, 95 and 100 percent (without replacement), as well as traditional bootstrapping when available. Below we report the performance of the most accurate model, as evaluated on the test set, for each of the four implementations.
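As a rough sketch of what such a grid looks like in scikit-learn, which calls mtry max_features and, as noted, exposes only a binary bootstrap flag rather than a sampling rate (synthetic data, smaller grid):

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in the real exercise this is the Covertype split
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(400, 54), rng.randint(1, 8, 400)
X_test, y_test = rng.rand(100, 54), rng.randint(1, 8, 100)

# Grid over mtry (max_features) x splitting criterion x bootstrap flag
best_params, best_err = None, 1.0
for mtry, criterion, bootstrap in product(
        [7, 9, 11], ["gini", "entropy"], [True, False]):
    clf = RandomForestClassifier(
        n_estimators=10, max_features=mtry, criterion=criterion,
        bootstrap=bootstrap, random_state=0).fit(X_train, y_train)
    err = 1.0 - clf.score(X_test, y_test)
    if err < best_err:
        best_params, best_err = (mtry, criterion, bootstrap), err
print("best:", best_params, "error: %.3f" % best_err)
```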
Using the parameter grid defined above, we trained a random forest for each combination of parameters and evaluated the classification error on the test set. WiseRF™ achieved the lowest test error using mtry=9, a sampling rate of 90% (i.e., with-replacement=false, lambda=0.9) and criteria=infogain. H2O’s best parameters were mtry=19, a 95% sampling rate and statType=entropy. For scikit-learn, bootstrap=False achieved the lowest test error. Lastly, R’s best model used mtry=19 with a 100% sampling rate (i.e., sampsize equal to the number of rows).
The following plot shows the minimal classification error observed on the test set across all models for a 50-tree random forest trained on the Covertype data.
By default, WiseRF™ will use an optimal number of threads to maximize the workload across multiple cores. Scikit-learn can also be parallelized using the n_jobs argument, but is single-threaded by default. H2O has a binary parallel option, which is set to 1 (parallel) by default. With the exception of the R implementation, we parallelized each run across the 8 cores of our benchmarking server. For WiseRF™ and scikit-learn, we used 10 threads.
Timing for both training and prediction (on the test set) is self-reported in the log files of WiseRF™ and H2O. For scikit-learn and R, we computed these times manually in our benchmarking code. Below are the training and prediction times for the models with minimal classification error on the test set for the Covertype data. Recall that the train and test set sizes are 464,809 and 116,203 instances, respectively.
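For scikit-learn, the manual timing amounts to wrapping fit and predict with a wall clock; a minimal sketch on synthetic data:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(500, 54), rng.randint(1, 8, 500)
X_test = rng.rand(200, 54)

clf = RandomForestClassifier(n_estimators=10, random_state=0)

t0 = time.time()
clf.fit(X_train, y_train)   # wall-clock training time
train_time = time.time() - t0

t0 = time.time()
clf.predict(X_test)         # wall-clock prediction time
predict_time = time.time() - t0
print("train: %.3fs  predict: %.3fs" % (train_time, predict_time))
```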
We see that WiseRF™ is leading the pack in terms of both training time and prediction speed. An interesting observation is that R actually performs reasonably well in terms of prediction speed, although as expected, R is much slower when it comes to training time. H2O appears to have some bottlenecks when it comes to prediction, but is relatively fast in the model training phase. WiseRF™ is several orders of magnitude faster than scikit-learn on this dataset. With certain data types, WiseRF™ can achieve even greater speed efficiencies than exhibited in the Covertype dataset comparison.
Graduating to “Big Data”
The Covertype dataset is a “small data” example that allows us to compare WiseRF™ with several common random forest implementations. However, WiseRF™ was designed from the ground up with scalability in mind, so it has an even greater ability to shine when it comes to “big data.” As described above, there is an XL version of the popular MNIST dataset containing 8.1 million instances, 7 million of which we used for training.
WiseRF™ was the only tool that was capable of training a model on the 14 GB MNIST dataset on our original benchmarking server. Since EC2 instances do not come with a swap area by default, we decided to attach an extra 70 GB EBS volume to be used as a swap area, bringing the total virtual + physical memory size to ~140 GB. After adding a large swap area, H2O was able to train a 10-tree model, but the other tools failed. In scikit-learn and R, we ran into segmentation faults, so the bottleneck with these tools is memory. The benchmarking server in its current configuration is not sufficient to support learning (even a very small) random forest with either of these tools.
For WiseRF™ and H2O, we trained a 10-tree forest using information gain (entropy) as the splitting criterion,
mtry = 27, 29, 31, 33, 35, 37, 39 and the same sampling rates as above. We found that a sampling rate of 100%, without replacement, produced the lowest classification error on the test set for both WiseRF™ and H2O. This suggests that the random feature selection at each split introduces enough randomness into the model, and that using the original dataset to grow individual trees is beneficial, at least for this example. The lowest test error observed for WiseRF™ and H2O was 0.0741% and 0.0755%, respectively — very close. However, the training and prediction times for WiseRF™ were orders of magnitude faster than for H2O. Below is a plot of the training and prediction times for the optimal model (as measured by test error) for WiseRF™ and H2O.
If your constraint is time, and for most people it is (!), you can train a much more accurate model using WiseRF™ (by adding more trees) in the same amount of time it takes to train a less accurate model in H2O. In the same amount of time it takes for H2O to generate a 10-tree forest (0.0755% test error), WiseRF™ can train a 40-tree forest with nearly a five-fold reduction in test error (0.0158%). Bear in mind that H2O also has a distributed random forest that runs on a cluster, however our benchmarking server was designed to model a high-memory, single-node setting.
It’s important to note here just how good the accuracies are when training on this large dataset. For the smaller ~50 MB MNIST dataset, error rates on the order of 1% are considered state of the art. Here we get close to two orders of magnitude better with just 40 trees, in just minutes. This is a clear demonstration of the power of more data (à la Peter Norvig). But better accuracy is only attainable with implementations that don’t buckle under the sheer size of the data and fail to train.
The benchmarking platform code is available in full here. The datasets and binaries are not included in this code bundle, but are easily downloaded through the install scripts. I recommend using the same server configuration that is described above. New datasets or software tools can easily be added for further benchmarking.
Ostensible implementations of Breiman’s algorithm (and extensions thereof) are all manifestly different. Those we tested yielded different accuracies, speeds and memory usage when given the same data and the same parameters. We were somewhat surprised by the differences in accuracy but, in retrospect, perhaps shouldn’t have been.
Given the above, using default parameters does not yield the best performance.
R’s randomForest had the worst accuracy on the Covertype dataset because it only supports the Gini splitting criterion, which underperforms information gain on that dataset.
To perform a fair apples-to-apples comparison, each framework should be trained over a grid search of hyperparameters. Anyone claiming a result relative to other implementations should show evidence that they did more than just use default parameters during the comparison. (In a future post we’ll show how different relative performances can be on different datasets.)
One cannot use R or Python to learn random decision forests on >10 GB datasets on any reasonably sized single computer due to memory overhead. Only H2O and WiseRF™ were even capable of training.
For a fixed amount of time to train, WiseRF™ yielded a 5× improvement in accuracy over the single-node H2O on the 14 GB dataset.
As a data scientist, I love R and Python/scikit-learn, but these don’t appear to be “big data” ready, at least in the context of learning on GB-sized datasets. It’s also pretty clear that when we talk about machine learning on big data, we’re naturally pushed to metrics on speed and accuracy. But, as we’ve seen, it is memory usage that is the bottleneck and ultimate strangler for many implementations. Without resorting to approximations that would weaken predictive power, WiseRF™ uses about 20–50% as much RAM as the dataset size on disk, whereas other implementations use 10–100× as much.
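For readers who want to instrument their own runs, one simple Unix-only way to watch the memory side of a benchmark from Python is to record the process’s peak resident set size before and after the step of interest; this sketch is illustrative, not the harness we used:

```python
import resource

import numpy as np

# ru_maxrss reports peak resident set size: kilobytes on Linux,
# bytes on macOS -- check your platform before comparing numbers
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
X = np.random.rand(5000, 54)  # stand-in for loading a training matrix
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS grew by ~%d" % (after - before))
```

Peak RSS only ever grows, so bracketing each benchmark phase this way gives an upper bound on that phase's memory footprint.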
In future posts, we will be benchmarking WiseRF™ against a multitude of other ML algorithms on a variety of prediction problems. Posts about those benchmarks will be published on our blog and linked here. If you have a particular dataset or ML code that you would like us to test against, or if you’d like to share your own benchmarking results, please contact me.
Random Forest is a trademark of Salford Systems, Inc.↩