Gradient Boosting: The Coolest Kid on the Machine Learning Block
By Jake Hoare
Gradient boosting is a technique attracting attention for its prediction speed and accuracy, especially with large and complex data, as evidenced in the chart below showing the rapid growth of Google searches for xgboost (the best gradient boosting R package). From data science competitions to machine learning solutions for business, gradient boosting has produced best-in-class results. In this blog post I describe what it is and how to use it in Displayr.
Gradient boosting is a type of boosting. Boosting is a method for combining a series of simple individual models to create a more powerful model. Start by fitting an initial model (either a tree or linear regression) to the data. Then build a second model that focuses on accurately predicting the cases where the first model performs poorly. The combination of these two models is expected to be better than either model alone. Repeat the process many times. Each successive model attempts to correct for the shortcomings of the combined ensemble of all previous models.
The best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea behind gradient boosting is to set the target outcomes for this next model in order to minimize the error. The target outcome for each case in the data set depends on how much a change in that case’s prediction impacts the overall prediction error.
If, for case xi, a small change in the prediction causes a large drop in error, then the next target outcome is a high value. Predictions from the new model that are close to its targets will reduce the error.
If, for case xj, a small change in the prediction causes no change in error, then the next target outcome is zero because changing this prediction does not decrease the error.
The name gradient boosting arises because the target outcomes are set based on the gradient of the error with respect to each case's prediction. Each new model takes a step in the direction that minimizes prediction error, in the space of possible predictions for each training case.
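The loop described above can be sketched in a few lines. The following is a toy pure-Python illustration, not the xgboost implementation this post relies on: for squared error, the negative gradient of the loss with respect to each prediction is simply the residual, so each round fits a depth-one tree (a "stump") to the current residuals and adds a damped copy of it to the ensemble. All function and parameter names here are my own.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) on one numeric feature:
    pick the split that minimizes squared error against the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave cases on both sides
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, rounds=20, rate=0.5):
    """Repeatedly fit stumps to the residuals (the negative gradient of
    squared error) and accumulate them into an ensemble predictor."""
    models = []
    preds = [0.0] * len(ys)  # the ensemble's current predictions F(x)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # target outcomes
        stump = fit_stump(xs, residuals)
        models.append(stump)
        preds = [p + rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: sum(rate * m(x) for m in models)
```

With each round, the residuals shrink, so later models make smaller and smaller corrections; the learning rate damps each step so that no single weak model dominates the ensemble.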
Time for a drink
Maybe that explanation was a bit heavy, so let's relax. To be more specific, how about a glass of wine? The data set I will be using to illustrate gradient boosting describes nearly 5,000 Portuguese white wines (described here). You can replicate the analysis used in this post in Displayr. Displayr uses the R xgboost package, and you can do your complete analysis, from data import to charting.
As a preparatory step, I split the data into a training and a testing set. This is done with Insert > Utilities > Filtering > Create Train-Test Split. This allows us to fit the model to the training set and assess its performance on the testing set.
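Conceptually, a train-test split just assigns each case at random to one of two groups. A minimal sketch in Python (the fraction and seed here are arbitrary illustrative choices, not Displayr's defaults):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly partition rows into a training and a testing set.
    A conceptual stand-in for Displayr's Create Train-Test Split."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = rows[:]              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

Fixing the seed matters: if the split changed every run, the accuracy numbers later in the post would not be reproducible.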
Next, I add the gradient booster with Insert > More > Machine Learning > Gradient Boosting, and select quality as the Outcome variable. Quality is the median of at least 3 evaluations made by wine experts, on a scale from 1 to 10.
I use all 11 other variables as the Predictors. These describe the chemical composition of the wine. I produce an Output of Importance with the gbtree Booster. Boosting normally uses an underlying tree model, although linear regression is also available as an option. Finally, I set the filter to Training sample in Home > Filter (Data Selection), and the result is shown below. The importance scores tell us that the alcohol content is by far the most important predictor of quality, followed by the volatile acidity.
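xgboost computes its importance scores from the fitted trees themselves, but one way to build intuition for what an importance score means is permutation importance: shuffle one predictor's column and measure how much accuracy drops. This is a different (model-agnostic) technique from xgboost's, sketched here in pure Python with helper names of my own:

```python
import random

def permutation_importance(predict, X, y, accuracy, seed=0):
    """Score each predictor by how much accuracy falls when that
    column is shuffled, breaking its link to the outcome."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    scores = []
    for j in range(len(X[0])):
        shuffled_col = [row[j] for row in X]
        rng.shuffle(shuffled_col)
        # Rebuild the data with only column j permuted
        X_perm = [row[:j] + [v] + row[j + 1:]
                  for row, v in zip(X, shuffled_col)]
        scores.append(base - accuracy(predict, X_perm, y))
    return scores
```

A predictor the model ignores scores zero: shuffling it changes nothing. A predictor the model leans on heavily, like alcohol here, produces a large drop.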
Now let’s find the accuracy of this model on the test data. To do so:
- Select the output from the gradient boosting.
- Select Insert > More > Machine Learning > Diagnostic > Prediction-Accuracy Table.
- With the prediction accuracy table selected, choose the Testing sample filter in Home > Filter (Data Selection).
The prediction-accuracy table shows the overlap between the categories predicted by the model and the original category of each case within the testing sample. The overall accuracy, shown beneath the chart, is 64.74%. However, there is one more trick to improve on this. Checking the Grid search box on the original model inputs makes the boosting algorithm vary its internal settings in an attempt to find the best settings for this data. This takes more time to run, but accuracy increases to 65.69% as shown below.
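Grid search itself is a simple idea: enumerate every combination of candidate settings, fit a model for each, and keep the combination that scores best. A generic sketch (the `fit` and `evaluate` callables and any parameter names are placeholders of my own, not xgboost's actual tuning parameters):

```python
from itertools import product

def grid_search(fit, evaluate, grid):
    """Fit one model per combination of settings in `grid` (a dict
    mapping parameter name -> list of candidate values) and return
    the best-scoring combination."""
    best_score, best_params = float("-inf"), None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = fit(**params)
        score = evaluate(model)      # e.g. accuracy on held-out data
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

The cost is multiplicative: three candidate tree depths times two candidate learning rates means six models to fit, which is why ticking Grid search makes the analysis noticeably slower.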
How good is this model? For comparison, I'll repeat the analysis with CART instead of gradient boosting. (In Displayr, this is done using Insert > More > Machine Learning > Classification and Regression Trees (CART).) The accuracy is much lower, at around 50%. The resulting Sankey diagram below does agree with the boosting model on the most important variable, since alcohol is used for the first split of the tree. So next time you buy the most alcoholic bottle of wine, you can tell yourself it's really quality you are buying!
The examples in this post use Displayr as a front end for running the R code. If you go into our example document, you can see the outputs for yourself. The code used to generate each output is accessible by selecting the output and clicking Properties > R CODE on the right-hand side of the screen. The heavy lifting is done by the xgboost package, via our own package flipMultivariates (available on GitHub), and the prediction-accuracy tables come from our own flipRegression (also available on GitHub).
TRY IT OUT
You can replicate this analysis for yourself in Displayr.