Boosting, Bagging, Random Forest

Abby
5 min read · Jun 2, 2021

There may be scenarios in which it is not possible to calculate the standard deviation (or, more precisely, the standard error of an estimate) for a dataset directly. Under such conditions the bootstrap becomes useful, and the same idea can also be used to improve statistical learning methods such as decision trees.
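As a minimal R sketch of the first use, assuming only a hypothetical numeric vector x, the bootstrap can estimate the standard error of a statistic (here the median) when no formula is available:

```r
# Bootstrap estimate of the standard error of the sample median.
# x is an assumed placeholder for a numeric data vector.
set.seed(1)
B <- 1000
boot_medians <- replicate(B, median(sample(x, length(x), replace = TRUE)))
sd(boot_medians)  # bootstrap estimate of the standard error of the median
```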

Sometimes decision trees suffer from high variance: if we split the training dataset into two halves and fit a decision tree to each half, the two fits can give quite different results. Linear regression, on the other hand, tends to have low variance when applied repeatedly to distinct datasets. In such scenarios bootstrap aggregation, or bagging, is a useful and effective technique.

As a general concept we know that averaging a set of observations reduces variance. In practice, however, we usually do not have access to multiple training sets, so this is not directly achievable. Instead we bootstrap: we take repeated samples (with replacement) from the single training set, producing B different bootstrapped training sets. We then train our method on the b-th bootstrapped training set to get a prediction f*b(x), and finally average all B predictions: f_bag(x) = (1/B) Σ f*b(x).
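As a rough illustration, here is a minimal R sketch of bagging done by hand with rpart trees; the data frame train_df, its numeric response y, and the one-row data frame new_obs are all assumed placeholders:

```r
# Bagging by hand: fit a tree to each bootstrapped training set,
# then average the B predictions for a new observation.
library(rpart)

set.seed(1)
B     <- 100
n     <- nrow(train_df)
preds <- numeric(B)

for (b in 1:B) {
  idx      <- sample(n, n, replace = TRUE)           # b-th bootstrapped training set
  tree_b   <- rpart(y ~ ., data = train_df[idx, ])   # tree fit to that sample
  preds[b] <- predict(tree_b, newdata = new_obs)     # prediction of the b-th tree
}

mean(preds)  # bagged prediction: the average over all B trees
```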

Bagging can improve predictions and is particularly useful for decision trees. In practice we use a large value of B, large enough that the error has settled down. There is also a very straightforward way to estimate the test error of a bagged model without performing cross-validation, known as the out-of-bag (OOB) error. It comes from the fact that bagging repeatedly fits trees to bootstrapped subsets of the observations: each bagged tree uses, on average, around 2/3 of the observations, and the remaining 1/3 (the out-of-bag observations) are not used to fit that tree, so they can serve as a built-in test set for it.
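In R, the randomForest package reports this OOB error directly, so no separate cross-validation loop is needed. A sketch, again assuming a hypothetical train_df with numeric response y (and mtry = p, i.e. plain bagging):

```r
library(randomForest)

set.seed(1)
p       <- ncol(train_df) - 1
bag_fit <- randomForest(y ~ ., data = train_df, mtry = p,
                        ntree = 500, importance = TRUE)

bag_fit            # the printout reports the OOB estimate of the error
head(bag_fit$mse)  # OOB MSE after 1, 2, ... trees (regression)
```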

One of the biggest advantages of a decision tree is its pictorial representation. That is lost when we bag a large number of trees: we increase prediction accuracy but give up interpretability. Although a collection of bagged trees is much more difficult to interpret than a single tree, we can still obtain a useful summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees). For regression trees we record the total amount by which the RSS decreases due to splits over a given predictor, averaged over all B trees; a large value indicates an important predictor. For classification trees we do the same with the decrease in the Gini index, averaged over all B trees.
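Continuing the hypothetical bag_fit from the previous sketch (which was built with importance = TRUE), these importance summaries can be obtained as follows:

```r
library(randomForest)

importance(bag_fit)  # %IncMSE and IncNodePurity (total RSS decrease) per predictor
varImpPlot(bag_fit)  # plots both importance measures
```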

Now assume we applied bagging and the dataset contains one very strong predictor along with some moderately strong ones. Then, in the collection of bagged trees, most of the trees would use this strong predictor in the top split, and as a result the bagged trees will be highly correlated. To overcome this correlation we use random forests. We again build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, only a random sample of m predictors is chosen as split candidates from the full set of p predictors; typically we choose m ≈ √p.

This matters because averaging many highly correlated quantities does not reduce variance as much as averaging many uncorrelated quantities, so bagging alone would not reduce the variance substantially. If we force each split to consider only a subset of the predictors, we can overcome this problem, and that is exactly what a random forest does: it de-correlates the trees, making their average less variable and more reliable. The main difference between bagging and random forests is therefore the choice of the predictor subset size m: when m = p it is bagging, and when m = √p it is a random forest.
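In code the distinction is just the mtry argument; a sketch with the same hypothetical train_df and numeric response y:

```r
library(randomForest)

set.seed(1)
p       <- ncol(train_df) - 1
bag_fit <- randomForest(y ~ ., data = train_df, mtry = p)               # bagging: m = p
rf_fit  <- randomForest(y ~ ., data = train_df, mtry = floor(sqrt(p)))  # random forest: m ≈ √p

plot(rf_fit)  # error as a function of the number of trees
```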

[Figure: error versus number of trees, showing how the error decreases as the number of trees grows]

A random forest can thus be considered a collection of decision trees. It builds and combines multiple decision trees in order to improve accuracy. It is called "random" because the predictors are chosen randomly at each split and "forest" because multiple decision trees are used to make the prediction. These characteristics help random forests resist overfitting and make the model more robust.

A simple outline of the random forest algorithm (an R code sketch follows the list):
  1. Draw a bootstrap sample (a random sample with replacement) from the training set.
  2. Grow a decision tree from the bootstrap sample.
  3. At each node, randomly select f features.
  4. Split the node on the feature that gives the best split according to the objective function, e.g. information gain.
  5. Repeat the above steps k times (k being the number of trees we want to build from these subsets).
  6. For a new data point, aggregate the predictions from each tree and assign the class label by majority vote.
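A minimal R sketch matching these steps, using the randomForest package (which performs the bootstrap sampling of steps 1-2 and the per-node feature sampling of steps 3-4 internally); the training frame train_df with factor response label and the test frame test_df are assumed placeholders:

```r
library(randomForest)

set.seed(1)
f      <- floor(sqrt(ncol(train_df) - 1))
rf_fit <- randomForest(label ~ ., data = train_df,
                       ntree = 500,  # step 5: number of trees k
                       mtry  = f)    # step 3: f features considered at each split

pred <- predict(rf_fit, newdata = test_df)  # step 6: class label by majority vote
```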

Some advantages and disadvantages of Random Forest

  1. Easy to compute
  2. Can handle large datasets efficiently
  3. Missing values and outliers do not severely hamper the output

However: it can be prone to overfitting, and (in regression) it cannot predict values beyond the range of the training data.

Boosting is another approach for improving the predictions from a decision tree. The general idea is similar to bagging, but whereas in bagging each tree is built on an independent bootstrap dataset, in boosting the trees are grown sequentially: each tree is grown using information from the previously grown trees.

Boosting has three tuning parameters (an R sketch using the gbm package follows the list):

  1. The number of trees B. Unlike bagging, boosting can overfit if B is too large, so we use cross-validation to select B.
  2. The shrinkage parameter λ (lambda): a small positive number that controls the rate at which boosting learns; typical values are 0.01 or 0.001.
  3. The number of splits d in each tree, which controls the complexity of the boosted ensemble.
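A sketch of boosting with the gbm package, showing where each of the three tuning parameters appears; train_df with a numeric response y is again an assumed placeholder:

```r
library(gbm)

set.seed(1)
boost_fit <- gbm(y ~ ., data = train_df,
                 distribution = "gaussian",
                 n.trees = 5000,          # B: number of trees
                 shrinkage = 0.01,        # lambda: learning rate
                 interaction.depth = 2,   # d: splits per tree
                 cv.folds = 5)            # cross-validation used to pick B

best_B <- gbm.perf(boost_fit, method = "cv")  # number of trees chosen by CV
```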
