
Random Forest Model

Random forest is an ensemble classifier, i.e. a combining classifier that builds and combines many decision tree classifiers. The ensembling is usually done through bagging, with each tree drawing on a different feature subset. A large number of trees is used so that each feature gets the chance to contribute in a number of the component models. After the forest is generated, a majority vote over the outputs of the individual trees determines the final prediction. The result from the ensemble model is usually better than that from any of the individual decision tree models.

Working of Random Forest

The random forest algorithm works as follows:

  1. If there are N variables or features in the input data set, select a subset of 'm' features (m < N) at random from the N features. The observations or data instances should likewise be chosen at random.
  2. Use the best split principle on these 'm' features to determine how each node is split.
  3. Keep splitting the nodes into child nodes until the tree is grown to the maximum possible extent.
  4. Select a different subset of the training data 'with replacement' to train another decision tree following steps (1) to (3). Repeat this to build and train 'n' decision trees.
  5. Final class assignment is done on the basis of the majority votes from the 'n' trees.
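
A minimal sketch of steps (1) to (5) is shown below, assuming scikit-learn's DecisionTreeClassifier as the base tree and the iris data as a stand-in training set; the number of trees and the choice of max_features="sqrt" are illustrative, not prescribed by the algorithm.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
n_trees = 25          # illustrative number of trees 'n'
trees = []

for _ in range(n_trees):
    # Step 4: draw a bootstrap sample (with replacement) of the training data.
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 1-3: grow a tree to full depth, considering a random subset of
    # sqrt(N) features at each split (handled here by max_features="sqrt").
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 5: combine the trees by majority vote.
all_preds = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("Training accuracy of the ensemble:", (votes == y).mean())
```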

Out-of-bag (OOB) error in random forest

We have seen that in a random forest, each tree is built from a different bootstrap sample of the original data. The samples left out of a bootstrap, and therefore not used in constructing the i-th tree, can be used to measure that tree's performance. At the end of the run, the predictions made for each such sample by every tree that did not train on it are tallied, and the final prediction for that sample is obtained by majority vote. The overall error rate of these predictions is termed the out-of-bag (OOB) error rate.

The confusion matrix reported for the forest is computed on the out-of-bag samples, so the error rate it shows reflects the OOB error rate. For this reason, the displayed error rate often looks surprisingly high compared with the error measured on the training data itself.
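
As a brief illustration, scikit-learn's RandomForestClassifier can report this estimate directly when oob_score=True is set; the data set and the number of trees below are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each training sample using only the trees
# whose bootstrap samples did not contain it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)

print("OOB accuracy:  ", forest.oob_score_)
print("OOB error rate:", 1.0 - forest.oob_score_)
```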

Strengths of Random Forest

  • It runs efficiently on large data sets with many features.
  • It has a robust method for estimating missing data and maintains accuracy even when a large proportion of the data is missing.
  • It has effective techniques for balancing errors across classes in data sets with unbalanced class populations.
  • It gives estimates of which features are the most important in the overall classification (see the sketch after this list).
  • It generates an internal, unbiased estimate of the generalization error as the forest is built.
  • Generated forests can be saved for future use on other data.
  • Lastly, the random forest algorithm can be used to solve both classification and regression problems.
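
As a brief sketch of the feature-importance estimates mentioned above, scikit-learn's RandomForestClassifier exposes them through the feature_importances_ attribute; the iris data and the parameter values here are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ holds the impurity-based importance of each feature,
# averaged over all trees and normalised to sum to 1.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```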

Weaknesses of Random Forest

  • Because it combines a number of decision tree models, a random forest is not as easy to interpret as a single decision tree.
  • It is computationally much more expensive than a simpler model such as a single decision tree.

Application of Random Forest

Random forest is a very powerful and effective classifier that combines the versatility of many decision tree models into a single model. Because of its superior results, this ensemble method has become widely used and popular among machine learning practitioners for addressing a wide variety of classification problems.