
Feature Subset Selection

Feature selection is arguably the most important pre-processing step in a machine learning project. It aims to choose the subset of a data set's features (or attributes) that contributes most significantly to the learning task. To understand the idea behind feature selection, consider a simple real-world example. A "student weight" data set contains information used to predict a student's weight from historical data about similar students, with features such as Roll Number, Age, Height, and Weight. A roll number clearly carries no information about a student's weight, so we can remove that feature when building the feature subset for this problem. A well-chosen subset of features is expected to produce better results than the entire set.
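As a minimal sketch of this idea, the snippet below drops the uninformative Roll Number column before fitting a regressor. The column names and the tiny in-line data set are hypothetical stand-ins for the student weight data described above, not an actual data set from this article.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical "student weight" records matching the features above.
students = pd.DataFrame({
    "roll_number": [101, 102, 103, 104],
    "age":         [15, 16, 15, 17],
    "height_cm":   [160, 172, 158, 175],
    "weight_kg":   [52, 63, 50, 68],
})

# Roll number cannot help predict weight, so exclude it from the
# feature subset and keep only the informative attributes.
X = students[["age", "height_cm"]]
y = students["weight_kg"]

model = LinearRegression().fit(X, y)
new_student = pd.DataFrame({"age": [16], "height_cm": [170]})
print(model.predict(new_student))  # predicted weight for a new student
```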

Issues in High-Dimensional Data

Rapid technological advances in the digital sphere have led to an enormous increase in the amount of data generated. At the same time, advances in storage technology have reduced the cost of storing large volumes of data. Together, these trends have increased the motivation to store and mine very large, high-dimensional data sets. Two application domains in particular have grown rapidly. The first is biomedical research, where microarray data is used for gene selection; data sets in this field routinely contain a few tens of thousands of features. The second is text categorization, which handles massive amounts of text data from emails, social networking sites, and other sources.

Text data produced from various sources is also very high-dimensional. In a large corpus with a few thousand documents, the number of distinct word tokens that serve as the features of the text data set can run to a few tens of thousands, as the sketch below illustrates. Any machine learning algorithm may struggle to draw conclusions from such high-dimensional data. On the one hand, it demands a significant amount of time and computational resources. On the other hand, unneeded noise in the data causes a sharp decline in model performance for both supervised and unsupervised tasks. Furthermore, a model with a very large number of features can be very difficult to interpret. For these reasons, it is necessary to use a subset of the features rather than the entire set.
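To make the text-data point concrete, the sketch below builds a bag-of-words representation with scikit-learn's CountVectorizer, where each distinct word token becomes one feature. A tiny toy corpus stands in here for a real corpus of thousands of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; a real corpus would contain thousands of documents.
corpus = [
    "feature selection reduces dimensionality",
    "high dimensional text data needs feature selection",
    "noise in high dimensional data hurts model performance",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Every distinct token is a feature, so the feature count grows with
# the vocabulary; for large corpora it reaches tens of thousands.
print(X.shape)                             # (documents, distinct tokens)
print(vectorizer.get_feature_names_out())  # the token features themselves
```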

The objectives of feature selection, demonstrated in the sketch after this list, are:

  • Building a faster and more cost-effective learning model, i.e. one that needs fewer computational resources
  • Improving the predictive performance of the learning model
  • Gaining a better understanding of the underlying model that generated the data
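As a closing sketch, the snippet below applies univariate feature selection with scikit-learn's SelectKBest. This is just one common selection technique among many, shown on a synthetic classification data set rather than on any data set discussed in this article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 100 features, only 10 of which are informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# Keep the 10 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)      # (500, 100) -> (500, 10)
print(selector.get_support(indices=True))  # indices of the kept features
```

Training on the reduced matrix addresses all three objectives at once: fewer columns mean less computation, discarding noisy features tends to improve predictive performance, and the surviving feature indices point to the attributes that actually drive the target.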