Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python
Theory: Feature selection is one of the most important steps in machine learning. It is the process of narrowing the dataset down to a subset of features for predictive modeling without significant loss of information. Feature selection is sometimes confused with dimensionality reduction. Both methods reduce the number of features in the dataset, but in different ways. Dimensionality reduction creates new features as combinations of the existing ones: all of the original features are combined into a small number of new ones. Feature selection, on the other hand, eliminates the irrelevant features and keeps only the relevant ones.
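The sketch below illustrates the contrast (a minimal example, assuming scikit-learn; the synthetic data is only there to make it self-contained): PCA produces 4 new columns that are combinations of all 8 originals, while a feature-selection method keeps 4 of the original columns unchanged.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((100, 8))        # 100 samples, 8 features
y = rng.integers(0, 2, 100)     # binary target

# Dimensionality reduction: 4 NEW features, each a mix of all 8 originals.
X_pca = PCA(n_components=4).fit_transform(X)

# Feature selection: 4 of the ORIGINAL features, the rest discarded.
X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)  # (100, 4) (100, 4)
```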
Here are the main advantages of feature selection:
- It improves model performance: irrelevant features act as noise, which makes machine learning models perform poorly.
- It leads to faster machine learning models.
- It reduces overfitting, which improves model generalisability.
Multicollinearity: Checking for multicollinearity is an important step in the feature selection process. Multicollinearity arises when two or more features are highly correlated with one another, and it can significantly reduce the model’s performance. Removing multicollinear features both reduces the number of features and improves the model’s performance.
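A common first check is the pairwise correlation matrix. The sketch below (a minimal illustration, assuming a pandas DataFrame `df` of numeric features; the 0.9 cutoff is an arbitrary choice) lists features that could be dropped because they are strongly correlated with another feature:

```python
import numpy as np
import pandas as pd

def highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return names of features that are strongly correlated
    (|r| > threshold) with an earlier feature."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Example usage: drop one feature from each highly correlated pair.
# df = df.drop(columns=highly_correlated(df))
```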
Univariate feature selection: Univariate feature selection works by selecting the best features using univariate statistical tests such as chi-square. It examines each feature individually to determine the strength of its relationship with the response variable. SelectKBest is one such univariate method: it removes all but the specified number of highest-scoring features.
Recursive feature elimination (RFE): Unlike the univariate method, RFE starts by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again, until only the specified number of features remains. Feature importance is taken from the model’s coef_ or feature_importances_ attribute, and a small number of features is eliminated on each iteration.
Recursive feature elimination with cross-validation (RFECV): RFE requires the number of features to keep to be specified in advance, but this is usually not known beforehand. To find the optimal number of features, cross-validation is used with RFE to score different feature subsets and select the best-scoring collection of features.
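A minimal RFECV sketch (synthetic data is used only to keep the example self-contained, and logistic regression is one possible choice of base estimator):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=4, random_state=0)

# Cross-validation scores each candidate subset size automatically.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
              scoring="accuracy")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Kept features (mask):", rfecv.support_)
```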
Data Description:
The dataset corresponds to a classification task in which you need to predict whether a person has diabetes based on 8 features.
First, we will apply the chi-squared statistical test (which requires non-negative features) to select the 4 best features from the dataset.
We can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass, and age.
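A minimal sketch of this step is shown below. It assumes the dataset has been saved locally as pima-indians-diabetes.csv (the file name is an assumption) with the usual column names:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

names = ["preg", "plas", "pres", "skin", "test",
         "mass", "pedi", "age", "class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=names)
X, y = df.iloc[:, :-1], df["class"]

# chi2 requires non-negative feature values.
selector = SelectKBest(score_func=chi2, k=4)
selector.fit(X, y)

print("Scores:", selector.scores_)
print("Selected:", X.columns[selector.get_support()].tolist())
# Expect plas, test, mass, and age to score highest on this dataset.
```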
- Recursive Feature Elimination:
Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those attributes that remain.
We can see that RFE chose preg, mass, and pedi as the top 3 features. These are marked True in the support_ array and given a ranking of 1 in the ranking_ array, which indicates the strength of these features.
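A minimal RFE sketch on the same X and y as in the chi-squared example above (logistic regression is assumed as the base estimator; the linked code may use a different model):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)

print("Support:", rfe.support_)   # True for the kept features
print("Ranking:", rfe.ranking_)   # 1 marks the selected features
print("Selected:", X.columns[rfe.support_].tolist())
```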
For Code: Feature Selection using Python