Feature Engineering for Machine Learning
Feature engineering is the process of transforming raw data into features that can be used to build a predictive model with machine learning or statistical modeling. It aims to prepare an input dataset that suits the machine learning algorithm and to improve the performance of the resulting models. It can also help data scientists extract variables from data faster, which allows more variables to be considered. Automating feature engineering can help organizations and data scientists build more accurate models.
Feature engineering efforts have two main goals:
- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.
Data scientists spend 80% of their time on data preparation. (Forbes)
What are the techniques of data pre-processing and feature engineering?
Missing values are one of the most common problems you encounter when preparing data for machine learning. Datasets may, for various reasons, contain missing values or empty records, often encoded as blanks or NaN, and most machine learning algorithms cannot handle them directly. The simplest strategy is to delete the rows or columns that contain missing data, but this risks losing valuable data along with the information or patterns it carries. Even a column you consider unimportant may turn out to be explanatory for the model. If your dataset is not large, try imputing the missing values instead of deleting them: imputation is usually preferable to dropping because it preserves the size of the data. For numeric variables, missing values can be replaced with 0, as long as 0 is a sensible value in that context.
For numeric variables, filling with the mean or the median is also an option. (Statistically, the median is the better choice in most cases, because the mean of a column is sensitive to outliers while the median is more robust.)
For categorical variables, the mode is used. But what if there is no clear mode? Then try filling the values with grouping operations, for example with a statistic computed within a related group.
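The imputation options above can be sketched with pandas. This is a minimal illustration, not a recommended recipe: the DataFrame and its column names (`city`, `age`, `segment`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in a numeric and a categorical column
df = pd.DataFrame({
    "city": ["Ankara", "Ankara", "Izmir", "Izmir", "Izmir", "Ankara"],
    "age": [25.0, np.nan, 40.0, 38.0, np.nan, 90.0],
    "segment": ["A", "A", None, "B", "B", "A"],
})

# Numeric variable: fill with the median, which is less sensitive to the outlier (90)
age_median_filled = df["age"].fillna(df["age"].median())

# Categorical variable: fill with the mode (most frequent category)
segment_mode_filled = df["segment"].fillna(df["segment"].mode()[0])

# Grouping operation: fill each missing age with the median age of the same city
age_group_filled = df["age"].fillna(df.groupby("city")["age"].transform("median"))

print(age_median_filled.tolist())
print(segment_mode_filled.tolist())
print(age_group_filled.tolist())
```

The group-wise variant gives different fill values for each city, which can preserve more structure than a single global median when the groups really differ.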
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what is considered abnormal. Before abnormal observations can be singled out, normal observations have to be characterized. Outliers need close attention from analysts and data scientists; otherwise, they can result in wildly wrong estimates.
Outliers can be detected with descriptive statistics or data visualization. Visualization makes outliers easy to see, but explaining them requires descriptive statistics such as the standard deviation and percentiles. As a rule of thumb, we do not want the standard deviation to be much larger or smaller than the mean, and percentiles that are close to each other should not differ much in value: for example, there should be no big difference between the 95% and 99% values. A big gap between neighboring percentiles points to outliers.
The most sensible treatment is usually to cap outliers instead of dropping them. Capping can distort the data distribution, however, so it should not be overdone, and the upper and lower limits have to be chosen carefully. In the literature, the 25% and 75% values are commonly used to derive the lower and upper limits. The 5% and 95% values are a better choice if you do not want to make big changes to the data.
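Percentile-based capping as described above might look like the following sketch. The helper `cap_outliers` and the sample values are hypothetical; the 5% and 95% limits follow the suggestion in the text and should be tuned to the data.

```python
import pandas as pd

def cap_outliers(series, lower_q=0.05, upper_q=0.95):
    """Cap values below/above the given percentiles instead of dropping rows."""
    lower = series.quantile(lower_q)
    upper = series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)

# Hypothetical sample: 1..99 plus one extreme value
values = pd.Series([float(v) for v in range(1, 100)] + [1000.0])

# Descriptive statistics: compare neighboring percentiles and the maximum
print(values.describe(percentiles=[0.05, 0.25, 0.75, 0.95, 0.99]))

capped = cap_outliers(values)
print(capped.min(), capped.max())
```

No rows are removed, so the dataset keeps its size; only the extreme values are pulled in to the chosen limits.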
Note: Sometimes, variables alone do not represent outliers, but their intersection with different variables can create an outlier. (Local Outlier Factor)
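As a rough illustration of this note, the sketch below uses scikit-learn's `LocalOutlierFactor` on synthetic data. The data is invented: two strongly correlated variables, plus one point whose individual values are ordinary but whose combination is unusual. The choice of `n_neighbors=20` is an assumption, not a recommendation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Two strongly correlated variables: y is roughly equal to x
x = rng.normal(0, 1, 200)
y = x + rng.normal(0, 0.1, 200)
X = np.column_stack([x, y])

# Added point: 1.5 and -1.5 are each unremarkable alone,
# but the combination breaks the relationship between the variables
X = np.vstack([X, [1.5, -1.5]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 marks outliers, 1 marks inliers
scores = lof.negative_outlier_factor_  # more negative means more abnormal

print(labels[-1], scores[-1])
```

A univariate percentile check on either column would not flag the added point, which is why a multivariate method is needed here.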
Note: For some machine learning models, missing values and outliers matter much less. (Example: decision tree methods)
Data binning is the process of grouping individual data values into specific bins or groups according to defined criteria.
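A minimal binning sketch with pandas follows. The ages, bin edges, and labels are made up for illustration; `pd.cut` uses fixed edges, while `pd.qcut` builds bins that hold roughly equal numbers of observations.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 49, 62, 80])

# Fixed-edge bins chosen by the analyst (edges and labels are illustrative)
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young_adult", "adult", "senior"],
)

# Quantile-based bins: four groups with roughly equal counts, returned as integer codes
age_quartile = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({"age": ages, "group": age_group, "quartile": age_quartile}))
```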
The methods described in this section are not fixed rules. Consider a case where an essential variable is missing from the data: the required variable can perhaps be derived by processing other variables. In another case, your dataset is too small and you worry that the model cannot learn enough from it. Look at the correlations between variables and derive new variables from the highly correlated ones, for example through ratios or sums. The main motivation of binning is to make the model more robust and to prevent overfitting. Care should be taken when generating new variables, because excessive and irrelevant variables complicate the model. On the other hand, deriving too few new variables can also reduce model performance.
What to do? First, we can fit a model with the derived variables and look at their feature importance. Second, we can delete the derived variables that turn out to be unnecessary. But keep the correlation between the derived variables in mind and be careful when deleting.
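One way to check derived variables is sketched below with a random forest from scikit-learn. Everything here is synthetic and hypothetical: the columns `income` and `household_size`, the derived ratio `income_per_member`, and a target that was deliberately constructed to depend on that ratio.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500

# Hypothetical raw variables
df = pd.DataFrame({
    "income": rng.uniform(1000, 5000, n),
    "household_size": rng.integers(1, 7, n),
})

# Synthetic target that depends on the ratio of the two raw variables
target = df["income"] / df["household_size"] + rng.normal(0, 50, n)

# Derived variable: a proportional calculation on existing columns
df["income_per_member"] = df["income"] / df["household_size"]

# Fit a model and inspect the importance of raw and derived variables
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df, target)
importances = pd.Series(model.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))

# Derived variables are correlated with their sources; check before deleting anything
print(df.corr().round(2))
```

On real data the picture is rarely this clean, so importances are best read together with the correlation matrix rather than on their own.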
4- Label Encoding
A machine can only understand numbers, and categorical encoding is the process of converting categories to numbers. Label encoding is usually preferred for variables with two categories, because with more than two categories the assigned integers impose an order and a scale that a nominal variable does not have. (Nominal vs. ordinal.) If the variable is ordinal and the codes follow its order, the method can also be used with more than two categories. The technique is not limited to categorical variables: numeric variables can also be transformed into categorical ones. The codes are assigned automatically; scikit-learn's LabelEncoder, for example, sorts the unique values and numbers them starting from 0.
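A short label-encoding sketch, assuming scikit-learn's `LabelEncoder` for the two-category case and a hand-written mapping for an ordinal variable. The example values and the `size_order` mapping are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two-category variable: LabelEncoder sorts the classes and numbers them from 0
sex = pd.Series(["male", "female", "female", "male"])
encoder = LabelEncoder()
sex_encoded = encoder.fit_transform(sex)
print(list(encoder.classes_), list(sex_encoded))

# Ordinal variable with more than two categories: define the order explicitly
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = pd.Series(["medium", "small", "large"])
sizes_encoded = sizes.map(size_order)
print(list(sizes_encoded))
```

The explicit mapping is used for the ordinal case because automatic, alphabetical coding would order the sizes as large, medium, small.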
One-hot encoding converts categorical data, which is difficult for algorithms to interpret, into a numerical format and lets you represent the categories without losing any information. Every unique value of the variable becomes a new feature: the method spreads the values of a column across multiple flag columns and assigns 0 or 1 to each. (Using parameters, different numerical values can be assigned.) One-hot encoding is the process of creating dummy variables and is better suited to variables with more than two categories. The problem here is the dummy variable trap, a scenario in which the dummy variables are highly correlated with each other because the value of one can be predicted from the others. Such a dependency between independent features is called multicollinearity. To overcome it, one of the dummy variables has to be dropped. (Use the argument drop_first=True.)
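The `drop_first=True` argument mentioned above belongs to pandas' `get_dummies`. A minimal sketch with an invented `color` column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# All dummy columns: one flag column per category
all_dummies = pd.get_dummies(df, columns=["color"], dtype=int)

# drop_first=True drops the first category (here "blue") to avoid the dummy variable trap
dummies = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(all_dummies)
print(dummies)
```

The dropped category is still recoverable: a row with 0 in every remaining flag column belongs to it.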
Note: If we assign the codes of ordinal variables ourselves instead of relying on the automatic assignment, this will affect the model score.
Note: Rare categories alone carry too little information to be meaningful. We can combine these weak categories into a single, stronger one. (Rare Encoding.)
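Rare encoding can be sketched as follows. The helper `rare_encode`, the 10% threshold, and the label "Rare" are assumptions made for this example, not a standard API.

```python
import pandas as pd

def rare_encode(series, threshold=0.1, rare_label="Rare"):
    """Merge categories whose relative frequency is below the threshold."""
    freq = series.value_counts(normalize=True)
    rare_categories = freq[freq < threshold].index
    # Keep frequent categories as they are, replace rare ones with a shared label
    return series.where(~series.isin(rare_categories), rare_label)

# Hypothetical column: A and B are frequent, C and D appear once each
cities = pd.Series(["A"] * 10 + ["B"] * 8 + ["C"] + ["D"])
encoded = rare_encode(cities)

print(encoded.value_counts())
```

Applying this before one-hot encoding also keeps the number of dummy columns down.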
In statistics, standardization is the process of putting different variables on the same scale, which allows you to compare scores between different types of variables. It is a scaling technique in which the values are centered around the mean with a unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one.
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Outliers strongly affect both standardization and normalization, because they distort the mean, standard deviation, minimum, and maximum that these methods rely on. For this reason, outliers should be handled before applying them.
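Both techniques are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The sketch below applies them to a made-up column that contains one outlier (100), to show how a single extreme value squeezes the remaining values together.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(X)

# Normalization (Min-Max scaling): rescale to the range [0, 1]
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel().round(3))
print(normalized.ravel().round(3))
```

After Min-Max scaling, the four ordinary values all land very close to 0 because the outlier defines the top of the range, which is the reason to treat outliers first.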
Scaling vs. Normalization: What’s the difference?
One of the reasons that it’s easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you’re transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that, in scaling, you’re changing the range of your data, while in normalization, you’re changing the shape of the distribution of your data.
Note: At the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and compare the performance for best results.
Data Cleaning Challenge: Scale and Normalize Data