Feature Engineering Overview
Feature engineering is the process of extracting useful information from a dataset and creating the input features that machine learning models use to make predictions and classifications. It is an important step in machine learning and data analysis.
The main points of feature engineering are described below.
- Feature Selection: This is the process of selecting useful features from among the features in a dataset that contribute to a prediction. The goal of feature selection is to reduce model complexity and improve computational efficiency by eliminating redundant features.
- Feature Transformation: This is the process of transforming the features in a dataset so that the model can better capture patterns. This includes, for example, normalizing numerical data or encoding categorical features into numerical values.
- Feature Generation: This is the process of generating new features from existing features in a dataset. For example, this process includes the creation of new features by utilizing domain knowledge, such as extracting the day of the week, month, and season from date information.
- Feature Scaling: This is the process of adjusting the range of feature values. Some machine learning algorithms are sensitive to feature scaling and require this process. A common method is Standardization, which adjusts the mean of the features to 0 and the standard deviation to 1.
Each of these is discussed in detail below.
Feature Selection
<Overview>
Feature selection is one of the key steps in feature engineering: it is the process of selecting, from a given dataset, the features that usefully contribute to a prediction. The main approaches are described below.
- Filter Methods: Filter methods select features using statistical measures, typically the relationship between each feature and the target. Common measures include mutual information, the correlation coefficient, and the chi-square statistic. The advantage of this approach is that it is relatively cheap to compute and gives a clear view of the independence of and correlation between features.
- Wrapper Methods: Wrapper methods are used for feature selection with the goal of maximizing model performance. Specifically, a model is constructed using a subset of features, and the optimal combination of features is searched for through cross-validation and recursive feature elimination. The advantage of the wrapper method is that it is expected to improve the performance of the final model, but it also has the disadvantage of high computational cost.
- Embedded Methods: Embedded methods are methods in which feature selection is built into the machine learning algorithm itself. Examples include logistic regression with L1 regularization and decision tree-based algorithms, which can evaluate the importance of features and select only those features that are important.
The advantages of feature selection can be summarized as follows:
- It reduces the complexity of the model, thus reducing computational cost and memory usage.
- Model performance may be improved because selected features contribute to predictions.
- The reduced number of features may facilitate model interpretation and visualization.
However, the best approach depends on the nature of the dataset and the problem, so choosing appropriate methods and evaluation metrics matters; it is often worth trying several methods and drawing on cross-validation and domain knowledge to decide which features to keep.
We discuss these implementations below.
<Example of Filter Method Implementation>
The filter method is a method of feature selection that uses statistical relationships among features. Below is an example implementation of feature selection using the correlation coefficient, one of the filter methods.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Correlation matrix among the features
correlation_matrix = X.corr()
# Visualize the correlation matrix as a heat map
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
# Select the features whose absolute correlation with the target is at least a threshold
threshold = 0.5  # threshold setting
correlations_with_target = X.corrwith(y).abs()
selected_features = correlations_with_target[correlations_with_target >= threshold].index
# Display the selected features
print(selected_features)
In the above example, the dataset is read from the file dataset.csv and the feature matrix X and the target variable y are created. The correlation matrix among the features of X is then computed and visualized as a heat map, the features whose absolute correlation with the target variable meets or exceeds a threshold are stored in selected_features, and finally the selected features are printed.
Since the correlation coefficient only evaluates linear relationships, it may not be suitable for selecting features with non-linear relationships, and because an appropriate threshold varies from problem to problem, it is important to choose it carefully. The filter method is only one approach to feature selection and may be used in combination with other methods.
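As another filter-method example, the overview above also mentions mutual information. Below is a minimal sketch using scikit-learn's SelectKBest with mutual_info_classif, assuming the same hypothetical dataset.csv, a classification target, and an arbitrarily chosen number of features to keep.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Load the dataset
data = pd.read_csv('dataset.csv')
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Keep the k features with the highest mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(X, y)
print(X.columns[selector.get_support()])
Unlike the correlation coefficient, mutual information can also capture non-linear dependencies between a feature and the target.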
<Example of Wrapper Method Implementation>
Wrapper methods perform feature selection with the goal of maximizing model performance. The following is an example of a wrapper method using Recursive Feature Elimination (RFE); in this example, a logistic regression model serves as the estimator.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Set up a logistic regression model (max_iter raised to help convergence)
model = LogisticRegression(max_iter=1000)
# Feature selection by RFE
num_features_to_select = 5  # number of features to select
rfe = RFE(estimator=model, n_features_to_select=num_features_to_select)
rfe.fit(X, y)
# Get the index of the selected feature
feature_indices = rfe.get_support(indices=True)
# Display of selected features
selected_feature_names = X.columns[feature_indices]
print(selected_feature_names)
In the above example, the dataset is read from the file dataset.csv and the feature matrix X and the target variable y are created. A logistic regression model is then set up, feature selection is performed with RFE, and the number of features to keep is specified via n_features_to_select. Finally, the indices of the features selected by RFE are obtained and the corresponding feature names are displayed.
Since the wrapper method tries combinations of features, the computational cost may be high, so care should be taken depending on the number of features and the size of the data set.
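When the number of features to keep is not known in advance, scikit-learn's RFECV combines recursive feature elimination with cross-validation to choose it automatically. Below is a minimal sketch, assuming the same X and y as in the example above.
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
# Cross-validated recursive feature elimination (5-fold)
model = LogisticRegression(max_iter=1000)
rfecv = RFECV(estimator=model, step=1, cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)  # number of features chosen by cross-validation
print(X.columns[rfecv.get_support()])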
<Example of Embedded Method Implementation>
The embedded method builds feature selection into the machine learning algorithm itself. Below is an example implementation using ExtraTreesClassifier, a decision-tree-based ensemble whose feature importances can be used for selection.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Feature Selection with ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# Get the importance of a feature
feature_importances = model.feature_importances_
# Sorting by feature importance in descending order and selecting the top features
num_features_to_select = 5 # Number of features to select
top_feature_indices = feature_importances.argsort()[-num_features_to_select:][::-1]
selected_feature_names = X.columns[top_feature_indices]
# Display of selected features
print(selected_feature_names)
In the above example, the dataset is read from the file dataset.csv and the feature matrix X and the target variable y are created. An ExtraTreesClassifier is then fitted, the feature importances are obtained from the feature_importances_ attribute, the indices of the top features are found by sorting the importances in descending order, and the corresponding feature names are displayed.
Because the model itself performs the selection, the embedded method also provides information such as feature importances or coefficients. ExtraTreesClassifier was used in this example, but other models and other embedded methods can be used as well.
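The overview above also cites logistic regression with L1 regularization as an embedded method. Below is a minimal sketch using SelectFromModel, which keeps the features whose coefficients remain non-zero; it assumes the same X and y as above and that the features are numeric (scaling them beforehand is usually advisable).
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# L1 regularization drives the coefficients of uninformative features to zero
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
selector = SelectFromModel(l1_model)
selector.fit(X, y)
print(X.columns[selector.get_support()])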
Feature Transformation
<Overview>
Feature transformation in feature engineering is the process of transforming original features to create new representations. Feature transformation is used to improve model performance, to improve the semantics of features, and to capture nonlinear relationships. The following describes general methods of feature transformation.
- Scaling: Scaling is a method of transforming the scale of feature values. Common scaling methods include Standardization and Normalization. Standardization is a method of transforming feature values so that the mean is 0 and the standard deviation is 1. Normalization is a scaling method that uses the minimum and maximum values to keep feature values within a specific range.
- Polynomial Features: This method creates new features from a combination of features in order to capture the nonlinear relationship between features. In the Polynomial Features transformation, the order of the features is specified, and the products and powers of the features are calculated to generate new features.
- Log Transformation: This method transforms the feature values to a logarithmic scale. Log transformation narrows the range of feature values and thus has the effect of reducing the influence of outliers.
- Box-Cox Transformation: This transformation is used to make the distribution of positive-valued features closer to a normal distribution. The Box-Cox transformation estimates a parameter (lambda) that reshapes the distribution of a feature toward the normal distribution.
- Categorical Feature Encoding: This is a method for converting categorical features into numerical data. Typical encoding methods include One-Hot Encoding, Label Encoding, and Ordinal Encoding.
These implementations are described below.
<Example of scaling implementation>
This section describes an example implementation using the scikit-learn library for scaling features.
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then standardization and normalization are performed. Standardization uses the StandardScaler class to scale each feature to mean 0 and standard deviation 1; calling fit_transform yields X_scaled, the standardized feature matrix. Normalization uses the MinMaxScaler class to scale each feature to a specific range (0 to 1 by default); calling fit_transform yields X_normalized, the normalized feature matrix.
<Example of Polynomial Feature Implementation>
This section describes an example implementation using the scikit-learn library to perform polynomial feature transformation.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# polynomial feature transformation
degree = 2 # Degree of polynomial
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly.fit_transform(X)
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then a polynomial feature transformation is applied. The PolynomialFeatures class transforms the feature matrix X into polynomial form, with the degree parameter specifying the degree of the polynomial; calling fit_transform yields X_poly, the polynomially transformed feature matrix. This transformation generates new features from combinations of the original features.
Polynomial feature transformation is a useful method for capturing nonlinear relationships. However, the higher the order, the greater the number of features generated, which can lead to an explosion in the number of features.
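<Example of Log Transformation Implementation>
The overview above also lists log transformation, for which no implementation has been shown yet. Below is a minimal sketch using numpy's log1p (log(1 + x)), which assumes that the feature values are non-negative; features with negative values need an offset or a different transform (such as the Yeo-Johnson variant shown later).
import numpy as np
import pandas as pd
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Log transformation: log1p compresses large values, reducing skew and the influence of outliers
X_log = np.log1p(X)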
<Example of Box-Cox transform implementation>
This section describes an example implementation of the Box-Cox transform using the scipy library.
import pandas as pd
from scipy import stats
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Box-Cox transformation (requires strictly positive values)
transformed_features = []
for feature in X.columns:
    transformed_feature, _ = stats.boxcox(X[feature])
    transformed_features.append(transformed_feature)
X_transformed = pd.DataFrame(transformed_features, index=X.columns).T
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then a Box-Cox transformation is applied. Each feature is transformed in a loop using the boxcox function from the scipy.stats module, and the transformed values are appended to the list transformed_features. The boxcox function returns a tuple of the transformed values and the estimated transformation parameter; in this example the parameter is ignored (received as _). Finally, the transformed features are assembled into a DataFrame, X_transformed.
The Box-Cox transform is used to bring the distribution of positive-valued features closer to a normal distribution. However, since the transformed features may lose the meaning of the original features, it is important to select appropriate transformation parameters and verify the effect of the Box-Cox transform.
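The Box-Cox transform requires strictly positive values. When features contain zeros or negative values, one commonly used alternative is scikit-learn's PowerTransformer with the Yeo-Johnson method; the following minimal sketch assumes the same feature matrix X as above.
from sklearn.preprocessing import PowerTransformer
# Yeo-Johnson handles zero and negative values; a lambda parameter is estimated for each feature
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_yeojohnson = pt.fit_transform(X)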
<Encoding of categorical features>
This section describes an example implementation using the scikit-learn library for categorical feature encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# One-Hot Encoding
onehot_encoder = OneHotEncoder()
X_onehot = onehot_encoder.fit_transform(X)
# Label Encoding (applied column by column)
label_encoder = LabelEncoder()
X_label = X.apply(label_encoder.fit_transform)
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then the categorical features are encoded. One-Hot Encoding uses the OneHotEncoder class to convert the categorical features into binary vectors; calling fit_transform yields X_onehot, a (sparse) one-hot encoding of the feature matrix X.
Label Encoding uses the LabelEncoder class to encode each categorical feature as integers; applying fit_transform column by column yields X_label, a label-encoded version of the feature matrix X. Note that the resulting integer codes imply an ordering that may have no meaning for the original categories.
Which encoding method to use depends on the nature of the data and the requirements of the model. One-hot encoding produces a sparse representation of the features but can lead to an explosion in dimensionality as the number of categories grows. Label encoding keeps the dimensionality low, but the integer codes may be misinterpreted by the model as an order or magnitude. Data characteristics and model requirements should be considered when choosing an encoding method.
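Ordinal Encoding, also mentioned in the overview, is intended for categorical features that have a natural order. Below is a minimal sketch using scikit-learn's OrdinalEncoder with a hypothetical 'size' column and an explicitly specified category order.
from sklearn.preprocessing import OrdinalEncoder
# Hypothetical ordered categorical column 'size'; the category order is given explicitly
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_ordinal = ordinal_encoder.fit_transform(X[['size']])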
Feature Generation
<Overview>
Feature generation is the process of creating new features from existing features. This allows for the creation of more expressive features, improving model performance and providing insight into the problem being solved.
The following methods and ideas are utilized for feature generation:
- Arithmetic operations: New features are generated by combining existing features and performing arithmetic operations. This includes, for example, operations such as the sum, difference, product and division of two feature values.
- Polynomial features: Polynomial features are generated by increasing existing features to a specified degree. This may make it easier for the model to capture nonlinear relationships among features.
- Binning: A continuous-valued feature is divided into intervals to create new categorical features. For example, by dividing age into bins such as “teens,” “20s,” “30s,” etc., features representing age groups are generated.
- Time features: New features are generated by extracting information such as day of the week, month, season, and time of day from the features related to date and time. This allows for the capture of time patterns and trends.
- Use of domain knowledge: New features are generated by leveraging knowledge and insights from a specific domain. This enables the creation of features with business meaning, for example calculating average prices or sales per category from product features.
In feature generation, it is important to select methods appropriate to the characteristics of the data and the requirements of the problem, as excessive feature generation can lead to overfitting. Used carefully, feature generation can be a beneficial way to improve model performance and solve problems.
<Example of Arithmetic Operation Implementation>
An example implementation of feature generation using arithmetic operations is shown below. In the following example, the pandas library is used to generate new features from two existing features.
import pandas as pd
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Feature generation by arithmetic operations
X['sum'] = X['feature1'] + X['feature2'] # Generates new features by computing the sum of two features
X['difference'] = X['feature1'] - X['feature2'] # Generates new features by calculating the difference between two features
X['product'] = X['feature1'] * X['feature2'] # Generates new features by computing the product of two features
X['ratio'] = X['feature1'] / X['feature2'] # Generate new features by calculating the ratio of two features
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the objective variable y are created, and then new features are generated using arithmetic operations.
As an example, new features are generated from two existing features, feature1 and feature2: the + operator computes their sum, the - operator their difference, the * operator their product, and the / operator their ratio. Each result is added to X as a new feature.
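Note that the ratio feature produces infinite values when feature2 contains zeros. One hedged way to guard against this is to replace zeros in the denominator before dividing, as in the sketch below.
import numpy as np
# Replace zeros in the denominator with NaN so the ratio stays well defined
X['ratio'] = X['feature1'] / X['feature2'].replace(0, np.nan)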
<Example of polynomial feature implementation>
An example implementation for generating polynomial features is shown below. In the following example, the PolynomialFeatures class of the sklearn.preprocessing module is used to generate polynomial features.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# polynomial feature generation
degree = 2 # Degree of polynomial
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly_features.fit_transform(X)
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then polynomial features are generated with the PolynomialFeatures class. The degree parameter specifies the degree of the polynomial; in this example a degree-2 expansion is used. This transformation generates polynomial features, including interaction terms and higher-order terms, from the original features.
Polynomial features are used to represent nonlinear relationships, but as the degree increases the number of feature combinations grows, which can cause an explosive increase in dimensionality. It is therefore important to select an appropriate degree and to use regularization to prevent the model from overfitting. In addition, feature scaling and other preprocessing may be necessary when generating polynomial features, as in the sketch below.
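As a sketch of the regularization mentioned above, the following combines polynomial expansion, scaling, and an L2-regularized regressor in a scikit-learn pipeline. It assumes the same X and y as above and that the target variable is numeric; for a classification target a regularized classifier would be used instead.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
# Polynomial expansion followed by scaling and ridge (L2-regularized) regression
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)
)
model.fit(X, y)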
<Example of Binning Implementation>
An example implementation for binning features using Binning is shown below. In the following example, features are binned using the pandas library.
import pandas as pd
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split into features and target variable
X = data.drop('target', axis=1)  # features
y = data['target']  # target variable
# Set bin boundary values
bin_edges = [0, 10, 20, 30, 40] # Boundary value of bin
# Feature generation by binning
X_binned = pd.cut(X['feature1'], bins=bin_edges, labels=False, include_lowest=True)
In the above example, the dataset is read from the file dataset.csv, the feature matrix X and the target variable y are created, and then the bin boundary values are set; these boundaries determine how the feature is divided into bins. The pd.cut function then bins the feature feature1: the bins parameter specifies the boundary values, labels=False returns the bin index (an integer starting from 0), and include_lowest=True ensures that the lowest boundary value is included in the first bin.
Binning thus divides a continuous-valued feature into categorical bins. This reduces the amount of information in the feature, which can reduce the computational complexity of the model. The trade-off is that narrow bins retain detailed information but increase the risk of overfitting, while overly wide bins may discard too much information. Setting appropriate bins therefore requires considering the data distribution and domain knowledge.
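When suitable boundary values are not known in advance, quantile-based binning is one alternative: pandas' qcut chooses the boundaries so that each bin contains roughly the same number of samples. A minimal sketch, assuming the same feature matrix X as above, follows.
import pandas as pd
# Quantile-based binning: four bins of roughly equal frequency (duplicate edges are dropped)
X_qbinned = pd.qcut(X['feature1'], q=4, labels=False, duplicates='drop')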
<Example of Temporal Feature Implementation>
The implementation of temporal features in feature generation depends on the specific context, but the following steps can be considered as a general approach.
- Data Preparation: Acquire or generate time-related data. For example, this could be sensor data or time series data.
- Extraction of temporal features: Extract useful features from time information. Some common examples of time features are
- Day of the week: Features indicating the day of the week (Monday, Tuesday, etc.) can be extracted. You can use a function to convert date data to days of the week, or you can use One-Hot encoding.
- Time of day: Features indicating time of day (morning, afternoon, evening, night, etc.) can be extracted. Conditional branching and binning (creating intervals for each time period) can be performed based on time data.
- Month: Features representing the month can also be extracted. Functions can be used to extract months from date data or One-Hot encoding can be used.
- Holidays: Features representing the presence or absence of holidays or special events may also be extracted. This can be determined by using calendar data or a list of holidays.
- Feature merging: Combine the extracted time features with the original data. Usually, the feature vectors of the original data and the temporal features are combined to create a new feature vector.
- Model training: The model is trained using the data containing the features. Since the temporal features are included, the model can learn patterns related to time.
An example implementation using Python and the pandas library is shown below.
import pandas as pd
# Loading Data
data = pd.read_csv('data.csv')
# Convert date columns to date type
data['date'] = pd.to_datetime(data['date'])
# Extract features representing the day of the week
data['weekday'] = data['date'].dt.weekday
# Extract features representing time of day
data['hour'] = data['date'].dt.hour
data['time_of_day'] = pd.cut(data['hour'], bins=[0, 6, 12, 18, 24], labels=['Night', 'Morning', 'Afternoon', 'Evening'], right=False)  # right=False so hour 0 falls into the first bin
# Extract features representing the month
data['month'] = data['date'].dt.month
# Extracting holiday features
holidays = ['2023-01-01', '2023-12-25'] # List of holidays (example)
data['is_holiday'] = data['date'].dt.normalize().isin(pd.to_datetime(holidays))  # normalize() drops the time component before comparing
# Feature Combination
features = data[['weekday', 'time_of_day', 'month', 'is_holiday', 'other_features']]
# Learning models using feature vectors
# ...
In the above example, the data frame named data contains a date column, from which time features such as weekday, time_of_day, month, and is_holiday are extracted; finally, the required features are selected for model training.
<Example of Implementation of Use of Domain Knowledge>
The use of domain knowledge in feature generation can be realized by creating unique features for specific domains or tasks. An example implementation of the use of domain knowledge in feature generation is shown below.
- Domain Knowledge Acquisition: Acquire knowledge about a specific domain. This can be done by collaborating with domain experts or by researching domain literature and data.
- Feature extraction for a specific domain: Use acquired domain knowledge to extract domain-specific features. Some examples are
- Keyword frequency: The frequency of occurrence of domain-specific keywords can be extracted. Keyword frequencies are calculated using natural language processing techniques on textual data.
- Categorical Features: Features related to specific categories can be extracted. This could be, for example, features that represent product categories or features related to user attributes.
- Graph-based features: Features can be extracted from graph data representing relationships between entities in the domain, for example node centrality or clustering coefficients computed with graph theory and network analysis techniques.
- Domain feature merging: Combine the extracted domain features with the original data. Usually, the feature vectors of the original data and domain features are combined to create a new feature vector.
- Train the model: train the model using the data containing the domain features. The use of domain knowledge allows the model to learn more domain-appropriate features.
An example implementation using Python and the pandas library is shown below.
import pandas as pd
# Loading Data
data = pd.read_csv('data.csv')
# Feature extraction using domain knowledge
# Extract frequency of occurrence of keywords
data['keyword_frequency'] = data['text'].apply(lambda x: calculate_keyword_frequency(x, domain_keywords))
# Extraction of category features
data['category_feature'] = data['category'].apply(lambda x: map_category_to_feature(x, domain_category_mapping))
# Graph-based feature extraction
data['node_centrality'] = data['node_id'].apply(lambda x: calculate_node_centrality(x, domain_graph))
# Feature Combination
features = data[['keyword_frequency', 'category_feature', 'node_centrality', 'other_features']]
# Learning models using feature vectors
# ...
In the above example, the data frame called data contains text data, categorical data, and network-related data. Domain knowledge is used to extract the keyword frequency, category features, and the centrality of nodes in the network, and finally the necessary features are selected for model training.
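The helper functions used above (calculate_keyword_frequency, map_category_to_feature, calculate_node_centrality) and the objects domain_keywords, domain_category_mapping, and domain_graph are hypothetical, domain-specific placeholders. As one way the centrality helper could be realized, below is a minimal sketch using the networkx library, assuming domain_graph is a networkx Graph whose nodes correspond to the node_id values.
import networkx as nx
def calculate_node_centrality(node_id, graph):
    # Degree centrality of a single node (0.0 for nodes not present in the graph)
    return nx.degree_centrality(graph).get(node_id, 0.0)
Clustering coefficients could be computed analogously with nx.clustering(graph).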
Feature Scaling
<Overview>
Feature Scaling in feature engineering is the process of equalizing the scale or range of different features. Feature scaling is one of the key methods to improve the performance of machine learning algorithms. The purpose of feature scaling and typical methods are described below.
Purpose:
- Eliminate model bias caused by differences in the scales of the features.
- To improve the stability of algorithms affected by feature scaling, such as gradient descent and distance-based algorithms.
- Help satisfy the assumptions of some statistical models, for example when a transformation brings the feature distribution closer to normal.
Typical Methods:
- Standardization:
  - Transforms each feature so that its mean is 0 and its standard deviation is 1.
  - x' = (x - mean(x)) / std(x)
  - Expresses how far each value lies from the mean, measured in standard deviations.
- Normalization (Min-Max Scaling):
  - Scales each feature to the range 0 to 1 (or another arbitrary range).
  - x' = (x - min(x)) / (max(x) - min(x))
  - Scales the values using the feature's minimum and maximum.
- Log Transformation:
  - Applies a logarithmic transformation to the feature values.
  - Compresses large values, which reduces skewness in the data distribution and the influence of outliers.
Below is an example implementation of standardization and normalization using Python and the scikit-learn library.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# 'features' is assumed to be an existing feature matrix (e.g. a DataFrame or numpy array)
# Standardization
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Normalization
normalizer = MinMaxScaler()
normalized_features = normalizer.fit_transform(features)
In the above example, features represents the feature matrix, which is standardized using the StandardScaler class and normalized using the MinMaxScaler class. Here, the features are transformed using the fit_transform method.