Feature Engineering
You can give an outstanding algorithm terrible features and get a terrible model. You can give a mediocre algorithm excellent features and get a surprisingly good one. Feature engineering — the process of selecting, transforming, and constructing the inputs a model receives — is one of the most impactful and most underappreciated stages of the ML pipeline. In the era before deep learning automated much of this work for images and text, feature engineering was considered the primary craft skill of machine learning practitioners. Even today, for tabular and structured data — the most common type in industry — thoughtful feature engineering remains decisive.
What Is a Feature?
A feature is a single measurable input variable that a model uses to make a prediction. In a tabular dataset, each column is a candidate feature. The model never sees the raw real-world object; it sees only the features you provide. Features must be numerical, or convertible to numerical form, because models compute with numbers. A raw column like 'city of residence' is a string; it must be encoded before a model can use it.
A common encoding for categorical variables is one-hot encoding: create one binary column for each possible category. If 'city' has five possible values (Austin, Boston, Chicago, Denver, and Eugene), you replace the single 'city' column with five binary columns, one per city, each taking the value 1 if the row's city is that city and 0 otherwise. Exactly one of the five columns is 1 for any given row.
Date columns are another example. A raw date like '2024-03-15' is not directly useful to most models. But decomposing it into 'day of week' (0-6), 'month' (1-12), 'is weekend' (0/1), and 'days since account creation' extracts meaningful structure that the model can actually learn from.
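The two encodings described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production encoder: the function names `one_hot_city` and `date_features`, the output column names, and the account-creation date are all assumptions made for the example.

```python
from datetime import date

# The five categories from the text. A real pipeline would learn this list
# from the training data rather than hard-coding it.
CITIES = ["Austin", "Boston", "Chicago", "Denver", "Eugene"]


def one_hot_city(city):
    """Replace a single city string with one binary column per known city."""
    if city not in CITIES:
        raise ValueError(f"unknown city: {city!r}")
    return {f"city_{name}": int(city == name) for name in CITIES}


def date_features(event_date, account_created):
    """Decompose a raw date into the numeric features named in the text."""
    weekday = event_date.weekday()  # Monday=0 ... Sunday=6
    return {
        "day_of_week": weekday,
        "month": event_date.month,
        "is_weekend": int(weekday >= 5),
        "days_since_account_creation": (event_date - account_created).days,
    }


encoded_city = one_hot_city("Chicago")
# The account-creation date below is a made-up value for illustration.
encoded_date = date_features(date(2024, 3, 15), account_created=date(2024, 1, 1))
print(encoded_city)
print(encoded_date)
```

Note that 'days since account creation' needs a second input, the account-creation date, which is why the function takes two arguments.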
There is no single correct way to represent a raw input as features. Every representation encodes assumptions about what matters. A date-of-birth column can become raw age in years, a decile bucket, or a set of binary flags for different life stages. Each representation lets the model see a different structure in the data. The best representation depends on what pattern you believe the model needs to detect.
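As a rough sketch of how one raw column can yield several representations, the snippet below turns a date of birth into an age, a bucket, and life-stage flags. The bucket here is a fixed ten-year band, used as a simple stand-in for whatever bucketing scheme you choose, and the life-stage thresholds (18 and 65) are assumptions picked for the example.

```python
from datetime import date


def age_in_years(date_of_birth, reference_date):
    """Whole years elapsed between the date of birth and the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference_date.year - date_of_birth.year - (0 if had_birthday else 1)


def age_representations(date_of_birth, reference_date):
    """Three alternative views of the same raw date-of-birth value."""
    age = age_in_years(date_of_birth, reference_date)
    return {
        "age_years": age,                       # raw numeric age
        "age_band_start": (age // 10) * 10,     # fixed 10-year band (assumption)
        "is_minor": int(age < 18),              # hypothetical life-stage flags
        "is_working_age": int(18 <= age < 65),
        "is_retirement_age": int(age >= 65),
    }


example = age_representations(date(1990, 6, 15), date(2024, 3, 15))
print(example)
```

Which of these columns to keep depends on the pattern you expect: a smooth trend favors the raw age, while a sharp change at a threshold favors the flags.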
Beyond encoding, feature engineering includes constructing new features from combinations of existing ones. Consider a dataset where each row represents a loan application, with columns for requested amount and applicant annual income. Neither column alone captures the key relationship: how large the loan is relative to what the applicant earns. The ratio of requested amount to annual income, usually called the loan-to-income ratio and closely related to the debt-to-income ratio, is a single engineered feature that combines both columns into a more predictive signal. Credit analysts have used ratios like this for decades; encoding one as a feature gives the model direct access to that domain knowledge.
Another example: a model predicting electricity demand might have a temperature column. But the relationship between temperature and demand is not linear: demand spikes in both very hot and very cold weather. Adding a 'temperature squared' feature alongside the original temperature column allows a linear model to capture this U-shaped relationship, which it otherwise could not represent.
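Both constructed features above can be illustrated with a small sketch. The demand curve is synthetic (a made-up parabola with its minimum at 18 degrees), the helper names are hypothetical, and the least-squares solver is a bare-bones normal-equations routine written for this example rather than a library call.

```python
def amount_to_income_ratio(requested_amount, annual_income):
    """Requested loan amount divided by annual income; None if income is unusable."""
    if annual_income is None or annual_income <= 0:
        return None  # treat as missing rather than dividing by zero
    return requested_amount / annual_income


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(n):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [x - factor * p for x, p in zip(a[k], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def max_abs_error(rows, targets, weights):
    """Largest absolute gap between the linear model's output and the target."""
    return max(
        abs(sum(w * x for w, x in zip(weights, r)) - y)
        for r, y in zip(rows, targets)
    )


# Synthetic U-shaped demand: lowest at 18 degrees, rising on both sides.
temps = [-10, -5, 0, 5, 10, 15, 20, 25, 30, 35]
demand = [0.5 * (t - 18) ** 2 + 100 for t in temps]

linear_rows = [[1.0, t] for t in temps]            # intercept + temperature
quadratic_rows = [[1.0, t, t * t] for t in temps]  # ... + temperature squared

linear_weights = fit_least_squares(linear_rows, demand)
quadratic_weights = fit_least_squares(quadratic_rows, demand)
linear_error = max_abs_error(linear_rows, demand, linear_weights)
quadratic_error = max_abs_error(quadratic_rows, demand, quadratic_weights)

print("ratio feature:", amount_to_income_ratio(20000, 80000))
print("worst error, temperature only:", round(linear_error, 2))
print("worst error, with temperature squared:", round(quadratic_error, 6))
```

The model is still linear in its weights in both cases; only the inputs changed. With the squared column present, the same fitting procedure can represent the U shape, while the temperature-only fit misses badly at the extremes.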
Feature Selection
More features are not always better. Including irrelevant or redundant features adds noise, increases training time, and can hurt model performance by giving the model spurious patterns to overfit. Feature selection is the process of choosing which candidate features to include.
The simplest approach is domain expertise: ask people who understand the problem which variables are actually related to the outcome. A domain expert can rule out features that are correlated with the target only by accident in the historical data.
Statistical approaches measure the association between each feature and the target variable. Features with near-zero correlation to the target across the training data are candidates for removal. Features that are highly correlated with each other are candidates for consolidation: keeping both adds little information while increasing complexity.
A third approach is model-based selection: train a model that produces an importance score for each feature (decision trees and gradient boosting models do this naturally) and drop the features scored as least important. This method captures interactions that simple correlation misses, but it requires training a model first.
The curse of dimensionality makes feature selection especially important at high feature counts: as the number of features grows, the amount of data needed to cover the input space can grow exponentially. A model with 100 features can need far more than 100 times the training data of a model with 1 feature, because the space of possible input combinations becomes vast.
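The statistical approach can be sketched as a two-step filter: drop features with near-zero correlation to the target, then drop any feature that is nearly a duplicate of one already kept. The thresholds (0.1 and 0.95), the function names, and the toy columns are assumptions for illustration; sensible cutoffs depend on the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def select_features(columns, target, min_target_corr=0.1, max_pair_corr=0.95):
    """Keep features related to the target and not redundant with each other."""
    # Step 1: discard features with near-zero correlation to the target.
    relevant = [
        name for name, values in columns.items()
        if abs(pearson(values, target)) >= min_target_corr
    ]
    # Step 2: visit the strongest features first, and skip any feature that is
    # highly correlated with one that has already been selected.
    relevant.sort(key=lambda name: abs(pearson(columns[name], target)), reverse=True)
    selected = []
    for name in relevant:
        if all(abs(pearson(columns[name], columns[kept])) < max_pair_corr
               for kept in selected):
            selected.append(name)
    return selected


# Toy data: "measurement_b" is almost a rescaled copy of "measurement_a",
# and "noise" has almost no linear relationship with the target.
target = [10, 20, 30, 40, 50, 60]
columns = {
    "measurement_a": [1, 2, 3, 4, 5, 6],
    "measurement_b": [2, 4, 6, 8, 10, 13],
    "noise": [3, -3, 1, -1, -3, 3],
}
selected = select_features(columns, target)
print(selected)
```

Pearson correlation only measures linear association, so a filter like this is a first pass and not a substitute for domain review or model-based importance scores.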
Target leakage occurs when you include a feature that contains information about the target that would not be available at prediction time. Example: if you are predicting whether a customer will cancel their subscription, and you include 'number of cancellation-request emails sent this month' as a feature, you are encoding the outcome directly into the input. The model achieves high accuracy in training and validation but is useless in production, because at the moment you need the prediction the customer has not yet sent those emails and the value is not known.
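One common guard against leakage is to compute every feature only from records dated strictly before the moment the prediction is made. The sketch below contrasts a leaky whole-month count with a point-in-time count; the function names, the 30-day window, and the email dates are invented for the example.

```python
from datetime import date, timedelta


def emails_in_month(email_dates, year, month):
    """Leaky version: counts every email in the month, including future ones."""
    return sum(1 for d in email_dates if d.year == year and d.month == month)


def emails_before_cutoff(email_dates, prediction_date, window_days=30):
    """Point-in-time version: counts only emails sent before the prediction."""
    window_start = prediction_date - timedelta(days=window_days)
    return sum(1 for d in email_dates if window_start <= d < prediction_date)


# Hypothetical history for one customer; the prediction is made on March 15.
email_dates = [date(2024, 3, 2), date(2024, 3, 10), date(2024, 3, 20)]
prediction_date = date(2024, 3, 15)

leaky_count = emails_in_month(email_dates, 2024, 3)
safe_count = emails_before_cutoff(email_dates, prediction_date)
print("whole-month count (uses a future email):", leaky_count)
print("count available on the prediction date:", safe_count)
```

A date cutoff handles features that peek into the future, but it does not replace judgment: a feature can still be a near-copy of the outcome even when it is technically recorded beforehand.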
Prompt Challenge
Write a prompt asking an AI assistant to help you brainstorm engineered features for a specific prediction problem.
Your prompt should…
- Describe the prediction goal and what outcome you want to predict
- Mention the raw data columns that are available as starting material
- Ask for creative feature combinations or transformations that might improve prediction
A model predicts house prices. The dataset includes 'number of rooms' and 'total square footage.' An engineer adds the feature 'square footage per room.' What is the best justification for this addition?
What is target leakage, and why is it dangerous?
Engineer Features for a Real Problem
- You are building a model to predict whether a social media post will go viral (defined as exceeding 10,000 shares within 24 hours).
- You have the following raw columns: post text, post timestamp, author account age (in days), author follower count, number of hashtags, presence of an image (yes/no), and historical average shares per post for this author.
- Step 1: List at least four engineered features you would create from these raw columns. For each, write the formula or transformation and one sentence explaining the predictive rationale.
- Step 2: Identify two features from the raw list that might be redundant with each other and explain why.
- Step 3: Identify one feature that might constitute target leakage if added carelessly — something you would be tempted to include but should not — and explain why it would be unavailable at prediction time.
- Step 4: Rank your engineered features by your intuition of their predictive importance and justify your top choice.