What is Feature Engineering?

(upbeat electronic music) (logo whooshing)

Hi. Feature engineering is the process of taking unrefined raw data and converting it into meaningful features that your model can understand better, so that it can learn a better decision boundary.

To give an example, suppose you are a retailer and you want to target customers who you think will not do business with you again. You have your customer information and your customer transaction information, and you take this raw data and create features from it. One feature is the recency of the customer's visit: today's date minus the date of the customer's last purchase. Another is the frequency of purchases: the number of times the customer purchased from you in the last 7 days, the last 14 days, the last 30 days, and the last 60 days. By feeding such recency and frequency features to the model, you may help it better understand whether the customer will come back and purchase from you or not. That is what transforming your raw data into meaningful insight means.

Now, coming back to feature engineering: features can arise out of your domain understanding. You might have some understanding
of your business already, and you want to incorporate that understanding into your model as features. Features can also come from your exploratory data analysis phase: when you take the raw data and explore it, you may find an insight that you can convert into a feature.

Sometimes features come from an external data provider. For example, suppose you are trying to predict whether a particular customer will default. You have the customer's information, and you may want to use an external third-party provider who can give you more information about the customer, such as the customer's external delinquency rate or whether the customer has filed for bankruptcy.

There are two steps in
feature engineering. The first step is more algorithm-specific; you can also call it the data pre-processing step. Most algorithms expect the data to be in a suitable form in order to work correctly and efficiently. Some models are more sensitive to outliers. Some models work better if the data is scaled; for example, gradient descent converges faster on scaled data. Most algorithms require categorical values to be numerically encoded, and some specifically require the encoded categories to be one-hot encoded, so that no order in
the data is implied.

To give an example, take the scenario in the chart, which shows data points with an outlier. Because of the outlier, the regression line is distorted: its slope is pulled toward the outlier, and the model has a high residual error. After outlier treatment, once the outliers are removed, the line fits the data better. So if your model is sensitive to outliers, you may want to do an outlier treatment.

The second part is feature engineering
from domain understanding. These can be features that represent time aggregates or events. They can capture your customer's behavior pattern, or the customer journey that led them to your business. They can be a count, frequency, or ratio of the particular entity you are trying to model. They can also come from bucketing your data in such a way that a nonlinear relationship can be made linear, so that your model can understand the data better.

There are countless scenarios in which you can do feature engineering, and there is no predefined or best way. It all depends on your creativity and curiosity when you start analyzing your data and try to implement your ideas.

The first benefit of feature engineering is that you can keep your model
as simple as possible. Even a simple algorithm with the right set of engineered features can give a pretty high lift in performance. The second benefit is better explainability: since you know what features you have created, you can build a more explainable model. The third is that you can remove unwanted bias from the model. For example, some scenarios may be underrepresented in the data; you can handle that through feature engineering so that your model's outcome is not biased.

In the next set of videos, I'll be talking about some scenarios of feature engineering. As I said, feature engineering really comes from experience and the data in hand; there is no single right way of doing it. It all depends on your exploratory data analysis phase and how you want to model a particular feature so that your model can arrive at a better outcome. I'll be talking about some of those scenarios, so stay tuned. Thank you.
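
As a rough illustration of the recency and frequency features from the retailer example in the talk, here is a minimal Python sketch. The sample purchase dates, the fixed reference date, and the function name `recency_frequency` are assumptions made for illustration and do not come from the video. The sketch also assumes the customer has at least one purchase.

```python
from datetime import date, timedelta

def recency_frequency(purchase_dates, today, windows=(7, 14, 30, 60)):
    """Build recency and windowed frequency features for one customer.

    purchase_dates: non-empty list of datetime.date (hypothetical raw data).
    today: reference date standing in for "today's date".
    """
    features = {}
    # Recency: today's date minus the date of the last purchase, in days.
    features["recency_days"] = (today - max(purchase_dates)).days
    # Frequency: number of purchases inside each trailing window.
    for w in windows:
        cutoff = today - timedelta(days=w)
        features[f"purchases_last_{w}d"] = sum(
            1 for d in purchase_dates if cutoff < d <= today
        )
    return features

# Hypothetical transaction dates for a single customer.
today = date(2024, 3, 31)
purchases = [
    date(2024, 3, 28),
    date(2024, 3, 20),
    date(2024, 3, 5),
    date(2024, 2, 10),
]
features = recency_frequency(purchases, today)
print(features)
```

Each customer's raw transaction history collapses into one row of numbers, which is the form most models expect. The window lengths mirror the 7, 14, 30, and 60 day windows mentioned in the talk; other windows may suit other businesses better.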

About the Author: Oren Garnes


  1. Hi sir, I would like to hear your opinion on the approach I usually take when dealing with a machine learning problem outside my domain.
    1) Read and understand the problem thoroughly.
    2) Data preprocessing (null values, one-hot encoding, etc.).
    3) Research which factors may help the model predict better, since the problem is not in my domain.
    4) Try to implement those features using the variables I currently have.
    5) Create new features using exploratory data analysis.
    6) Use feature tools to create some more features (many will be correlated, but still).
    7) Use an AutoML framework to help me find the best pipeline for my problem.
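
Step 2 in the list above (handling null values and one-hot encoding) could be sketched roughly as follows. This is a minimal standard-library example, not the commenter's actual code. The column values and the helper names `impute_median` and `one_hot` are hypothetical, and median imputation is only one of several reasonable ways to fill missing values.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def one_hot(values):
    """One-hot encode a categorical column.

    Returns the sorted category names and one 0/1 row per input value,
    so no numeric order is implied between categories.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical raw columns: one numeric with a gap, one categorical.
ages = [25, None, 40, 31]
segments = ["gold", "silver", "gold", "bronze"]

ages_filled = impute_median(ages)
categories, segments_encoded = one_hot(segments)

print(ages_filled)
print(categories)
print(segments_encoded)
```

In practice a library encoder would usually replace these helpers; the point here is only to show what the two transformations do to the data.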
