Regression -- the simplest form of machine learning. Speaking of simple things, oftentimes the simplest solution is the best solution. Coming back to this post, this sample case study revolves around the problem of predicting an engagement: whether a user will click on an advertisement.
Now there are many things that can affect a user's decision to engage with an advertisement. So, which of these variables do you pick when doing an analysis?
Data Prep
Feature Selection
A common way to simplify the problem is to group the variables into a few broader feature categories. For example, you could perform two different analyses: one on the effect of the user's available demographic information and another on the effect of the user's past browsing history. Even within these analyses, it can be hard to decide how many sub-variables to keep. This is a common problem in data science, known as feature selection. One way to select only the significant features is to run a simple random forest model on all variables. Many random forest implementations available online (such as sklearn's) expose an attribute called feature_importances_ that gives a numeric score to each feature. You can then select the most significant features to perform your analysis on.
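As a hedged sketch of that approach, assuming the engagement data lives in a pandas DataFrame df with a binary engaged label column (the names here are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed: df holds one row per ad impression, with a binary "engaged" label.
X = df.drop(columns=["engaged"])
y = df["engaged"]

# Fit a simple random forest just to score the features.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Rank features by importance and keep the top ones for the real analysis.
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(10)
print(top_features)
```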
Data Balancing
Anyone who has worked with data in a commercial setting knows that real data can be very, very noisy. And in the online world, engagements are a rare event. This is a case of unbalanced data, which does not make a good dataset for modeling. One rule of thumb I use when dealing with binary categorical data is that the common-event data size (i.e., the number of non-clicks in my data set) should not be greater than 3x the rare-event data size. Hence, if my dataset size is 1 million, I need at least 1/4th of the data to be of the rare-event category for good balance.
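One simple way to reach that ratio is to downsample the majority class. A sketch, again assuming the hypothetical df with a binary engaged column:

```python
import pandas as pd

# Assumed: "engaged" == 1 is the rare event (a click), 0 is a non-click.
clicks = df[df["engaged"] == 1]
non_clicks = df[df["engaged"] == 0]

# Keep at most 3 non-clicks per click, per the 3:1 rule of thumb above.
n_keep = min(len(non_clicks), 3 * len(clicks))
non_clicks_sampled = non_clicks.sample(n=n_keep, random_state=42)

# Recombine and shuffle the rows.
balanced_df = pd.concat([clicks, non_clicks_sampled]).sample(frac=1, random_state=42)
```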
Training Data Size
A good rule of thumb is to use 80% of your dataset for training the model and 20% for testing. It's a good idea to shuffle the dataset before splitting it into train/test sets, in case the data is arranged in a particular order.
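sklearn's train_test_split handles both the shuffle and the 80/20 split; a minimal sketch using the balanced data from above:

```python
from sklearn.model_selection import train_test_split

X = balanced_df.drop(columns=["engaged"])
y = balanced_df["engaged"]

# shuffle=True is the default, but it is spelled out here for emphasis.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```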
Picking the right model
Knowing the underlying probability
After preparing the data set and ensuring that the right features/variables are used, it is time to pick the right model. One thing that always helps is knowing the underlying probability distribution of the data. In the case of online engagement, the underlying distribution is a binomial distribution, with parameters defined as follows:
n = number of message views
p = underlying probability for a view to convert into an engagement (extra info: this p follows a beta distribution)
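For concreteness, under these assumptions the number of engagements k out of n views follows the standard binomial probability mass function:

```latex
% Probability of k engagements out of n views, with per-view engagement rate p
P(K = k \mid n, p) = \binom{n}{k} \, p^{k} \, (1 - p)^{n - k}
```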
In this case the output variable is a binary categorical one: engagement/non-engagement and the features are all numerical. Hence, it made sense to use logistic regression which, given numerical variables, outputs the probability of a success.
In one particular case, a preliminary use of statsmodels' GLM (with a binomial family and logit link) gave me a 62% accuracy rate (measured on the aforementioned 20% test set; 80% of the dataset was used to train the model). In this context, an accuracy or AUC score above 0.6 is considered good enough.
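A minimal sketch of that preliminary fit, assuming the X_train/X_test split from above (statsmodels' Binomial family uses the logit link by default):

```python
import statsmodels.api as sm

# Add an intercept column, since GLM does not add one automatically.
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

# Binomial family with the default logit link == logistic regression.
glm = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
result = glm.fit()

# Predicted probabilities of engagement for the held-out 20%.
pred_probs = result.predict(X_test_sm)
```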
When talking about logistic regression, it is important to talk about the threshold value. Logistic regression outputs the probability that a given case will result in a success (an engagement). While testing, I needed to determine the proper probability threshold for calling a case a success. It might seem intuitive to say that if the outputted probability is above 0.5, the case can be categorized as a success/engagement. However, that does not always give the best results. Trial and error is the best way to go, and in my case, a threshold of 0.45 gave me the highest accuracy.
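The trial and error can be a simple sweep over candidate thresholds. A sketch, reusing pred_probs from the GLM fit above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Try a range of thresholds and keep the one with the best test accuracy.
best_threshold, best_accuracy = 0.5, 0.0
for threshold in np.arange(0.30, 0.71, 0.05):
    preds = (pred_probs >= threshold).astype(int)
    acc = accuracy_score(y_test, preds)
    if acc > best_accuracy:
        best_threshold, best_accuracy = threshold, acc

print(best_threshold, best_accuracy)
```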
With a little help from my friends: Literature
In most cases, the problem at hand has already been studied by other companies, universities, and research organizations. Becoming aware of the advancements in the problem space is definitely helpful. In the above case, after getting an AUC score of 0.62, I was ready to end my analysis. However, I came across this paper from Facebook that suggests a hybrid model combining gradient boosted decision trees with logistic regression. To elaborate, Facebook suggests transforming the features using decision trees and then feeding the transformed features into a logistic regression model. The paper includes a good visualization of the hybrid model.
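A hedged sketch of the same idea using sklearn (an illustration of tree-based feature transformation feeding a logistic regression, not the paper's exact implementation):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Step 1: fit boosted trees and use the leaf index each sample lands in
# (one index per tree) as a new categorical feature.
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=6)
gbdt.fit(X_train, y_train)
train_leaves = gbdt.apply(X_train)[:, :, 0]  # shape: (n_samples, n_trees)
test_leaves = gbdt.apply(X_test)[:, :, 0]

# Step 2: one-hot encode the leaf indices and fit logistic regression on them.
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(train_leaves), y_train)
hybrid_probs = lr.predict_proba(encoder.transform(test_leaves))[:, 1]
```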
A good approximate implementation of this hybrid model is xgboost. Using this model, my AUC score/accuracy improved to 0.7 (70%).
Tuning
Just picking the right model does not usually do the trick. For any model, it's important that the parameters that determine how the model trains on the training data you have provided are well-tuned. For this, one needs to understand what the parameters mean and how they affect model training. In the case of xgboost, the following were some of the values I used for the parameters (a usage sketch follows the list):
max_depth (Maximum depth of trees): 6
eta (the learning rate): 0.5
objective: binary:logistic
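As a minimal sketch of training with those values, using xgboost's native API and the train/test split from earlier:

```python
import xgboost as xgb

# Parameters mirror the values listed above.
params = {
    "max_depth": 6,
    "eta": 0.5,
    "objective": "binary:logistic",
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train for a fixed number of boosting rounds; with binary:logistic,
# predict() returns engagement probabilities for the test set.
booster = xgb.train(params, dtrain, num_boost_round=100)
pred_probs = booster.predict(dtest)
```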
~~~~~~~~~~~~~~~~~
To conclude, data analysis is not just picking the right model. It's a lot of data cleaning, tuning, and testing decisions. There is usually no hard and fast answer for a data modeling problem, and the only way to find the best solution is to iterate and try different things at each step of the modeling.