When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (“The Shape of Data”) from our paid video course Introduction to Data Science discusses this).
One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready for analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business appropriate goal when evaluating predictive models.
For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.
The Setup
One lesson of survival analysis is that it is a lot more practical to model the hazard function (the fraction of accounts terminating at a given date, conditioned on the account being active just prior to the date) than to directly model account lifetime or account survival. Knowing to re-state your question in terms of hazard is a big step (as is figuring out how to sessionize your data, how to define positive and negative instances, how to select variables, and how to evaluate a model). Let’s set up our example modeling situation.
Suppose you have a mobile app with both free (A) and paid (B) actions; if a customer’s tasks involve too many paid actions, they will abandon the app. Your goal is to detect when a customer is in a state when they are likely to abandon, and offer them (perhaps through an in-app ad) a more economical alternative, for example a “Pro User” subscription that allows them to do what they are currently doing at a lower rate. You don’t want to be too aggressive about showing customers this ad, because showing it to someone who doesn’t need the subscription service is likely to antagonize them (and convince them to stop using your app).
Suppose the idealized data is collected in a log-style form, like the following:
   dayIndex accountId eventType
1       101  act10000         A
2       101  act10000         A
3       101  act10000         A
4       101  act10000         A
5       101  act10000         A
6       101  act10000         A
7       101  act10000         A
8       101  act10000         A
9       101  act10000         A
10      101  act10003         B
11      101  act10003         A
12      101  act10003         A
13      101  act10003         A
14      101  act10003         A
15      101  act10003         A
16      101  act10003         A
17      101  act10003         A
18      101  act10012         B
For every customer, on every day (dayIndex, which we can think of as the date), we’ve recorded each action, and whether it’s A or B. In realistic data you’d likely have more information, for example exactly what the actions were, perhaps how much the customer paid per B action, and other details about customer history or demographics. But this simple case is enough for our discussion.
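From a log in this shape, the first aggregation step is usually just counting events per account per day. Here is a minimal base-R sketch, assuming the log is in a data frame called logData (a hypothetical name, not from the original post):

# count the number of A and B events for each account on each day
dailyCounts = aggregate(
  list(nEvents = rep(1, nrow(logData))),
  by = logData[, c("accountId", "dayIndex", "eventType")],
  FUN = sum
)
head(dailyCounts)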
Just analyzing data of this type raises several issues:
Ragged vs. uniform use of time when generating training examples
There are two ways to collect customers to use in the training set:
(1) pick a specific date, say one month ago, select a subset of your customer set from that day, and use those customers’ historical data (say, the last few months’ activity for those customers) as the training set. We’ll call this a uniform time training set.
(2) select a subset from the set of all your customers over all time (including some who may not currently be customers), and use their historical data as the training set. We’ll call this a ragged time training set.
The first method has the advantage that the training set exactly reflects how the model will be applied in practice: on a set of customers all on the same date. However, it limits the size of your training set, and if abandonment is very rare, then it limits the number of positive examples available for the modeling algorithm to learn from. The second method potentially allows you to build a larger training set (with more positive examples), but it has a number of pitfalls:
- The prevalence of positive examples in the training set may differ from the prevalence you would observe on a given day. If the abandonment process in your customer population is stationary — it does not change over time and has no trends — then the abandonment rate in a ragged training set will look like the abandonment rate in a uniform training set. It’s unlikely that the abandonment process is stationary (though perhaps it’s nearly so). If you are using a modeling algorithm that is sensitive to class prevalence, like logistic regression, this can cause a problem.
A corollary to this observation is that even if you use a uniform training set, you should be prepared to retrain or otherwise update the model at a reasonable frequency, to account for concept drift.
- Time trends in the features. Variables may have different meanings in different time periods. In our example, you may have changed the prices of the paid features in your phone app; in another domain, a $200,000 home may mean something different today than it did last year (relative to the median home price in the region, for example). If you are using data from different time periods, you should account for such effects.
You could consider using several uniform time sets: pick a date from last month, one from the month before, and so on. If the abandonment process changes slowly enough, this alleviates the data scarcity issue without affecting the prevalence of positive examples. You may still have issues with time trends in the variables, and you will have duplicated data: many customers from a month ago were also customers two months ago, and so can show up in the data twice. Depending on the domain and your goal, this may or may not matter. Also, you need to be careful that the same customer does not end up both in the training and test sets (see our article on structured test/train splits).
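To make the uniform/ragged distinction concrete, here is a minimal sketch of the two selection strategies. The data frame activeDays (one row per account per day it was active) and the variable names are hypothetical, not from the original post:

refDay = 0   # the reference date

# (1) uniform time training set: accounts active on the reference day
uniformIds = unique(activeDays$accountId[activeDays$dayIndex == refDay])

# (2) ragged time training set: a sample of all accounts ever seen,
#     including accounts that are no longer customers
allIds = unique(activeDays$accountId)
raggedIds = sample(allIds, size = min(2000, length(allIds)))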
Defining Positive Examples
What do you consider a positive example? A customer who will leave tomorrow, within the next week, or within the next year? Being able to predict abandonment far in advance is nice, but it’s also a noisier problem; someone who will leave a year from now probably looks today a lot like someone who won’t leave in a year. If minimizing false positives is a subgoal (as it is in our example problem), then you might not want to attempt long-range prediction. Hopefully the signals will be stronger the closer a customer gets to abandoning, but you also want to catch them while you still have time to do something about it.
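One quick sanity check, once each example has a days-to-exit outcome (we construct one, daysToX, below): see how the positive rate changes as you widen the horizon. A hedged sketch:

# fraction of training examples that would be labeled positive at
# horizons of 1, 3, 7, and 30 days (daysToX is defined later in this post)
sapply(c(1, 3, 7, 30), function(h) mean(dTrainS$daysToX <= h))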
Picking the Features
In this example, you suspect that customers abandon your app when they start to access paid features at too high a rate. But what’s too high a rate? Is that measured in absolute terms, or relative to their total app usage? And what’s the proper measurement window? You want to measure their usage rates over a window that’s not too noisy, but still detects relevant patterns in time for the information to be useful.
The Data
For this artificial example, we created a population of customers who initially begin in a “safe” state in which they generate events via two Poisson processes, with A events generated at ten times the rate of B events. Customers also have a 10% chance every day of switching to an “at risk” state, in which they begin to generate B events at five times the rate that they did in the “safe” state (they also generate A events at a reduced rate, so that their total activity rate stays constant). Once they are in the “at risk” state, they have a 20% chance of exiting (abandoning the app — recorded as state X).
To build a data set, we start with an initial customer population of 1500, let the simulation run for 100 days to “warm up” the population and get rid of boundary conditions, then collect data for 100 more days to form the data set. We also generate new customers every day via a Poisson process with an intensity of 100 customers per day. The expected time for a customer to go into “at risk” is ten days; once they are in the “at risk” state, they stay another five days (in expectation), giving an expected lifetime of fifteen days (of course in reality you wouldn’t know about the internal state changes of your customers). Note that by the way we’ve constructed the population, the lifetime process is in fact stationary and memoryless.
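For concreteness, here is a rough sketch of how one day of this generative process might look for a single customer. The absolute event rates (via baseRateB) are illustrative guesses that respect the ratios described above; this is not the authors’ actual simulation code:

simulateDay = function(state, baseRateB = 1) {
  # state is one of "safe", "atrisk", "X" (exited)
  if (state == "X") return(list(state = "X", nA = 0, nB = 0))
  # 10% daily chance of moving from "safe" to "at risk"
  if (state == "safe" && runif(1) < 0.10) state = "atrisk"
  # once at risk, 20% daily chance of exiting
  if (state == "atrisk" && runif(1) < 0.20) return(list(state = "X", nA = 0, nB = 0))
  if (state == "safe") {
    rateA = 10 * baseRateB          # A events at ten times the B rate
    rateB = baseRateB
  } else {
    rateB = 5 * baseRateB           # B rate five times higher when at risk
    rateA = 11 * baseRateB - rateB  # A reduced so total activity stays constant
  }
  list(state = state, nA = rpois(1, rateA), nB = rpois(1, rateB))
}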
This is obviously much cleaner data than you would have in real life, but it’s enough to let us walk through the analysis process.
The Data Treatment
We chose a uniform time training set: a set of customers present on a reference day (“day 0”) and the ten days previous to that (days 1 through 10), and recorded how many days until each customer exited (Inf for customers who never exit), counting from day 0. The hold-out set has the same structure. We defined positive examples as those customers who would exit within seven days of day 0. Rather than guessing the appropriate sessionizing window length ahead of time, we constructed all possible windows within that history (days 0 through 10), and counted the relative rates of A events and B events in each window. This gave us data sets of approximately 650 rows (648 for training, 660 for hold-out) and 132 features; one row per customer, one feature per window. We’ll discuss how we created the wide data sets from the “skinny” log data in a future post; you can download the wide data set we used as an .rData file here.
The resulting data has the following columns:
colnames(dTrainS)
  [1] "accountId"    "A_0_0"        "A_1_0"
  [4] "A_1_1"        "A_10_0"       "A_10_1"
  [7] "A_10_10"      "A_10_2"       "A_10_3"
 [10] "A_10_4"       "A_10_5"       "A_10_6"
 [13] "A_10_7"       "A_10_8"       "A_10_9"
 [16] "A_2_0"        "A_2_1"        "A_2_2"
 [19] "A_3_0"        "A_3_1"        "A_3_2"
 [22] "A_3_3"        "A_4_0"        "A_4_1"
 [25] "A_4_2"        "A_4_3"        "A_4_4"
 [28] "A_5_0"        "A_5_1"        "A_5_2"
 [31] "A_5_3"        "A_5_4"        "A_5_5"
 [34] "A_6_0"        "A_6_1"        "A_6_2"
 [37] "A_6_3"        "A_6_4"        "A_6_5"
 [40] "A_6_6"        "A_7_0"        "A_7_1"
 [43] "A_7_2"        "A_7_3"        "A_7_4"
 [46] "A_7_5"        "A_7_6"        "A_7_7"
 [49] "A_8_0"        "A_8_1"        "A_8_2"
 [52] "A_8_3"        "A_8_4"        "A_8_5"
 [55] "A_8_6"        "A_8_7"        "A_8_8"
 [58] "A_9_0"        "A_9_1"        "A_9_2"
 [61] "A_9_3"        "A_9_4"        "A_9_5"
 [64] "A_9_6"        "A_9_7"        "A_9_8"
 [67] "A_9_9"        "B_0_0"        "B_1_0"
 [70] "B_1_1"        "B_10_0"       "B_10_1"
 [73] "B_10_10"      "B_10_2"       "B_10_3"
 [76] "B_10_4"       "B_10_5"       "B_10_6"
 [79] "B_10_7"       "B_10_8"       "B_10_9"
 [82] "B_2_0"        "B_2_1"        "B_2_2"
 [85] "B_3_0"        "B_3_1"        "B_3_2"
 [88] "B_3_3"        "B_4_0"        "B_4_1"
 [91] "B_4_2"        "B_4_3"        "B_4_4"
 [94] "B_5_0"        "B_5_1"        "B_5_2"
 [97] "B_5_3"        "B_5_4"        "B_5_5"
[100] "B_6_0"        "B_6_1"        "B_6_2"
[103] "B_6_3"        "B_6_4"        "B_6_5"
[106] "B_6_6"        "B_7_0"        "B_7_1"
[109] "B_7_2"        "B_7_3"        "B_7_4"
[112] "B_7_5"        "B_7_6"        "B_7_7"
[115] "B_8_0"        "B_8_1"        "B_8_2"
[118] "B_8_3"        "B_8_4"        "B_8_5"
[121] "B_8_6"        "B_8_7"        "B_8_8"
[124] "B_9_0"        "B_9_1"        "B_9_2"
[127] "B_9_3"        "B_9_4"        "B_9_5"
[130] "B_9_6"        "B_9_7"        "B_9_8"
[133] "B_9_9"        "daysToX"      "defaultsSoon"
The feature columns are labeled by type of event (A or B), the first day of the window, and the last day of the window: so A_0_0 means “fraction of events that were A events today (day 0)”, B_8_5 means “fraction of events that were B events from eight days back to five days back” (a window of length 4), and so on. The column daysToX is the number of days until the customer exits; defaultsSoon is true if daysToX <= 7.
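To give a rough idea of what one of these features looks like (the authors’ actual construction is the subject of a later post), here is a hedged sketch that computes the equivalent of B_8_5 from per-day counts. The frame dailyCounts and its columns (accountId, daysBack, nA, nB, with daysBack = 0 being day 0) are hypothetical names:

windowFraction = function(counts, first, last, eventCol = "nB") {
  # keep only the days in the window [last, first], counting backwards from day 0
  w = counts[counts$daysBack <= first & counts$daysBack >= last, , drop = FALSE]
  # total A and B counts per account over the window
  agg = aggregate(w[, c("nA", "nB")], by = list(accountId = w$accountId), FUN = sum)
  agg$frac = agg[[eventCol]] / (agg$nA + agg$nB)
  agg[, c("accountId", "frac")]
}

# the equivalent of B_8_5: fraction of B events from 8 days back to 5 days back
b_8_5 = windowFraction(dailyCounts, first = 8, last = 5, eventCol = "nB")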
This naive sessionizing can quickly generate very wide data sets, especially if there are more than two classes of events, if we want to consider wider windows, or if we have several types of log measurements that we want to aggregate and sessionize. (With an eleven-day history, days 0 through 10, there are 66 possible windows per event type, which is where the 132 features above come from.) You can imagine situations where you generate more features than you have datums (customers) in the training set. In future posts we will look at alternative approaches.
Modeling
Principled feature selection (or even better, principled feature generation) before modeling is a good idea, but for now let’s just feed the sessionized data into regularized (ridge) logistic regression and see how well it can predict soon-to-exit customers.
library(glmnet)

# loads vars (names of vars), yVar (name of y column),
# dTrainS, dTestS
load("wideData.rData")

# assuming the xframe is entirely numeric;
# if there are factor variables, use model.matrix
ridge_model = function(xframe, y, family="binomial") {
  model = glmnet(as.matrix(xframe), y,
                 alpha=0, lambda=0.001,
                 family=family)
  list(coef = coef(model),
       deviance = deviance(model),
       predfun = ridge_predict_function(model)
  )
}

# assuming xframe is entirely numeric
ridge_predict_function = function(model) {
  # to get around the 'unfulfilled promise' leak. blech.
  force(model)
  function(xframe) {
    as.numeric(predict(model, newx=as.matrix(xframe), type="response"))
  }
}

model = ridge_model(dTrainS[,vars], dTrainS[[yVar]])
testpred = model$predfun(dTestS[,vars])
dTestS$pred = testpred
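We fixed the regularization strength at lambda = 0.001 above. In practice you might prefer to let cross-validation pick it; a minimal sketch (not part of the original analysis) using glmnet’s built-in cross-validation:

# pick lambda by cross-validation instead of fixing it by hand
cvfit = cv.glmnet(as.matrix(dTrainS[, vars]), dTrainS[[yVar]],
                  alpha = 0, family = "binomial")
cvfit$lambda.min   # a data-driven choice of the regularization strength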
Evaluating the Model
You can plot the distribution of model scores on the holdout data as a function of class label:
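The original figure was drawn with the WVPlots package; here is a hedged ggplot2 sketch of an equivalent plot, plus a direct AUC computation via the rank (Mann-Whitney) formulation:

library(ggplot2)

# score distributions on the hold-out set, split by true class label
ggplot(dTestS, aes(x = pred, color = defaultsSoon)) +
  geom_density() +
  xlab("model score")

# AUC via the Mann-Whitney / rank formulation
aucScore = function(pred, truth) {
  r = rank(pred)
  npos = sum(truth)
  nneg = sum(!truth)
  (sum(r[truth]) - npos * (npos + 1) / 2) / (npos * nneg)
}
aucScore(dTestS$pred, dTestS[[yVar]])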
The model mostly separates about-to-exit customers from the others, although far from perfectly (the AUC of this model is 0.78). To evaluate whether this model is good enough, you should take into account how the output of the model is to be used. You can use the model as a classifier, by picking a threshold score (say 0.5) to sort the customers into “about to exit” and not. In this case, look at the confusion matrix:
dTestS$predictedToLeave = dTestS$pred > 0.5

# confusion matrix
cmat = table(pred=dTestS$predictedToLeave, actual=dTestS[[yVar]])
cmat
##        actual
## pred    FALSE TRUE
##   FALSE   205   80
##   TRUE     99  276

recall = cmat[2,2]/sum(cmat[,2])
recall
## [1] 0.7752809

precision = cmat[2,2]/sum(cmat[2,])
precision
## [1] 0.736
The model found 78% of the about-to-exit customers in the holdout set; of the customers identified as about-to-exit, about 74% of them actually did exit within seven days (the remaining 26% were false positives).
Alternatively you could use the model to prioritize your customers with respect to who should see in-app ads that encourage them to consider a subscription service. The improvement you can get by using the model score to prioritize ad placement is summarized in the gain curve:
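The gain curve in the post was also drawn with WVPlots; a rough, hand-rolled ggplot2 equivalent (a sketch, using the same column names as above) looks like this:

library(ggplot2)

# cumulative fraction of about-to-leave customers reached vs. fraction targeted
gainFrame = function(score, truth) {
  n = length(truth)
  data.frame(
    fracTargeted = (1:n) / n,
    model = cumsum(truth[order(score, decreasing = TRUE)]) / sum(truth),  # sorted by score
    best  = cumsum(sort(truth, decreasing = TRUE)) / sum(truth)           # best possible sort
  )
}

gf = gainFrame(dTestS$pred, dTestS[[yVar]])
ggplot(gf, aes(x = fracTargeted)) +
  geom_line(aes(y = model), color = "blue") +
  geom_line(aes(y = best), color = "green") +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +   # random targeting
  xlab("fraction of customers targeted") +
  ylab("fraction of about-to-leave customers reached")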
If you sort your customers by model score (decreasing), then the blue curve shows what fraction of about-to-leave customers you will reach, as a fraction of the number of customers you target based on the model’s recommendations; the green curve shows the best you can do on this population of customers, and the diagonal line shows what fraction of about-to-leave customers you reach if you target at random. As shown on the graph, if you target the 20% highest-risk customers (as scored by the model), you will reach 30% of your about-to-leave customers. This is an improvement over the 20% you would expect to hit at random; the best you could possibly do targeting only 20% of your customers is about 37% of the about-to-leaves.
The confusion matrix and the gain curve help you pick a trade-off: targeting in-app ads to try to retain at-risk customers, without antagonizing too many customers who are not at risk by showing them an irrelevant ad.
Evaluating Utility
The distribution of days until exit by class label confirms that “risky” (according to the model) customers do in general exit sooner:
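A sketch of one way to draw this comparison (Inf values, for customers who never exit, are capped just for display):

library(ggplot2)
plotFrame = dTestS
cap = max(plotFrame$daysToX[is.finite(plotFrame$daysToX)])
plotFrame$daysToXcapped = pmin(plotFrame$daysToX, cap)
ggplot(plotFrame, aes(x = factor(predictedToLeave), y = daysToXcapped)) +
  geom_boxplot() +
  xlab("predicted to leave") +
  ylab("days to exit (capped)")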
But you also want to double-check that the model identifies abandoning customers soon enough. Once the model has identified someone as being at risk, how long do you have to intervene?
# make daysToX finite. The idea is that the live-forevers should be rare
isinf = dTestS$daysToX==Inf
maxval = max(dTestS$daysToX[!isinf])
dTestS$daysToX = with(dTestS, ifelse(daysToX==Inf, maxval, daysToX))

# how long on average until flagged customers leave?
posmean = mean(dTestS[dTestS$predictedToLeave, "daysToX"])
posmean
## [1] 5.693333

# how many days until true positives (customers flagged as leaving
# who really do leave) leave?
tpfilter = dTestS$predictedToLeave & dTestS[[yVar]]
trueposmean = mean(dTestS[tpfilter, "daysToX"])
trueposmean
## [1] 2.507246
Ideally, you’d like the above distribution to be skewed to the right: that is, you want the model to identify at-risk customers as early as possible. You probably can’t intervene in time to save customers who are leaving today (day 0) or tomorrow (you can think of these customers as recall errors from “yesterday’s” application of the model). Fortunately, on average this model catches at-risk customers a few days before they leave, giving you time to put the appropriate in-app ad in front of them. Once you put this model into operation, you will further want to monitor the flagged customers, to see if your intervention is effective.
Conclusion
For sessionized problems the easiest way to make a “best classifier” is to cheat the customer and try only to predict events right before they happen. This allows your model to use small windows of near-term data and look artificially good. In practice you need to negotiate with your customer how far out a prediction is useful for the customer and build a model with training data oriented towards that goal. Even then you must re-inspect such a model, as even a properly trained near-term event model will have a significant (and low-utility) component given by events that are essentially happening immediately. These “immediate events” are technically correct predictions (so they don’t penalize precision and recall statistics), but are also typically of low business utility as they don’t give the business time for a meaningful intervention.
Next:
As mentioned above, you would prefer to have a principled variable selection technique. This will be the topic of our next article in this series.
The R markdown script describing our analysis is here. The plots are generated using our own in-progress visualization package, WVPlots. You can find the source code for the package on GitHub, here.
The plot of aggregated door traffic log data shown at the top of the post uses data from Ihler, Hutchins and Smyth, “Adaptive event detection with time-varying Poisson processes”, Proceedings of the 12th ACM SIGKDD Conference (KDD-06), August 2006. The data can be downloaded from the UCI Machine Learning Repository, here.
Nina Zumel
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.