Using VARISTA to predict customer LTV

2021.01.10 (Sun)
Service Analysis

This article has been translated from Japanese into English using DeepL.

Purpose of this article

The purpose of this article is to help you understand the flow of building a model to predict a customer's future LTV segment using VARISTA.


  • Overview
  • Data Preparation
  • Creating Training Data (RFM Analysis)
  • Building a prediction model
  • Checking the prediction model

This method can predict a customer's LTV segment slightly ahead of time, even without information on what the customer has bought.


It is important to know the lifetime value (LTV) of your customers.
In this article, we use open data to build a model that predicts which segment a customer will fall into over the next 6 months, based on 3 months of that customer's purchase data.


Prepare the data

The data should contain at least the following items:

  • Customer ID
  • Purchase date
  • Purchase amount (quantity, unit price)

In this article, we use the Online Retail Data Set to explain the flow.
This data set contains 541,909 purchase records with eight columns (see the figure below).

First, we process the data and perform RFM analysis to rank the customers.
The data covers December 1, 2010 to December 9, 2011. We calculate RFM over a 3-month window, then merge the result with data from the following 6 months to create the training data.
Example) RFM window: March 1, 2011 - May 31, 2011 (3 months)
Following window: June 1, 2011 - December 1, 2011 (6 months)
How to split the time period depends on the industry and service, so define the periods while building and validating the model.
Note, however, that RFM is not suitable for industries where purchases are infrequent (for example, a product that is purchased only once every few years).
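The window split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample rows, the tuple layout, and the helper name in_window are assumptions made for this example, and both windows are treated as half-open ranges (end date excluded).

```python
from datetime import date

# Hypothetical sample rows: (CustomerID, InvoiceDate, Quantity, UnitPrice).
purchases = [
    (17850, date(2011, 3, 15), 6, 2.55),
    (17850, date(2011, 7, 2), 2, 3.39),
    (13047, date(2011, 5, 30), 8, 2.75),
    (13047, date(2011, 12, 5), 1, 7.65),
]

# 3-month window used for RFM, and the following 6-month window used for LTV.
# End dates are exclusive, so June 1 belongs only to the second window.
FEATURE_START, FEATURE_END = date(2011, 3, 1), date(2011, 6, 1)
TARGET_START, TARGET_END = date(2011, 6, 1), date(2011, 12, 1)


def in_window(rows, start, end):
    """Return the rows whose purchase date falls in [start, end)."""
    return [row for row in rows if start <= row[1] < end]


feature_rows = in_window(purchases, FEATURE_START, FEATURE_END)
target_rows = in_window(purchases, TARGET_START, TARGET_END)
```

Keeping the window boundaries as named constants makes it easy to try other period lengths while validating the model.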

Creating Training Data

Let's pick an arbitrary 3-month period and calculate RFM for each customer. RFM stands for Recency, Frequency, and Monetary (Revenue), each of which becomes a variable.
Segment is obtained by clustering customers with k-means on OverallScore and dividing them into Rank 1 to Rank 3.
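As a rough sketch of the per-customer RFM calculation, the following uses only the Python standard library. The sample rows and the function name compute_rfm are hypothetical. It also assumes that Recency is the number of days between the last purchase and the end of the window, and that Frequency counts purchase rows; counting distinct invoices instead would be an equally reasonable choice.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows inside the 3-month window:
# (CustomerID, InvoiceDate, Quantity, UnitPrice).
feature_rows = [
    (17850, date(2011, 3, 15), 6, 2.55),
    (17850, date(2011, 5, 20), 2, 3.39),
    (13047, date(2011, 5, 30), 8, 2.75),
]
WINDOW_END = date(2011, 6, 1)  # exclusive end of the 3-month window


def compute_rfm(rows, window_end):
    """Compute Recency (days), Frequency (row count) and Revenue per customer."""
    last_purchase = {}
    frequency = defaultdict(int)
    revenue = defaultdict(float)
    for customer_id, day, quantity, unit_price in rows:
        if customer_id not in last_purchase or day > last_purchase[customer_id]:
            last_purchase[customer_id] = day
        frequency[customer_id] += 1
        revenue[customer_id] += quantity * unit_price
    return {
        customer_id: {
            "Recency": (window_end - last_purchase[customer_id]).days,
            "Frequency": frequency[customer_id],
            "Revenue": round(revenue[customer_id], 2),
        }
        for customer_id in last_purchase
    }


rfm = compute_rfm(feature_rows, WINDOW_END)
```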

Note that this Segment is different from the LTVCluster that appears in the second half.

Next, calculate each customer's LTV for the following 6 months and add it to the data.
We simply multiply UnitPrice by Quantity and add the result as 6_Month_Revenue.
This lets us relate a customer's buying behavior (RFM) during the 3-month period to how much revenue they bring in over the following 6 months.
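The merge of the RFM values with the 6-month revenue might look like the sketch below. The input values and the function name add_six_month_revenue are hypothetical, and the sketch assumes that a customer with no purchases in the following 6 months gets a 6_Month_Revenue of 0.

```python
from collections import defaultdict

# Hypothetical RFM values from the 3-month window, keyed by CustomerID.
rfm = {
    17850: {"Recency": 12, "Frequency": 2, "Revenue": 22.08},
    13047: {"Recency": 2, "Frequency": 1, "Revenue": 22.0},
}

# Hypothetical rows from the following 6 months: (CustomerID, Quantity, UnitPrice).
target_rows = [
    (17850, 4, 2.5),
    (17850, 10, 1.25),
]


def add_six_month_revenue(rfm_by_customer, rows):
    """Sum UnitPrice x Quantity per customer and attach it as 6_Month_Revenue."""
    totals = defaultdict(float)
    for customer_id, quantity, unit_price in rows:
        totals[customer_id] += quantity * unit_price
    # Assumption: customers with no purchases in the window get 0.0.
    return {
        customer_id: {**features, "6_Month_Revenue": totals.get(customer_id, 0.0)}
        for customer_id, features in rfm_by_customer.items()
    }


merged = add_six_month_revenue(rfm, target_rows)
```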

Then cluster the customers into three classes based on their LTV over the following 6 months.

  • LTVCluster is added as the last column.
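To illustrate the clustering step, here is a minimal one-dimensional k-means (Lloyd's algorithm) in plain Python. It is a sketch rather than the exact procedure used for this article: the function name kmeans_1d, the deterministic initialization, and the sample revenues are assumptions. Clusters are relabeled so that a lower number means a lower LTV.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster 1-D values with Lloyd's algorithm; label 0 is the lowest cluster."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    # Relabel so that cluster numbers increase with the cluster center.
    order = sorted(range(k), key=lambda c: centers[c])
    rank = {old: new for new, old in enumerate(order)}
    return [rank[label] for label in labels]


# Hypothetical 6_Month_Revenue values for seven customers.
six_month_revenue = [0.0, 50.0, 120.0, 900.0, 1100.0, 8000.0, 8500.0]
ltv_cluster = kmeans_1d(six_month_revenue, k=3)
```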

Let's see what the breakdown looks like for each cluster.
LTVCluster : A value between 0 and 2, where the lower the number, the lower the LTV.
count: Number of customers in the cluster
mean: Average LTV in the cluster
In the figure below, we can see that cluster 2 generates an average revenue of £8,222 over the 6 months, while cluster 0 generates an average of only £396.
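A breakdown with count and mean per cluster can be computed as in the sketch below. The rows are made-up values for illustration and do not reproduce the figures quoted above; the function name summarize_clusters is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (LTVCluster, 6_Month_Revenue) pairs.
clustered = [
    (0, 0.0), (0, 50.0), (0, 120.0),
    (1, 900.0), (1, 1100.0),
    (2, 8000.0), (2, 8500.0),
]


def summarize_clusters(rows):
    """Return the customer count and mean revenue for each cluster."""
    by_cluster = defaultdict(list)
    for cluster, revenue in rows:
        by_cluster[cluster].append(revenue)
    return {
        cluster: {"count": len(revenues), "mean": mean(revenues)}
        for cluster, revenues in sorted(by_cluster.items())
    }


summary = summarize_clusters(clustered)
```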

Now, we will proceed to build the model based on the data we just created.

However, LTVCluster, the column we want to predict, was derived by clustering on 6_Month_Revenue, so keeping that column would leak the target. Therefore, we delete 6_Month_Revenue.
Also, VARISTA transforms categorical variables automatically, but we apply one-hot encoding beforehand to speed up training.
The final training data looks like this:
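The two preparation steps, removing the leaking column and one-hot encoding, can be sketched as follows. The sample rows and the function name prepare_training_rows are hypothetical, and treating Segment as the categorical column to encode is an assumption made for this example.

```python
def prepare_training_rows(rows, leak_column="6_Month_Revenue", category_column="Segment"):
    """Drop the leaking column and one-hot encode one categorical column."""
    categories = sorted({row[category_column] for row in rows})
    prepared = []
    for row in rows:
        # Copy every column except the leak column and the raw category.
        out = {
            key: value
            for key, value in row.items()
            if key not in (leak_column, category_column)
        }
        # Add one 0/1 indicator column per category value.
        for category in categories:
            out[f"{category_column}_{category}"] = int(row[category_column] == category)
        prepared.append(out)
    return prepared


# Hypothetical rows before preparation.
raw_rows = [
    {"Recency": 12, "Frequency": 2, "Revenue": 22.08, "Segment": "Rank 1",
     "6_Month_Revenue": 22.5, "LTVCluster": 0},
    {"Recency": 2, "Frequency": 9, "Revenue": 950.0, "Segment": "Rank 3",
     "6_Month_Revenue": 8000.0, "LTVCluster": 2},
]
training_rows = prepare_training_rows(raw_rows)
```

The target column LTVCluster is kept as-is, since it is the value the model has to predict.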

Building a Predictive Model

We will use VARISTA's AutoML feature to build a prediction model.
Create a project and upload the data to VARISTA.
Let's try to visualize the data.
The figure below shows how each column correlates with LTVCluster. LTVCluster is correlated with Revenue and also appears to be correlated with OverallScore.

Next, we will use VARISTA's AutoML to build the model.
Select Model in the sidebar and select Create AI Model > Start Training.
At this point, make sure that the column to predict is set to LTVCluster.

After a while, training completes and the model built by VARISTA is displayed. The data set is not that large, so training finishes in about 10-20 minutes.

Checking the prediction model

The overall accuracy is shown as 80.5%. The accuracy for class 0 is 95.4%, but the accuracy for classes 1 and 2 does not seem to be very good.
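For reference, overall and per-class accuracy can be computed from true and predicted labels as in the sketch below. How VARISTA computes its per-class figure is not stated here, so this assumes it is the share of rows of each true class that were predicted correctly. The labels and the function name accuracy_report are made up for illustration.

```python
from collections import Counter


def accuracy_report(y_true, y_pred):
    """Return overall accuracy and the accuracy within each true class."""
    correct = Counter()
    total = Counter()
    for truth, predicted in zip(y_true, y_pred):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    per_class = {label: correct[label] / total[label] for label in sorted(total)}
    return overall, per_class


# Hypothetical labels: class 0 dominates and is predicted well.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 1]
overall, per_class = accuracy_report(y_true, y_pred)
```

With an imbalanced data set like this one, a high overall accuracy can hide weak results on the smaller classes, which is why the per-class view is worth checking.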
VARISTA automatically performs cross-validation using 20% of the training data (the default value).
To further improve accuracy, consider adding more variables to the data or engineering new ones from the existing columns.
Classes 1 and 2 have far fewer rows than class 0, so if more data can be obtained for them, consider raising their counts to about the same level as class 0.
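When collecting more data for the smaller classes is not possible, one common substitute is random oversampling, which duplicates existing rows until every class matches the largest one. This is a different technique from gathering new data and is shown only as a sketch; the function name oversample and the sample rows are hypothetical.

```python
import random
from collections import defaultdict


def oversample(rows, label_key="LTVCluster", seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical imbalanced data: four rows of class 0, two of class 1, one of class 2.
rows = (
    [{"Revenue": 10.0 * i, "LTVCluster": 0} for i in range(4)]
    + [{"Revenue": 500.0 + i, "LTVCluster": 1} for i in range(2)]
    + [{"Revenue": 9000.0, "LTVCluster": 2}]
)
balanced_rows = oversample(rows)
```

Oversampling should be applied only to the rows used for training, so that duplicated rows do not also appear in the data used for validation.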
In addition to the simple summary displayed in VARISTA, you can also view detailed information about the training run.
The basic functions of VARISTA are available free of charge, so please register from the link below and give it a try!
Also, if you have any questions about data creation, please feel free to contact us via chat on our official website.

Made by VARISTA Team.
© COLLESTA, Inc. 2021. All rights reserved.