• Analytics Wisdom
  • Posts
  • What is a Cohort Analysis: User Retention Analysis for B2C with SQL

What is a Cohort Analysis: User Retention Analysis for B2C with SQL

Understand the Long-Term Value of Your Customers Using Cohort Analysis

One of the best depictions of cohort analysis. Source: Mosaic Tech

What is a cohort analysis, and why should you give a damn?

Cohort analysis is a powerful analytics technique to understand user behavior over time. That’s it, it is that simple.

Product owners can group users based on a specific attribute, such as sign-up date or subscription plan type. By analyzing these groups, they can gain insights into their product's effectiveness, customer behavior, and overall user retention rates. Understanding cohort analysis and how to identify pain-points of users is gonna be a game-changer for you and your organization. This will be helpful for B2C product managers and analysts.

Let me give you an example:
Imagine you go to a new pancake house 🥞 and try pancakes for the first time. If you liked the pancake, you might return to try different types in a week or two. Cohort analysis helps the pancake shop owner keep track of people like you who return for more pancakes or his other products. This way, the owner can determine what makes their products so delicious and keep you and other customers returning for more.

In this example, the owner could use cohort analysis to identify when and why customers stop returning. Maybe he noticed that customers who first visited during the weekend tended to return more often than those who visited on weekdays. This could lead him to explore potential pain points that may be causing customers to not return during weekdays, such as longer wait times or less attentive service.

I’m this customer

Here are a mere few types of cohorts:

  1. Time-based cohorts: grouping customers based on the time of their first purchase or interaction with the product.

  2. Behavioral cohorts: grouping customers based on their behavior or actions within the product. For example, grouping customers who have made a purchase or who have not made a purchase.

  3. Acquisition cohorts: grouping customers based on the channel or source they came from, such as social media, search engine, or referral.

  4. Demographic cohorts: grouping customers based on demographic information such as age, gender, or location. Usually used in pharmaceutical data analysis.

Cohort Analysis in SQL

Case Study

A coffee beans company has been operating for over a year and wants to understand its user retention. They want to see how many users are returning customers and how many new customers they acquire. They want to use this data to optimize their marketing and sales strategies to increase user retention and sales.

Here is the dataset that we are provided.

What the dataset looks like

The Objective

This case study aims to analyze the user retention rate for the coffee beans online company and identify any trends in user behavior. By understanding customer retention trends, the company can gain insights into

  • Which user segments are most loyal

  • Which segments are at risk of churning

  • How effective retention efforts are.

This will help the company make data-driven decisions on marketing campaigns, loyalty programs, and product offerings. Additionally, by analyzing the revenue generated by customers in each cohort, the company can identify which products and marketing strategies are most effective in driving revenue and prioritize their efforts accordingly.

SQL Code

Note: SQL syntax may vary depending on the tool you use, but the logic explained in this article on how to approach a cohort analysis is global.

First, we aggregate the data by user, cohort month, product, and amount. Let’s call this CTE “cohort.” It also finds the earliest signup date for each user, which will be used to calculate the user's cohort. This CTE will be the foundation for the rest of the analysis.

SELECT
    user_id,
    MIN(signup_date) AS cohort_signup_date,
    DATE_TRUNC('month', order_date) AS cohort_month,
    product_id,
    amount
  FROM
    `concise-atlas-369115.random_sql.user_data` table_A
  GROUP BY
    1, 2, 3, 4, 5

Then create another CTE called “cohorts “that calculates each cohort's total number of users. It groups the data by the cohort month, and the earliest signup date for each user and ultimately counts the number of distinct users.

SELECT
    cohort_month,
    cohort_signup_date,
    product_id,
    amount,
    COUNT(DISTINCT user_id) AS total_users
  FROM
    cohort
  GROUP BY
    1, 2, 3, 4

Now let's create a third CTE called "retention" to calculate the retention rate for each cohort and month. This will help us bring closer to the final result, clearly understanding user retention for our coffee beans online company. To achieve this, we will join the "cohort" and "cohorts" CTEs and group the data by cohort month, month number, and the total number of users. We will also calculate the count of distinct users who placed orders each month for each cohort to determine how many users return to make repeat purchases.

SELECT
    cohort.cohort_month,
    DATE_DIFF('month', cohort.cohort_signup_date, cohort.order_date) AS month_num,
    cohort.product_id,
    cohort.amount,
    COUNT(DISTINCT cohort.user_id) AS user_count,
    cohorts.total_users
  FROM
    cohort
  JOIN
    cohorts
  ON
    cohort.cohort_month = cohorts.cohort_month
    AND cohort.cohort_signup_date = cohorts.cohort_signup_date
    AND cohort.product_id = cohorts.product_id
    AND cohort.amount = cohorts.amount
  GROUP BY
    1, 2, 3, 4, 6

Finally, we combine the puzzle pieces, select the data from the retention CTE, and calculate the retention rate. We will do this by dividing the number of users who placed orders in a given month by the total number of users in the cohort, then multiplying by 100 to get a percentage. This will help us understand how well we retain our customers over time. Finally, we will order the data by cohort month and month number to analyze the retention trend for each cohort.

Retention rate = (number of users who placed orders in a given month / total number of users in the cohort) x 100Retention rate = (number of users who placed orders in a given month / total number of users in the cohort) x 100

Retention rate equation

This is the entire code pieced together.

WITH cohort AS (
  SELECT
    user_id,
    MIN(signup_date) AS cohort_signup_date,
    DATE_TRUNC('month', order_date) AS cohort_month,
    product_id,
    amount
  FROM
    `concise-atlas-369115.random_sql.user_data` table_A
  GROUP BY
    1, 2, 3, 4, 5
),

cohorts AS (
  SELECT
    cohort_month,
    cohort_signup_date,
    product_id,
    amount,
    COUNT(DISTINCT user_id) AS total_users
  FROM
    cohort
  GROUP BY
    1, 2, 3, 4
),

retention AS (
  SELECT
    cohort.cohort_month,
    DATE_DIFF('month', cohort.cohort_signup_date, cohort.order_date) AS month_num,
    cohort.product_id,
    cohort.amount,
    COUNT(DISTINCT cohort.user_id) AS user_count,
    cohorts.total_users
  FROM
    cohort
  JOIN
    cohorts
  ON
    cohort.cohort_month = cohorts.cohort_month
    AND cohort.cohort_signup_date = cohorts.cohort_signup_date
    AND cohort.product_id = cohorts.product_id
    AND cohort.amount = cohorts.amount
  GROUP BY
    1, 2, 3, 4, 6
)

SELECT
  retention.cohort_month,
  retention.month_num,
  retention.product_id,
  retention.amount,
  ROUND(retention.user_count / cohorts.total_users * 100, 2) AS retention_rate
FROM
  retention
JOIN
  cohorts
ON
  retention.cohort_month = cohorts.cohort_month
  AND retention.cohort_signup_date = cohorts.cohort_signup_date
  AND retention. product_id = cohorts.product_id
  AND retention.amount = cohorts.amount
ORDER BY
  1, 2, 3, 4

Using Seaborn and Heatmap to plot the results

Cohort Analysis using Seaborn Heatmap

What does this tell us?

  • The data shows a cohort analysis of coffee purchases over time.

  • The analysis reveals a significant drop-off in the second month (indexed as 1), with only around 25% of customers purchasing on average.

  • The first cohort of customers, who made their initial purchases in December 2019, performed better than the other cohorts, with a higher retention rate after one year.

  • Post 1 year, the retention rate for all cohorts combined was 50%, indicating that half of the customers who made an initial purchase in the first month returned to make another purchase within a year.

  • The high retention rate for the December 2010 cohort suggests that these customers may have had existing connections with the coffee retailer or may have been more dedicated coffee drinkers than the other cohorts. However, it is difficult to draw definitive conclusions from the data alone.

How can the retailer use this?

  • The retailer can identify the months of drop-offs in customer purchases and focus on strategies to increase sales during those periods.

  • They can now target marketing efforts toward customers who made their initial purchase in December 2019, as this cohort has a higher retention rate and may represent a loyal customer base.

  • Product strategies can be created to retain existing customers and attract new ones, such as introducing loyalty programs, offering discounts for repeat purchases, or introducing new coffee blends to keep customers interested.

  • The retailer can analyze the data further to identify the reasons for the drop-off in purchases in the second month and address any potential issues, such as price points, product quality, or customer service.

Thank you for reading. I hope this helped understand how a cohort analysis can help B2C businesses understand customer retention rates. Check out my other article on customer acquisition and retention for B2C and SAAS products.