Synthetic Data in a Market Basket Analysis


Introduction

This week we launched the multi-value probability specification. This specification allows users to configure a probability on a combination of two or more values! We will use a Market Basket Analysis to demonstrate this feature.

Use Case

Imagine a grocery store that would like to understand the relationship between two items sold in the store. The store would like to know if there is a relationship between Milk and Eggs in customers' purchases.

This relationship will help them gain insight into the following questions:

The store tasks the data science team with building an algorithm that reveals the association between Milk and Eggs. Ultimately, the team wants to answer the following question:

The team wants to prototype the algorithm right away. However, the same old question comes up:

How do we get access to this data?

That question always holds back the data science team. It could mean days or weeks spent requesting access to databases, getting approvals, or creating scripts to generate mock data. These activities take away from the ability to prototype quickly and focus on a working solution.

Overseed to the rescue

With Overseed, they can start right away using synthetic data! The team can build a schema and generate the data they need for their prototype. Moreover, they can version this schema as they uncover edge cases and make changes to their data.

Building the schema

The team designs a schema and configures the values to establish a strong association between Milk and Eggs as follows:

item_1: "fruit"
item_2: "vegetable"
item_3: "string"
item_4: "string"
RandoSpecs: [
    #RandoSpec & {
        linkValues: {
            "item_3": ["Milk", "Eggs", "Milk", ""]
            "item_4": [  ""  ,   ""  , "Eggs", ""]
        }
        p: {
            "0": 10
            "1": 10
            "2": 5
            "3": 75
        }
    }
]
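To make the linkValues and p settings concrete, here is a minimal Python sketch of how such a specification could be read. It is not Overseed's implementation. It assumes that each index in p selects the same position in every linkValues list, and that the numbers in p are percentages. The names LINK_VALUES, WEIGHTS, and sample_rows are hypothetical.

```python
import random
from collections import Counter

# Combinations copied from the schema's linkValues, indexed 0..3.
LINK_VALUES = {
    "item_3": ["Milk", "Eggs", "Milk", ""],
    "item_4": ["", "", "Eggs", ""],
}
# Weights copied from the schema's p block (assumed to be percentages).
WEIGHTS = [10, 10, 5, 75]


def sample_rows(n, seed=0):
    """Draw n (item_3, item_4) pairs; both fields share one sampled index."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(WEIGHTS)), weights=WEIGHTS, k=n)
    return [(LINK_VALUES["item_3"][i], LINK_VALUES["item_4"][i]) for i in indices]


rows = sample_rows(10_000)
counts = Counter(rows)
for combo, count in counts.most_common():
    # Share of rows holding each configured combination.
    print(combo, f"{count / len(rows):.1%}")
```

Under these assumptions, roughly 10% of rows hold only Milk, 10% only Eggs, 5% hold Milk and Eggs together, and 75% hold neither.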

The schema specifies four fields, simulating a market basket of 4 items:

Generating the data

Now that the team has configured the association between Milk and Eggs, they can generate a synthetic dataset for their algorithm to analyze. They upload this schema directly to Overseed and create a static generator. A static generator is a good choice since the team will use a CSV file in their prototype. If the team chose to make this a real-time system in the future, they could use a stream generator.

Connecting to the data

The team decided to build the prototype in a Jupyter Notebook. In the notebook, the team calls the Overseed API and retrieves the generated dataset.

Next, the notebook displays the counts for the values in the synthetic dataset and a histogram for the Milk and Eggs values, as shown in the image below.

The histogram shows that Milk and Eggs on their own occur about equally often, while the combination of the two occurs about half as often. This matches the configured probabilities, so the team expects the Apriori algorithm to find a strong relationship between the two items!
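The counting step can be sketched in a few lines of Python. This is not the notebook's code. It assumes the generated CSV has the columns item_1 to item_4 from the schema, and the rows in SAMPLE_CSV are made up for illustration. The function name count_combinations is hypothetical.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample; the real file comes from the static generator.
SAMPLE_CSV = """item_1,item_2,item_3,item_4
Apple,Carrot,Milk,
Pear,Kale,Eggs,
Apple,Leek,Milk,Eggs
Plum,Kale,,
Pear,Leek,Milk,
Plum,Carrot,Eggs,
"""


def count_combinations(csv_text):
    """Count each (item_3, item_4) combination found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["item_3"], row["item_4"]) for row in reader)


combo_counts = count_combinations(SAMPLE_CSV)
for combo, count in combo_counts.most_common():
    print(combo, count)
```

In this sample, Milk alone and Eggs alone each appear twice and the Milk and Eggs combination appears once, the same shape the histogram shows at a larger scale.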

Finally, they calculate the association between all the items in the dataset. The team is interested in only the Milk and Eggs relationship, so they output that result as shown in the image below.

Success! The team was able to capture the association between Milk and Eggs as configured in the synthetic dataset. They can now conclude the following from their output:

  1. Milk and Eggs are bought together by about 4.85% of our customers (support).
  2. Customers who buy Milk also buy Eggs about 32.79% of the time (confidence).
  3. Customers who just bought Milk are about 2.19 times as likely to purchase Eggs as customers overall (a lift of about 2.1862).
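The three figures above correspond to what association rule mining calls support, confidence, and lift. As a sanity check, the sketch below computes them with their standard definitions on an idealized dataset built directly from the schema's weights. It is plain Python rather than the Apriori implementation used in the notebook, and rule_metrics is a hypothetical helper.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    has_a = sum(antecedent in b for b in baskets)
    has_c = sum(consequent in b for b in baskets)
    has_both = sum(antecedent in b and consequent in b for b in baskets)
    support = has_both / n              # P(A and C)
    confidence = has_both / has_a       # P(C given A)
    lift = confidence / (has_c / n)     # P(C given A) / P(C)
    return support, confidence, lift


# Idealized dataset built from the schema's weights: out of every 100
# baskets, 10 hold only Milk, 10 only Eggs, 5 both, and 75 neither.
baskets = (
    [{"Milk"}] * 10 + [{"Eggs"}] * 10 + [{"Milk", "Eggs"}] * 5 + [set()] * 75
)
support, confidence, lift = rule_metrics(baskets, "Milk", "Eggs")
print(f"support={support:.2%} confidence={confidence:.2%} lift={lift:.2f}")
```

The idealized values are a support of 5%, a confidence of about 33.33%, and a lift of about 2.22. The measured 4.85%, 32.79%, and 2.1862 sit close to these, which is what one would expect from a random sample of the configured distribution.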

They are now ready to demo the prototype and apply it to production data with confidence. Their schema in Overseed can evolve as their analysis evolves.

Summary

Using Overseed's multi-value probability specification, the team created an association between Milk and Eggs in the data and then discovered it using their algorithm. The team can now apply this algorithm in the systems where it is helpful, such as:

Ultimately, we saw how a data science team can build a Market Basket Analysis using a synthetic dataset generated by Overseed. The synthetic dataset allowed the team to prototype quickly without being blocked by data access barriers. Moreover, Overseed's versioning feature will enable them to add more attributes and capture edge cases as the project evolves.


We look forward to how you use this feature in the wild. Give it a try and give us feedback.

We are open for early access! Sign up here.



The code for this article is in this notebook. Feel free to copy the notebook to your own Colaboratory Google Workspace. You can run the notebook using the following steps:

  1. Create the schema in this post under a project in your Overseed organization.
  2. Create a static generator of 10,000 rows using the schema.
  3. Replace the dataset URL with your static generator's URL.
  4. Get your Overseed API Token from the Account Settings and enter it into the input box when prompted at run-time.