Introduction
This post introduces the probability matrix specification, which lets users assign a probability to a combination of two or more values. We will use a Market Basket Analysis to demonstrate this feature.
Use Case
Imagine a grocery store that would like to understand the relationship between two items sold in the store. The store would like to know if there is a relationship between Milk and Eggs in customers' purchases.
This relationship will help them gain insight into the following questions:
- Should we place Milk and Eggs closer to each other in the store?
- Can we suggest Eggs as a customer's next purchase if they put Milk in their basket on our online store?
- Can we bundle the Milk and Eggs with a discount to increase the sales of both items, for example, when we have an overstock?
The store tasks the data science team with building an algorithm that reveals the association between Milk and Eggs. Ultimately, the team wants to answer the following question:
- How likely are customers going to buy Eggs if they buy Milk?
The team wants to prototype the algorithm right away. However, the same old question comes up:
How do we get access to this data?
That question always holds back the data science team. Answering it could mean days or weeks of requesting access to databases, getting approvals, or writing scripts to generate mock data. These activities take time away from prototyping quickly and focusing on a working solution.
Overseed to the rescue
With Overseed, they can start right away using synthetic data! The team can build a schema and generate the data they need for their prototype. Moreover, they can update this schema as they uncover edge cases and make changes to their data.
Building the schema
The team designs a schema and configures the values to establish a strong association between Milk and Eggs as follows:
item_1: #SpecFakeType & {
    fakename: "fruit"
}
item_2: #SpecFakeType & {
    fakename: "vegetable"
}
items: #SpecProbabilityMatrix & {
    values: {
        item_3: ["Milk", "Eggs", "Milk", ""]
        item_4: [""    , ""    , "Eggs", ""]
    },
    probabilities: [10, 10, 5, 75]
}
The schema specifies four fields, simulating a market basket of 4 items:
- item_1 specifies the attribute as a fake type fruit. The specification will result in a random fruit value for that field, mimicking a customer placing a random fruit in their market basket.
- item_2 specifies the attribute as a fake type of vegetable. The specification will result in a random vegetable value for that field, mimicking a customer placing a random vegetable in their market basket.
- item_3 and item_4 are specified within a probability matrix specification. Each position in the value lists pairs with the probability at the same position, so the pair "Milk" and "Eggs" has a probability of 5%. This probability indicates that customers will buy Milk and Eggs together 5% of the time. In other words, in a dataset of 100 market baskets, Milk will appear in item_3 and Eggs will appear in item_4 approximately five times.
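To make the reading of the matrix concrete, here is a minimal Python sketch that samples (item_3, item_4) pairs with the same weights as the schema. It is only an illustration of the joint distribution, not Overseed's implementation; the names `OUTCOMES`, `WEIGHTS`, and `sample_baskets` are made up for this example.

```python
import random

# Joint outcomes for (item_3, item_4), read position by position from the schema.
OUTCOMES = [("Milk", ""), ("Eggs", ""), ("Milk", "Eggs"), ("", "")]
WEIGHTS = [10, 10, 5, 75]  # percentages from the schema; they sum to 100


def sample_baskets(n, seed=0):
    """Draw n (item_3, item_4) pairs according to the joint weights.

    Hypothetical helper for illustration; Overseed may generate data differently.
    """
    rng = random.Random(seed)
    return rng.choices(OUTCOMES, weights=WEIGHTS, k=n)


baskets = sample_baskets(10_000)
pair_share = sum(1 for b in baskets if b == ("Milk", "Eggs")) / len(baskets)
print(f"Milk+Eggs share: {pair_share:.3f}")  # close to 0.05; varies with the seed
```

Because the pair is sampled as a single outcome, Milk and Eggs land in the same basket far more often than they would if each field were drawn independently.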
Generating the data
Now that the team has configured the association between Milk and Eggs, they can generate a synthetic dataset for their algorithm to analyze. They input the schema to Overseed and generate a CSV file. If the team chooses to make this a real-time system in the future, they could use a stream.
Connecting to the data
The team decided to build the prototype in a Jupyter Notebook. In the notebook, the team calls the Overseed API and retrieves the generated dataset.
Next, the notebook displays the counts for the values in the synthetic dataset and a histogram for the Milk and Eggs values, as shown in the image below.

The histogram shows that Milk and Eggs on their own are about equally frequent, while the combination of the two appears about half as often. The team knows the Apriori algorithm should find a strong relationship between the two items!
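The counts behind such a histogram could be computed along these lines. This is a sketch with a handful of hypothetical rows standing in for the generated CSV; the notebook's actual code may differ, and `label` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical rows shaped like the generated data: (item_3, item_4).
rows = [
    ("Milk", ""),
    ("Eggs", ""),
    ("Milk", "Eggs"),
    ("", ""),
    ("", ""),
    ("Milk", ""),
]


def label(row):
    """Name a basket by its non-empty items, e.g. 'Milk + Eggs'."""
    items = [value for value in row if value]
    return " + ".join(items) if items else "(neither)"


# One count per combination: Milk alone, Eggs alone, both, or neither.
counts = Counter(label(row) for row in rows)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```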
Finally, they calculate the association between all the items in the dataset. The team is interested in only the Milk and Eggs relationship, so they output that result as shown in the image below.
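For a single rule such as Milk -> Eggs, the three standard association metrics (support, confidence, and lift) can be computed directly. The sketch below is not the full Apriori algorithm and is not taken from the notebook; `rule_metrics` and the toy baskets are assumptions made for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes at least one basket contains each item; otherwise this divides by zero.
    """
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent in b)
    has_c = sum(1 for b in baskets if consequent in b)
    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = has_both / n               # share of baskets with both items
    confidence = has_both / has_a        # share of antecedent baskets with consequent
    lift = confidence / (has_c / n)      # confidence relative to the consequent's base rate
    return support, confidence, lift


# Hypothetical baskets in the schema's proportions (10% / 10% / 5% / 75% of 20).
baskets = [{"Milk"}] * 2 + [{"Eggs"}] * 2 + [{"Milk", "Eggs"}] * 1 + [set()] * 15

support, confidence, lift = rule_metrics(baskets, "Milk", "Eggs")
print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent.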

Success! The team was able to capture the association between Milk and Eggs as configured in the synthetic dataset. They can now conclude the following from their output:
- Milk and Eggs are bought together by about 4.85% of our customers.
- Customers who buy Milk also buy Eggs about 32.79% of the time.
- Customers who just bought Milk are about 2.19 times as likely to purchase Eggs as customers overall (a lift of about 218.62%).
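These figures can be sanity-checked against the probabilities configured in the schema. The derivation below assumes that Milk and Eggs only ever come from the probability matrix (that is, the random fruit and vegetable fields never produce them) and uses the usual definitions of support, confidence, and lift.

```latex
\begin{align*}
P(\text{Milk}) &= 0.10 + 0.05 = 0.15, \qquad P(\text{Eggs}) = 0.10 + 0.05 = 0.15 \\
\text{support}(\text{Milk} \Rightarrow \text{Eggs}) &= P(\text{Milk} \cap \text{Eggs}) = 0.05 \\
\text{confidence}(\text{Milk} \Rightarrow \text{Eggs}) &= \frac{P(\text{Milk} \cap \text{Eggs})}{P(\text{Milk})} = \frac{0.05}{0.15} \approx 0.333 \\
\text{lift}(\text{Milk} \Rightarrow \text{Eggs}) &= \frac{\text{confidence}(\text{Milk} \Rightarrow \text{Eggs})}{P(\text{Eggs})} \approx \frac{0.333}{0.15} \approx 2.22
\end{align*}
```

The measured values of 4.85%, 32.79%, and 218.62% sit close to these expected values of 5%, 33.3%, and 222%, with the gap explained by sampling noise in the generated dataset.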
They are now ready to demo the prototype and apply it to production data with confidence. Their schema in Overseed can evolve as their analysis evolves.
Summary
Using Overseed's multi-value probability specification, the team created an association between Milk and Eggs in the data and then discovered it using their algorithm. The team can now apply this algorithm in the systems where it is helpful, such as:
- Online store basket recommendations
- Store product placement
- Discount Bundle campaigns
Ultimately, we saw how a data science team can build a Market Basket Analysis using a synthetic dataset generated by Overseed. The synthetic dataset allowed the team to prototype quickly without being blocked by data access barriers. Moreover, Overseed's versioning feature will enable them to add more attributes and capture edge cases as the project evolves.
We look forward to seeing how you use this feature in the wild. Give it a try and send us your feedback.
The code for this article is in this notebook. Feel free to copy the notebook to your own Colaboratory Google Workspace. You can run the notebook using the following steps:
- Create the schema in this post under a project in your Overseed organization.
- Create a static generator of 10,000 rows using the schema.
- Replace the dataset URL with your static generator's URL.
- Get your Overseed API Token from the Account Settings and enter it into the input box when prompted at run-time.