At Overseed, we aim to help developers demonstrate, test, and validate data applications. This post will look at building a synthetic dataset for linear regression analyses.
Linear regression analyses model the relationship between input variables and an outcome. We can use this model to create a function that predicts the outcome for a given input. Designing a synthetic dataset with "baked-in" properties allows developers to examine and demonstrate the model's behavior. A synthetic dataset provides benefits such as:
- Trial multiple scenarios: Build multiple synthetic datasets representing different possible outcomes.
- Run integration tests: Ensure the model returns the expected results and avoids regression bugs.
- Demonstrate applications: Demo an application with data targeting an "aha" moment.
Use Case
Assume we want to analyze the effects of exercise and weight on happiness. We collect data using the following questions:
- How happy do you feel this week on a scale of 1-10?
- How many hours of exercise have you completed this week?
- What is your weight this week?
Ultimately, we want to analyze how exercise hours and body weight relate to an individual's happiness, and then build a model that can predict happiness given a person's weight and exercise hours for a given week. A synthetic dataset can help us prototype and build this model without the barriers that come with real data.
To build our synthetic dataset, we first define its coefficients. These coefficients are the betas, and they function like sliders that let us build a dataset with the outcome we want.
- β1: Average happiness of our users.
- β2: Additional change in happiness from each hour exercised in a week.
- β3: Change in happiness that comes from not being overweight.
Second, we create a formula that relates our variables (exercise hours and weight) to our outcome (happiness).
Happiness = β1 + β2 * Hours Exercised + β3 * Is Not Overweight
The above formula allows us to input different values for the betas and so determine the coefficients that a linear regression model will output. Built-in coefficients mean we know the approximate answer to the model before we run it! Let's use Overseed to build a dataset using the above formula and then confirm our baked-in coefficients in a Python Jupyter Notebook.
Overseed at work
First, we define the following values for our betas:
- We set β1 to 7, setting average happiness to 7 on a 1-10 scale.
- We set β2 to 0.3, increasing happiness by that amount for each additional hour of exercise.
- We set β3 to 0.5, increasing happiness by that amount for people who are not overweight.
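To make the formula concrete, here is a minimal Python sketch that evaluates it for one person using the beta values above. The function and constant names are illustrative rather than part of Overseed, and the weight indicator is assumed to be 1 for someone who is not overweight and 0 otherwise.

```python
# Beta values chosen in the post.
AVERAGE_HAPPINESS = 7.0       # intercept
EXERCISE_HOUR_BOOST = 0.3     # added per hour of exercise
HEALTHY_WEIGHT_BOOST = 0.5    # added when the person is not overweight


def expected_happiness(hours_exercised, not_overweight):
    """Evaluate the happiness formula for one person (illustrative helper)."""
    indicator = 1 if not_overweight else 0
    return (AVERAGE_HAPPINESS
            + EXERCISE_HOUR_BOOST * hours_exercised
            + HEALTHY_WEIGHT_BOOST * indicator)


# Three hours of exercise, not overweight: 7 + 0.3 * 3 + 0.5, roughly 8.4.
print(round(expected_happiness(3, True), 2))
```

Changing any of the three constants shifts every prediction, which is the "slider" behavior described above.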
Second, we generate random values for our variables using data distributions.
- We use a Beta distribution to get the baseline happiness for each user, centering our average happiness at 7.
- We use a Poisson distribution to generate the number of hours each person exercises.
- We use a Probability distribution to split the population into normal and non-normal weight (underweight, overweight, and so on).
Next, we calculate happiness in our Overseed schema to get the desired dataset.
The schema below embodies the above definitions.
// betas (coefficients)
average_happiness: 7
excercise_hour_boost: 0.3
healthy_weight_boost: 0.5

// exercise hours (Poisson distribution)
excercise_hours_value: #SpecPoisson & {
    lambda: 1
}
excercise_hours: excercise_hours_value.eval + 1

// happiness (Beta distribution)
happiness_value: #SpecBeta & {
    alpha: 7
    beta: 3
}
happiness_rating: happiness_value.eval * 10

// healthy_weight
healthy_weight: #SpecProbability & {
    values: [1, 0]
    probabilities: [50, 50]
}

happiness: happiness_rating + (excercise_hour_boost * excercise_hours) + (healthy_weight_boost * healthy_weight.eval)
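As a quick sanity check on these parameters, the sketch below works out the mean each distribution should produce. It assumes the standard definitions (a Beta(alpha, beta) variable has mean alpha / (alpha + beta), and a Poisson variable has mean lambda) and assumes the two probabilities act as relative weights. The variable names are illustrative.

```python
# Happiness baseline: Beta(7, 3) scaled by 10.
alpha, beta = 7, 3
baseline_mean = 10 * alpha / (alpha + beta)    # Beta mean 0.7, scaled to 7.0

# Exercise hours: Poisson(1) plus the constant 1 the schema adds to each draw.
poisson_lambda = 1
exercise_hours_mean = poisson_lambda + 1

# Healthy weight: values [1, 0] with weights [50, 50].
healthy_share = 50 / (50 + 50)

# Mean of the final happiness column, using the boosts 0.3 and 0.5.
overall_mean = baseline_mean + 0.3 * exercise_hours_mean + 0.5 * healthy_share
print(baseline_mean, exercise_hours_mean, healthy_share, round(overall_mean, 2))
```

Under these assumptions the regression intercept should land near 7, while the average of the happiness column itself should sit closer to 7.85, because that column also includes the average exercise and weight boosts.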
Finally, in our Python Jupyter Notebook, we calculate the coefficients using linear regression. The result from our analysis is as follows:

Our results show that our synthetic dataset has been randomly generated but retains the following properties we set!
- Average happiness is our y-intercept, which stands at 6.93 (~7)
- Hours exercised stands at a coefficient of 0.31 (~0.3)
- Not Overweight returns a coefficient of 0.55 (~0.5)
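The notebook itself is not reproduced in this post. As a rough illustration of the check it performs, the sketch below simulates rows locally with the same distributions (a stand-in for the Overseed CSV, not Overseed's generator) and fits ordinary least squares by solving the normal equations with only the Python standard library. All function names are hypothetical, and the exact estimates depend on the random seed.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_rows(n, seed=0):
    """Return (exercise_hours, healthy_weight, happiness) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        baseline = rng.betavariate(7, 3) * 10      # centered on 7
        hours = sample_poisson(rng, 1) + 1         # Poisson(1) shifted by 1
        healthy = 1 if rng.random() < 0.5 else 0   # 50/50 split
        rows.append((hours, healthy, baseline + 0.3 * hours + 0.5 * healthy))
    return rows


def fit_ols(rows):
    """Solve (X'X) b = X'y; returns [intercept, hours_coef, healthy_coef]."""
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for hours, healthy, y in rows:
        x = (1.0, hours, healthy)
        for i in range(3):
            xty[i] += x[i] * y
            for j in range(3):
                xtx[i][j] += x[i] * x[j]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


intercept, hours_coef, healthy_coef = fit_ols(simulate_rows(10_000))
print(round(intercept, 2), round(hours_coef, 2), round(healthy_coef, 2))
```

With 10,000 rows the three estimates should fall close to 7, 0.3, and 0.5, though not exactly on them, in the same way the notebook's results do.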
Summary
We can design a dataset using Overseed's distribution specifications and arithmetic operators to generate a randomized dataset of any size with the underlying properties we want. We can then use this synthetic dataset in multiple ways:
- Demonstrate an application that scores user happiness given exercise hours and weight for a given week.
- Include the dataset in our Continuous Integration pipeline to test our algorithm against known expected outputs and catch regression bugs.
- Build multiple synthetic datasets with different beta values to cover the different scenarios and edge cases we may encounter in our production data.
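For the Continuous Integration use, one possible shape for such a check is sketched below. It compares estimated coefficients with the baked-in values and reports any that drift too far. The helper name, the dictionary keys, and the 0.1 tolerance are assumptions made for illustration; the estimates are the ones reported in this post.

```python
def coefficients_out_of_tolerance(estimated, expected, tolerance=0.1):
    """Return names of coefficients that differ from their baked-in value by more than tolerance."""
    return [name for name, target in expected.items()
            if abs(estimated[name] - target) > tolerance]


# Baked-in betas and the estimates reported earlier in this post.
EXPECTED = {"intercept": 7.0, "exercise_hours": 0.3, "healthy_weight": 0.5}
REPORTED = {"intercept": 6.93, "exercise_hours": 0.31, "healthy_weight": 0.55}

drifted = coefficients_out_of_tolerance(REPORTED, EXPECTED)
assert drifted == [], f"coefficients drifted: {drifted}"
```

A tighter or looser tolerance may suit a different dataset size, since smaller datasets give noisier estimates.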
To run the associated notebook, sign up here.
-----
The code for this article is in this notebook. Feel free to copy the notebook to your own Colaboratory Google Workspace. You can run the notebook using the following steps:
- Create the schema in this post under a project in your Overseed organization.
- Create a static generator of 10,000 rows using the schema.
- Get your Overseed API Token from the Account Settings and enter it into the input box when prompted at run-time.
- Get the CSV URL for your static generator and enter it into the input box when prompted at run-time.