Synthetic Data: The $8 Billion Business of Making Up (Real) Data
Nvidia just paid $320 million for a company that generates fake data. And it makes perfect sense.
The synthetic data market for AI just crossed $2.4 billion. By 2029, it’ll hit $8 billion. Nvidia bought Gretel Labs for $320 million. And most data professionals still don’t know what synthetic data actually is.
Let’s fix that.
What is synthetic data?
Synthetic data is artificially generated data that mimics the statistical properties of real data, without containing actual information about real people or systems.
Simple example: you have a database of 10,000 patients with their diagnoses, treatments, and outcomes. You can’t share it with external researchers because it contains protected health information. But you can generate 100,000 “synthetic” patients that have the same distributions, correlations, and patterns as the real ones—without any of them actually existing.
Researchers can train models on that synthetic data. The models learn the same patterns they’d learn from real data. And nobody’s privacy has been violated.
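A toy version of that idea, stripped to its core: estimate the mean and covariance of a few numeric columns from the real table, then sample new rows from that fitted distribution. Everything below (column names, numbers) is invented for illustration; real tools handle mixed data types, constraints, and privacy guarantees far more carefully.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the real table: 10,000 patients,
# columns = [age, systolic_bp, cholesterol] (all invented numbers).
real = rng.multivariate_normal(
    mean=[55.0, 130.0, 200.0],
    cov=[[120.0, 40.0, 30.0],
         [40.0, 250.0, 60.0],
         [30.0, 60.0, 900.0]],
    size=10_000,
)

# "Train" the generator: estimate mean and covariance from the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Sample 100,000 synthetic patients. None of these rows corresponds
# to a real person, but the means and correlations closely match.
synthetic = rng.multivariate_normal(mu, sigma, size=100_000)

print(np.corrcoef(real, rowvar=False).round(2))
print(np.corrcoef(synthetic, rowvar=False).round(2))
```

The two printed correlation matrices should be nearly identical: that's the whole trick, the patterns survive even though the individual rows are made up.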
Why it’s exploding now
1. Regulation is tightening
GDPR in Europe, HIPAA in healthcare, financial regulations… it’s getting harder (and riskier) to use real personal data for training models.
Fines are massive. Reputational risks are worse. Companies need alternatives.
Synthetic data is that alternative: you can train models without touching real personal data.
2. There isn’t enough real data
Sounds counterintuitive in the “big data” era, but for many use cases, there simply isn’t enough real data.
Want to train a fraud detection model? 99.9% of transactions are legitimate. You have billions of “not fraud” examples and a few thousand “fraud” ones. With that imbalance, the model barely sees fraud and struggles to learn what it looks like.
Solution? Generate synthetic fraudulent transactions that mimic real fraud patterns. Now you have a balanced dataset.
Same applies to: rare diseases, extreme events, crisis scenarios, edge cases of any kind.
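One crude way to do that rebalancing, as a sketch only: bootstrap the few fraud rows and add small Gaussian noise, a jittered resample that is similar in spirit to (but much cruder than) techniques like SMOTE. All numbers and features below are invented; real fraud features would need a real generative model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented feature matrix: 5,000 legit transactions, 50 fraudulent ones.
# Columns = [amount, risk_score], values are made up.
legit = rng.normal(loc=[50.0, 1.0], scale=[20.0, 0.5], size=(5_000, 2))
fraud = rng.normal(loc=[400.0, 4.0], scale=[80.0, 1.0], size=(50, 2))

def oversample_with_jitter(minority, n_target, rng):
    """Resample minority rows with replacement, then jitter each
    feature with noise proportional to its observed std."""
    idx = rng.integers(0, len(minority), size=n_target)
    noise = rng.normal(0.0, 0.1 * minority.std(axis=0),
                       size=(n_target, minority.shape[1]))
    return minority[idx] + noise

synthetic_fraud = oversample_with_jitter(fraud, n_target=len(legit), rng=rng)

# Stack into a roughly balanced training set.
X = np.vstack([legit, fraud, synthetic_fraud])
y = np.concatenate([np.zeros(len(legit)),
                    np.ones(len(fraud) + len(synthetic_fraud))])
print(f"fraud share before: {50 / 5_050:.1%}, after: {y.mean():.1%}")
```

The fraud share jumps from about 1% to about 50%, which is the point: the classifier now sees both classes in comparable volume.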
3. Generative models got really good
The technology to generate high-quality synthetic data barely existed in usable form five years ago. Now it does.
GANs (Generative Adversarial Networks), VAEs (Variational Autoencoders), and more recently diffusion models can generate tabular data, images, time series, and text that are statistically indistinguishable from real data.
Quality has reached the point where synthetic data is genuinely useful, not just an academic experiment.
Real-world use cases
Healthcare
The most obvious sector. Medical data is extremely sensitive and extremely valuable for training AI.
With synthetic data you can:
- Share datasets between hospitals without moving patient data
- Train diagnostic models on populations you don’t have in your hospital
- Test systems before deploying them with real data
Finance
Fraud detection, credit scoring, risk analysis. All of this requires data that banks can’t freely share.
Synthetic data enables:
- Training models internally without exposing customer data
- Sharing datasets with external vendors securely
- Generating stress scenarios that have never happened (but could)
Automotive
Self-driving cars need millions of hours of driving data to train. Including dangerous scenarios you can’t create in real life.
Simulators generate synthetic driving data: pedestrians crossing, extreme weather, mechanical failures. The car “learns” to react without putting anyone in danger.
Software testing
Need to test your app with 10 million users? You don’t have 10 million real users. But you can generate 10 million synthetic users with realistic behaviors.
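A stdlib-only sketch of that idea (every field name and value pool below is invented): generate as many synthetic users as you want from a seeded random stream, so test runs are reproducible.

```python
import itertools
import random

FIRST = ["Ana", "Luis", "Marta", "Jon", "Sara", "Iker"]
LAST = ["Garcia", "Etxeberria", "Lopez", "Agirre", "Martin"]
PLANS = ["free", "pro", "enterprise"]

def synthetic_users(seed=0):
    """Yield an endless stream of fake-but-plausible user records."""
    rng = random.Random(seed)
    for uid in itertools.count(1):
        first, last = rng.choice(FIRST), rng.choice(LAST)
        yield {
            "id": uid,
            "name": f"{first} {last}",
            "email": f"{first.lower()}.{last.lower()}{uid}@example.com",
            # Most users on the free plan, few on enterprise.
            "plan": rng.choices(PLANS, weights=[70, 25, 5])[0],
            "sessions_last_30d": max(0, int(rng.gauss(12, 8))),
        }

# Take 5 for a smoke test; swap 5 for 10_000_000 in a real load test.
users = list(itertools.islice(synthetic_users(), 5))
for u in users:
    print(u)
```

Because it's a generator, nothing is held in memory: ten million users stream out one at a time, which is exactly what a load test needs.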
The limitations (that nobody mentions)
Quality depends on the original data
If your real data has biases, your synthetic data will inherit those biases. It’s not magic: it’s statistics.
A generator trained on biased data produces biased synthetic data. The problem doesn’t disappear, it just gets hidden.
They’re not perfect for everything
Synthetic data captures statistical patterns. It doesn’t capture unique cases, genuine outliers, or complex causal relationships.
For some use cases (detecting real anomalies, for example), you need real data. Synthetic generators can’t invent what they’ve never seen.
Validation is tricky
How do you know your synthetic data is “good enough”? It’s not trivial. You need quality metrics, comparisons with real holdouts, and lots of experimentation.
There are tools for this, but they require expertise.
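A minimal fidelity check along those lines, as a sketch only (real evaluation uses many more metrics, plus a holdout set and privacy tests): compare per-column means, stds, and the correlation matrix between real and synthetic tables. The data here is invented to make the failure mode visible.

```python
import numpy as np

def fidelity_report(real, synthetic):
    """Compare basic statistics of two numeric arrays with matching columns.
    Small gaps are necessary (not sufficient!) evidence of fidelity."""
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    return {"mean_gap": mean_gap, "std_gap": std_gap, "corr_gap": corr_gap}

rng = np.random.default_rng(7)
# "Real" data: two correlated columns.
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=20_000)

# A good generator reproduces the correlation; a bad one only
# matches the marginals and loses the relationship between columns.
good = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=20_000)
bad = rng.normal(size=(20_000, 2))

print(fidelity_report(real, good))  # all gaps small
print(fidelity_report(real, bad))   # corr_gap large: looks fine per column, isn't
```

Note the trap the second case illustrates: synthetic data can pass every per-column check and still have destroyed the correlations your model actually needs.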
Tools to get started
Gretel.ai: now owned by Nvidia. Cloud platform for generating tabular and time series synthetic data. They have a free tier.
Synthetic Data Vault (SDV): open source Python library. You can generate synthetic tabular data with a few lines of code. Good entry point.
CTGAN: model specifically for tabular data. Part of the SDV ecosystem but can be used independently.
Faker: not AI, but useful for generating quick test data (fake names, addresses, emails). Handy for basic testing.
Mostly AI: Gretel competitor, focused on privacy and compliance.
Why Nvidia paid $320 million
Nvidia isn’t stupid. They see where the market is heading.
Training AI models requires data. Real data is getting harder to obtain legally. Synthetic data is the solution.
If Nvidia controls synthetic data generation in addition to the hardware for training models, they control more links in the AI value chain.
It’s a strategic bet. And probably a good one.
My take
Synthetic data isn’t the future. It’s the present. If you work with data and don’t have this on your radar, you’re missing an important tool.
It won’t replace real data for everything. But for many use cases (privacy, scarcity, class balancing), it’s the best option available.
My recommendation: download SDV, generate synthetic data from some dataset you have, and compare. You’ll learn more in an afternoon of experimentation than reading ten articles.