Synthetic Data: The $8 Billion Business of Making Up (Real) Data
Nvidia just paid $320 million for a company that generates fake data. And it makes perfect sense.
The synthetic data market for AI just crossed $2.4 billion. By 2029, it’ll hit $8 billion. Nvidia bought Gretel Labs for $320 million. And most data professionals still don’t know what synthetic data actually is.
Let’s fix that.
What is synthetic data?
Synthetic data is artificially generated data that mimics the statistical properties of real data, without containing actual information about real people or systems.
Simple example: you have a database of 10,000 patients with their diagnoses, treatments, and outcomes. You can’t share it with external researchers because it contains protected health information. But you can generate 100,000 “synthetic” patients that have the same distributions, correlations, and patterns as the real ones—without any of them actually existing.
Researchers can train models on that synthetic data. The models learn the same patterns they’d learn from real data. And nobody’s privacy has been violated.
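A toy version of that idea, stripped to its core: estimate the mean and covariance of a few numeric columns from the real table, then sample new rows from that fitted distribution. Everything below (column names, numbers) is invented for illustration; real tools handle mixed data types, constraints, and privacy guarantees far more carefully.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the real table: 10,000 patients,
# columns = [age, systolic_bp, cholesterol] (all invented numbers).
real = rng.multivariate_normal(
    mean=[55.0, 130.0, 200.0],
    cov=[[120.0, 40.0, 30.0],
         [40.0, 250.0, 60.0],
         [30.0, 60.0, 900.0]],
    size=10_000,
)

# "Train" the generator: estimate mean and covariance from the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Sample 100,000 synthetic patients. None of these rows corresponds
# to a real person, but the means and correlations closely match.
synthetic = rng.multivariate_normal(mu, sigma, size=100_000)

print(np.corrcoef(real, rowvar=False).round(2))
print(np.corrcoef(synthetic, rowvar=False).round(2))
```

The two printed correlation matrices should be nearly identical: that's the whole trick, the patterns survive even though the individual rows are made up.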
Why it’s exploding now
1. Regulation is tightening
GDPR in Europe, HIPAA in healthcare, financial regulations… it’s getting harder (and riskier) to use real personal data for training models.
Fines are massive. Reputational risks are worse. Companies need alternatives.
Synthetic data is that alternative: you can train models without touching real personal data.
2. There isn’t enough real data
Sounds counterintuitive in the “big data” era, but for many use cases, there simply isn’t enough real data.
Want to train a fraud detection model? 99.9% of transactions are legitimate. You have billions of “not fraud” examples and a few thousand “fraud” ones. With that imbalance, the model barely sees fraud and struggles to learn what it looks like.
Solution? Generate synthetic fraudulent transactions that mimic real fraud patterns. Now you have a balanced dataset.
Same applies to: rare diseases, extreme events, crisis scenarios, edge cases of any kind.
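One crude way to do that rebalancing, as a sketch only: bootstrap the few fraud rows and add small Gaussian noise, a jittered resample that is similar in spirit to (but much cruder than) techniques like SMOTE. All numbers and features below are invented; real fraud features would need a real generative model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented feature matrix: 5,000 legit transactions, 50 fraudulent ones.
# Columns = [amount, risk_score], values are made up.
legit = rng.normal(loc=[50.0, 1.0], scale=[20.0, 0.5], size=(5_000, 2))
fraud = rng.normal(loc=[400.0, 4.0], scale=[80.0, 1.0], size=(50, 2))

def oversample_with_jitter(minority, n_target, rng):
    """Resample minority rows with replacement, then jitter each
    feature with noise proportional to its observed std."""
    idx = rng.integers(0, len(minority), size=n_target)
    noise = rng.normal(0.0, 0.1 * minority.std(axis=0),
                       size=(n_target, minority.shape[1]))
    return minority[idx] + noise

synthetic_fraud = oversample_with_jitter(fraud, n_target=len(legit), rng=rng)

# Stack into a roughly balanced training set.
X = np.vstack([legit, fraud, synthetic_fraud])
y = np.concatenate([np.zeros(len(legit)),
                    np.ones(len(fraud) + len(synthetic_fraud))])
print(f"fraud share before: {50 / 5_050:.1%}, after: {y.mean():.1%}")
```

The fraud share jumps from about 1% to about 50%, which is the point: the classifier now sees both classes in comparable volume.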
3. Generative models got really good
The technology to generate high-quality synthetic data barely existed in usable form five years ago. Now it does.
GANs (Generative Adversarial Networks), VAEs (Variational Autoencoders), and more recently diffusion models can generate tabular data, images, time series, and text that are statistically indistinguishable from real data.
Quality has reached the point where synthetic data is genuinely useful, not just an academic experiment.
Real-world use cases
Healthcare
The most obvious sector. Medical data is extremely sensitive and extremely valuable for training AI.
With synthetic data you can:
- Share datasets between hospitals without moving patient data
- Train diagnostic models on populations you don’t have in your hospital
- Test systems before deploying them with real data
Finance
Fraud detection, credit scoring, risk analysis. All of this requires data that banks can’t freely share.
Synthetic data enables:
- Training models internally without exposing customer data
- Sharing datasets with external vendors securely
- Generating stress scenarios that have never happened (but could)
Automotive
Self-driving cars need millions of hours of driving data to train. Including dangerous scenarios you can’t create in real life.
Simulators generate synthetic driving data: pedestrians crossing, extreme weather, mechanical failures. The car “learns” to react without putting anyone in danger.
Software testing
Need to test your app with 10 million users? You don’t have 10 million real users. But you can generate 10 million synthetic users with realistic behaviors.
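A stdlib-only sketch of that idea (every field name and value pool below is invented): generate as many synthetic users as you want from a seeded random stream, so test runs are reproducible.

```python
import itertools
import random

FIRST = ["Ana", "Luis", "Marta", "Jon", "Sara", "Iker"]
LAST = ["Garcia", "Etxeberria", "Lopez", "Agirre", "Martin"]
PLANS = ["free", "pro", "enterprise"]

def synthetic_users(seed=0):
    """Yield an endless stream of fake-but-plausible user records."""
    rng = random.Random(seed)
    for uid in itertools.count(1):
        first, last = rng.choice(FIRST), rng.choice(LAST)
        yield {
            "id": uid,
            "name": f"{first} {last}",
            "email": f"{first.lower()}.{last.lower()}{uid}@example.com",
            # Most users on the free plan, few on enterprise.
            "plan": rng.choices(PLANS, weights=[70, 25, 5])[0],
            "sessions_last_30d": max(0, int(rng.gauss(12, 8))),
        }

# Take 5 for a smoke test; swap 5 for 10_000_000 in a real load test.
users = list(itertools.islice(synthetic_users(), 5))
for u in users:
    print(u)
```

Because it's a generator, nothing is held in memory: ten million users stream out one at a time, which is exactly what a load test needs.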
The limitations (that nobody mentions)
Quality depends on the original data
If your real data has biases, your synthetic data will inherit those biases. It’s not magic: it’s statistics.
A generator trained on biased data produces biased synthetic data. The problem doesn’t disappear, it just gets hidden.
They’re not perfect for everything
Synthetic data captures statistical patterns. It doesn’t capture unique cases, genuine outliers, or complex causal relationships.
For some use cases (detecting real anomalies, for example), you need real data. Synthetic generators can’t invent what they’ve never seen.
Validation is tricky
How do you know your synthetic data is “good enough”? It’s not trivial. You need quality metrics, comparisons with real holdouts, and lots of experimentation.
There are tools for this, but they require expertise.
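A minimal fidelity check along those lines, as a sketch only (real evaluation uses many more metrics, plus a holdout set and privacy tests): compare per-column means, stds, and the correlation matrix between real and synthetic tables. The data here is invented to make the failure mode visible.

```python
import numpy as np

def fidelity_report(real, synthetic):
    """Compare basic statistics of two numeric arrays with matching columns.
    Small gaps are necessary (not sufficient!) evidence of fidelity."""
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    ).max()
    return {"mean_gap": mean_gap, "std_gap": std_gap, "corr_gap": corr_gap}

rng = np.random.default_rng(7)
# "Real" data: two correlated columns.
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=20_000)

# A good generator reproduces the correlation; a bad one only
# matches the marginals and loses the relationship between columns.
good = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=20_000)
bad = rng.normal(size=(20_000, 2))

print(fidelity_report(real, good))  # all gaps small
print(fidelity_report(real, bad))   # corr_gap large: looks fine per column, isn't
```

Note the trap the second case illustrates: synthetic data can pass every per-column check and still have destroyed the correlations your model actually needs.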
Tools to get started
Gretel.ai: now owned by Nvidia. Cloud platform for generating tabular and time series synthetic data. They have a free tier.
Synthetic Data Vault (SDV): open source Python library. You can generate synthetic tabular data with a few lines of code. Good entry point.
CTGAN: model specifically for tabular data. Part of the SDV ecosystem but can be used independently.
Faker: not AI, but useful for generating quick test data (fake names, addresses, emails). Handy for basic testing.
Mostly AI: Gretel competitor, focused on privacy and compliance.
Why Nvidia paid $320 million
Nvidia isn’t stupid. They see where the market is heading.
Training AI models requires data. Real data is getting harder to obtain legally. Synthetic data is the solution.
If Nvidia controls synthetic data generation in addition to the hardware for training models, they control more links in the AI value chain.
It’s a strategic bet. And probably a good one.
My take
Synthetic data isn’t the future. It’s the present. If you work with data and don’t have this on your radar, you’re missing an important tool.
It won’t replace real data for everything. But for many use cases (privacy, scarcity, class balancing), it’s the best option available.
My recommendation: download SDV, generate synthetic data from some dataset you have, and compare. You’ll learn more in an afternoon of experimentation than reading ten articles.