Data-Centric AI: why more data doesn't mean better models
TL;DR
- Data-Centric AI: improve data > improve models (Andrew Ng)
- More data can be worse: accumulated noise, amplified bias, unnecessary cost
- The 3 C’s: Curation, Consistency, Contextualization
- Key tools: Great Expectations, Evidently AI, DVC, Cleanlab
- A simple model with excellent data beats a complex model with mediocre data
For years, the recipe for machine learning success seemed simple: get more data, train bigger models, get better results. It was the era of “big data solves everything.”
That era is over.
The new paradigm: Data-Centric AI
Andrew Ng, co-founder of Google Brain and Coursera, has been evangelizing this concept for years. The central idea is provocative: instead of obsessing over increasingly complex model architectures, we should invest that energy in improving our data quality.
It’s not that models don’t matter. It’s that we’ve reached a point of diminishing returns. The difference between a good model and an excellent one is no longer in adding more layers or parameters. It’s in the data we feed it.
This connects directly to something I wrote about the garbage data problem in companies: 90% of data is unstructured and almost nobody knows how to process it.
Why more data can be worse
Sounds counterintuitive, but it makes sense when you think about it:
Accumulated noise. More data means more chances to include incorrect examples, wrong labels, or edge cases that confuse the model. A dataset of 1 million records with 5% errors has 50,000 problems. One with 10 million at the same rate has 500,000.
Amplified bias. If your data source has biases, scaling only amplifies those biases. It doesn’t dilute them.
Computational cost. Training with unnecessary or redundant data is throwing away money and energy. Literally.
Overfitting to spurious patterns. With enough noisy data, the model can find correlations that don’t exist in the real world.
The three C’s of Data-Centric AI
1. Curation
Not all data deserves to enter your training dataset. Curation means actively selecting what to include and what to discard.
This requires understanding your domain. A data scientist who doesn’t know the business cannot curate effectively. It’s impossible to distinguish signal from noise without context.
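As a rough illustration of what curation looks like in practice, here is a minimal pandas sketch. The file, column names, and thresholds are hypothetical; the real rules would come from your domain knowledge.

```python
import pandas as pd

# Hypothetical curation step: the file, columns, and thresholds are
# illustrative. The point is that the rules encode domain knowledge.
raw = pd.read_csv("orders_raw.csv")

curated = (
    raw
    .drop_duplicates(subset=["order_id"])   # exact duplicates add no new signal
    .query("0 < amount < 100000")           # domain rule: plausible order amounts
    .dropna(subset=["country"])             # rows we cannot interpret get discarded
)

print(f"Kept {len(curated)} of {len(raw)} rows after curation")
```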
2. Consistency
Inconsistent labels are the silent killer of ML models. If two annotators label the same case differently, you’re introducing noise that no model can resolve.
The solution isn’t more data. It’s improving annotation guidelines, measuring inter-annotator agreement, and resolving ambiguities before they reach the model.
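One concrete way to measure consistency is chance-corrected agreement. A minimal sketch using scikit-learn's Cohen's kappa (the labels below are made up for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative only: labels from two annotators on the same 8 examples.
annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam"]

# Cohen's kappa corrects raw agreement for chance; low values usually mean
# the labeling guidelines need work before you collect more data.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
```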
3. Contextualization
Data without context is dangerous. Is that spike in sales real or a system error? Does that sensor anomaly indicate a problem or a legitimate outlier?
Documenting your data’s context—how it was collected, what the fields mean, what limitations it has—is as important as the data itself.
Tools for implementing Data-Centric AI
Great Expectations: Open-source framework for data validation. Define “expectations” about your data (this column should never be null, this value should be between X and Y) and the system verifies them automatically.
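A rough sketch of the idea using the classic pandas-flavored API (entry points have changed noticeably between Great Expectations versions, so check the docs for yours; the dataset and columns are hypothetical):

```python
import great_expectations as ge
import pandas as pd

# Illustrative expectations on a hypothetical dataset; the validation
# result's "success" flag can gate a pipeline step.
df = ge.from_pandas(pd.read_csv("orders_raw.csv"))

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

results = df.validate()
print(results["success"])  # False if any expectation failed
```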
Evidently AI: Data drift monitoring and model quality in production. Detects when your data starts diverging from expected before the model fails.
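A minimal sketch with Evidently's Report API (module paths have moved between releases, and the file names here are placeholders):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder files: reference_df is what the model was trained on,
# current_df is what the pipeline saw today.
reference_df = pd.read_csv("orders_last_month.csv")
current_df = pd.read_csv("orders_today.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("data_drift_report.html")  # review before the model degrades
```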
Label Studio: Annotation platform that facilitates consistency between annotators and allows iterating on labeling guides.
DVC (Data Version Control): Git for data. Version your datasets just like you version code. Fundamental for reproducibility.
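Most day-to-day DVC usage happens on the command line (dvc add, dvc push), but its Python API gives a feel for the reproducibility win. The repo URL, path, and tag below are placeholders:

```python
import pandas as pd
import dvc.api

# Load the exact dataset version a model was trained on; repo, path,
# and rev are placeholders for illustration.
with dvc.api.open(
    "data/training_set.csv",
    repo="https://github.com/your-org/your-data-repo",
    rev="v1.2.0",  # Git tag or commit pinning the dataset version
) as f:
    training_set = pd.read_csv(f)

print(training_set.shape)
```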
Cleanlab: Automatically detects labeling errors in your datasets. Black magic that works surprisingly well.
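The "black magic" is confident learning over your model's out-of-sample predicted probabilities. A toy sketch with cleanlab 2.x (the numbers are made up; in a real project pred_probs would come from cross-validation):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Toy data: noisy labels plus out-of-sample predicted probabilities.
labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.8, 0.2],   # model strongly disagrees with the label here
    [0.7, 0.3],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.9, 0.1],
    [0.3, 0.7],
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # indices of the examples whose labels look wrong
```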
Practical case: the inverted 80/20
In traditional ML projects, the assumption was that 80% of the time went to data preparation and 20% to modeling. The reality in many teams was that the 80% went into getting the data just "good enough" to move on to modeling as quickly as possible.
The Data-Centric approach inverts the priority: that 80% should be dedicated to genuinely improving data, not patching it. A simple model with excellent data consistently beats a complex model with mediocre data.
Implications for data engineers
If your role is building data pipelines, this affects you directly:
- Pipelines aren’t just ETL. They must include continuous quality validation (see the sketch after this list).
- Data observability is as critical as application observability.
- You need to collaborate more closely with data scientists and domain experts.
- Data versioning is not optional.
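To make that first point concrete, here is a minimal, framework-agnostic sketch of a quality gate inside a pipeline step; the column names and thresholds are hypothetical:

```python
import pandas as pd

# A validation gate between extraction and loading: failing fast here is
# far cheaper than debugging a silently degraded model later.
def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    checks = {
        "order_id has no nulls": df["order_id"].notna().all(),
        "amounts within plausible range": df["amount"].between(0, 100000).all(),
        "order_id has no duplicates": not df["order_id"].duplicated().any(),
    }
    failures = [name for name, passed in checks.items() if not passed]
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")
    return df
```

In an orchestrator (Airflow, Dagster, whatever you use) this would run as its own task, so bad batches never reach training.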
If you want to go deeper into these practices, my data engineering guide covers the fundamentals of the role.
The future is hybrid
Data-Centric AI doesn’t mean ignoring advances in model architectures. LLMs, transformers, fine-tuning techniques… all of that remains important.
But the real competitive differentiator is increasingly in who has better data, not who has the biggest model. And improving data is a problem of engineering, processes, and domain knowledge. Exactly where data engineers can add the most value.
Have you implemented Data-Centric AI practices on your team? What tools do you use for data validation?