Data Engineering Guide: From Excel to Professional Pipelines


TL;DR

  • Data Engineering = building infrastructure so data flows reliably
  • It’s not Data Science (they analyze, we build the pipes)
  • 80% of the job is cleaning data, not glamorous stuff
  • Basic stack: SQL + Python + an orchestrator (Airflow, Prefect)
  • If you come from Excel/Power BI, you have more foundation than you think

What Data Engineering Really Is

A Data Engineer builds and maintains the infrastructure that allows data to flow from point A to point B reliably.

Point A: Databases, APIs, files, sensors, logs, whatever. Point B: Data warehouses, dashboards, ML models, applications.

Your job is making that flow work. Always. Without errors. At scale.

What It’s NOT

  • Not Data Science: They build predictive models. You build the pipeline that feeds those models.
  • Not Data Analysis: They extract insights from data. You make sure they have clean data to analyze.
  • Not DBA: They optimize databases. You move data between systems.

There’s overlap, especially in small companies. But the specialization is real.


The Real Day-to-Day

If you think a Data Engineer spends their day designing elegant architectures… not exactly.

What I do 80% of the time

1. Clean data

  • Dates in 15 different formats
  • Fields that should be numeric but have “N/A”, “-”, “ ”
  • Duplicates where you can’t tell whether they’re true duplicates or distinct records
  • Broken encoding (goodbye ñ, hello Ã±)
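
In pandas, that kind of cleanup usually looks something like the sketch below (the column names, order_date, amount, customer_id, loaded_at, are invented for illustration):

# Rough pandas sketch of a typical cleanup (column names are invented)
import pandas as pd

df = pd.read_csv("orders.csv")

# Dates in mixed formats -> one datetime column (unparseable values become NaT)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# "Numeric" fields polluted with "N/A", "-", " " -> real numbers (junk becomes NaN)
df["amount"] = pd.to_numeric(
    df["amount"].replace({"N/A": None, "-": None, " ": None}),
    errors="coerce",
)

# Possible duplicates: same customer and date, keep only the latest loaded row
df = df.sort_values("loaded_at").drop_duplicates(
    subset=["customer_id", "order_date"], keep="last"
)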

2. Debug pipelines

  • “Yesterday’s dashboard is empty” → vendor’s SFTP changed IP
  • “Numbers don’t match” → someone changed a column in the ERP
  • “The process took 6 hours” → a query without an index

3. Convince people

  • “But my Excel works fine” → until it doesn’t
  • “Why can’t we use Google Sheets?” → because 500MB of data
  • “This is urgent” → everything is urgent

4. Document (or regret not doing it)

  • “What does this 2019 script do?” → nobody knows, the author left

What I do 20% of the time

  • Design new pipelines
  • Evaluate tools
  • Optimize existing processes
  • Learn new things

That 20% is the fun part. But the 80% pays the bills.


The Modern Stack (2026)

Level 1: The minimum you need

  • SQL: Query and transform data in databases
  • Python: Scripts, automation, APIs
  • Git: Version control (yes, for data too)
  • Terminal/Bash: Basic automation

With this you can do 70% of the work. Seriously.

Level 2: The typical company stack

  • Orchestration: Airflow, Prefect, Dagster
  • Transformation: dbt, Spark, pandas
  • Storage: Snowflake, BigQuery, Databricks, Redshift
  • Ingestion: Fivetran, Airbyte, Singer
  • Quality: Great Expectations, dbt tests, Soda

Level 3: What large companies add

  • Data Catalog: Datahub, Amundsen, Atlan
  • Governance: Collibra, Alation
  • Monitoring: Monte Carlo, Bigeye
  • Streaming: Kafka, Kinesis, Flink

My recommendation to start

  1. Solid SQL (not basic, solid)
  2. Python with pandas and requests
  3. One orchestrator (Prefect is easier than Airflow)
  4. dbt for transformations
  5. A warehouse (BigQuery has a generous free tier)

With that you can work at 90% of companies.
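
To make the orchestrator piece concrete, here is roughly what a tiny Prefect flow looks like (Prefect 2.x style; the task bodies and the flow name are placeholders, not a real pipeline):

# Minimal Prefect flow sketch (Prefect 2.x); task bodies are placeholders
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract() -> list[dict]:
    # pull rows from an API, a database, a file...
    return [{"id": 1, "amount": 10.0}]

@task
def transform(rows: list[dict]) -> list[dict]:
    # cleaning and reshaping happen here
    return [r for r in rows if r["amount"] > 0]

@task
def load(rows: list[dict]) -> None:
    # write to your warehouse, or wherever point B is
    print(f"loaded {len(rows)} rows")

@flow(name="daily-sales")
def daily_sales():
    load(transform(extract()))

if __name__ == "__main__":
    daily_sales()  # once it runs locally, schedule it with a deployment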


ETL vs ELT: The Paradigm Shift

ETL (Extract, Transform, Load) - The Classic Approach

Source → [Transform] → Data Warehouse

You transform data BEFORE loading. Requires knowing what you need upfront.

Pros: Less data in warehouse, more control. Cons: Less flexible, more complex pipelines.

ELT (Extract, Load, Transform) - The Modern Approach

Source → Data Warehouse → [Transform]

You load everything and transform INSIDE the warehouse using SQL.

Pros: More flexible, tools like dbt make it elegant. Cons: More data stored, storage costs.
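
As a toy illustration of the ELT shape, here is the pattern with SQLite standing in for the warehouse (file names and columns are invented; in real projects the SQL part is what you would put in a dbt model):

# Toy ELT sketch: load raw data first, transform inside the "warehouse" with SQL
import sqlite3
import pandas as pd

raw = pd.read_csv("raw_orders.csv")        # hypothetical source extract
con = sqlite3.connect("warehouse.db")      # stand-in for BigQuery/Snowflake

# Load: the raw data lands untouched
raw.to_sql("raw_orders", con, if_exists="replace", index=False)

# Transform: happens inside the warehouse, in SQL
con.execute("""
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT
        id,
        UPPER(TRIM(country)) AS country,
        CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
con.commit()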

The Reality

Most modern projects use ELT because:

  • Storage is cheap
  • Modern warehouses are very fast
  • dbt makes transformations elegant
  • You can iterate without rebuilding pipelines

But ETL still makes sense for:

  • Sensitive data that shouldn’t arrive raw
  • Very large volumes where transform reduces a lot
  • Legacy systems that don’t support the modern approach

Data Quality: 80% of the Job

The prettiest pipeline is useless if the data coming out of it is wrong. In practice, most raw data arrives dirty and needs cleaning before it can be trusted.

Typical Problems

  • Completeness: 30% of emails empty
  • Uniqueness: customer duplicates
  • Consistency: “USA”, “US”, “usa”, “United States”
  • Validity: future dates in “birth date”
  • Accuracy: incorrect rounding
  • Timeliness: 3-month-old data presented as “current”

How We Attack It

1. Validation at ingestion

# Simple example with Great Expectations
# (expectations are called as methods on a validator or dataset object)
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_match_regex("email", r".*@.*\..*")

2. Tests in transformations

-- dbt test
SELECT * FROM customers
WHERE created_at > CURRENT_DATE  -- Should return no records

3. Continuous monitoring

  • Alerts when metrics deviate
  • Data quality dashboards
  • Automatic anomaly detection
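
Even without a monitoring product, a crude version of “alert when metrics deviate” is just a scheduled query plus a threshold. A minimal sketch (the table, column, and threshold are invented):

# Crude volume-anomaly check: compare today's row count with the recent average
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical warehouse

today = con.execute(
    "SELECT COUNT(*) FROM orders_clean WHERE load_date = DATE('now')"
).fetchone()[0]

baseline = con.execute(
    "SELECT COUNT(*) / 7.0 FROM orders_clean "
    "WHERE load_date >= DATE('now', '-7 days') AND load_date < DATE('now')"
).fetchone()[0]

# Alert if today's volume drops more than 50% below the 7-day average
if baseline and today < 0.5 * baseline:
    print(f"ALERT: only {today} rows today vs ~{baseline:.0f}/day recently")
    # in real life: send this to Slack, email, PagerDuty...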

4. Clear ownership

  • Who is responsible for each dataset?
  • Who approves definition changes?
  • Who responds when something fails?

If You Come From Excel/Power BI

Good news: you have more foundation than you think.

What You Already Know (even if you don’t call it that)

  • VLOOKUP, INDEX/MATCH → JOINs
  • Pivot tables → GROUP BY, aggregations
  • Power Query → ETL/ELT
  • Table relationships → Dimensional modeling
  • Macros/VBA → Python scripts
  • Data connections → Data ingestion
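
For example, the VLOOKUP that pulls a customer name into an orders sheet is just a left join. In pandas (with made-up data) it looks like this:

# VLOOKUP(customer_id, customers, name) is a left join
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ana", "Luis"]})

enriched = orders.merge(customers, on="customer_id", how="left")
print(enriched)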

The Transition Path

Step 1: Deep SQL

  • Not just SELECT, but window functions, CTEs, subqueries
  • Query optimization
  • Different dialects (SQL Server, PostgreSQL, BigQuery)
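
To give a feel for what “beyond SELECT” means, here is a CTE plus a window function run through Python’s built-in sqlite3 module (window functions need a SQLite build of 3.25 or newer; the table and columns are invented):

# CTE + window function: each sale with the customer's running total
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (customer_id INT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES (1, '2026-01-01', 100), (1, '2026-01-05', 50),
                             (2, '2026-01-02', 200);
""")

query = """
WITH ordered_sales AS (
    SELECT customer_id, sale_date, amount FROM sales
)
SELECT
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id ORDER BY sale_date
    ) AS running_total
FROM ordered_sales
"""
for row in con.execute(query):
    print(row)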

Step 2: Basic Python

  • pandas for data manipulation
  • requests for APIs
  • Automating repetitive tasks
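
For instance, pulling records from an API and flattening them into a table is only a few lines (the URL and fields below are placeholders):

# Pull JSON from an API and flatten it into a DataFrame (endpoint is hypothetical)
import pandas as pd
import requests

resp = requests.get("https://api.example.com/orders", timeout=30)
resp.raise_for_status()

df = pd.json_normalize(resp.json())   # nested JSON -> flat columns
df.to_csv("orders_snapshot.csv", index=False)
print(df.head())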

Step 3: A real project

  • Automate something you do manually
  • Move data from A to B with a script
  • Schedule a daily execution

Step 4: The ecosystem

  • Git to version your code
  • An orchestrator to schedule jobs
  • A warehouse to store results

Power Query is literally an ETL tool. If you master it, you already understand the concepts. You just need to translate them to “grown-up” tools.


The Future: Data Engineering + AI

In 2026, AI is changing the role:

What AI already does well

  • Generate basic SQL queries
  • Write simple transformations
  • Document existing code
  • Detect data anomalies

What remains human

  • Understanding the business
  • Designing architectures that scale
  • Debugging complex problems
  • Deciding what data matters

My prediction

The junior Data Engineer who only knows how to run queries has their days numbered. The one who understands the business, designs systems, and uses AI as a tool… that one has a future.


How to Start (3-Month Plan)

Month 1: Fundamentals

  • Advanced SQL (window functions, CTEs)
  • Basic Python (pandas, files, APIs)
  • Basic Git

Month 2: Tools

  • One orchestrator (Prefect or Airflow)
  • dbt for transformations
  • A warehouse (BigQuery free tier)

Month 3: Real Project

  • Choose a real problem (your data, public data)
  • Complete pipeline: ingestion → transformation → output
  • Documentation and basic tests

At the end of 3 months you’ll have something to show in interviews.


Communities

  • Data Engineering Discord
  • r/dataengineering on Reddit
  • Locally Optimistic (Slack)

Conclusion

Data Engineering is building the infrastructure that makes data flow. 80% is cleaning data and debugging pipelines. 20% is designing elegant things.

If you come from Excel/Power BI, you already have the mindset. You just need to learn professional tools: solid SQL, Python, an orchestrator, dbt.

The field is growing, salaries are good, and AI won’t replace you if you understand the business beyond the technology.

Start with SQL. Really. Everything else comes after. As you advance, you might explore more sophisticated approaches like Data Fabric architecture for complex enterprise environments.
