Data Engineering Guide: From Excel to Professional Pipelines


TL;DR

  • Data Engineering = building infrastructure so data flows reliably
  • It’s not Data Science (they analyze, we build the pipes)
  • 80% of the job is cleaning data, not glamorous stuff
  • Basic stack: SQL + Python + an orchestrator (Airflow, Prefect)
  • If you come from Excel/Power BI, you have more foundation than you think

What Data Engineering Really Is

A Data Engineer builds and maintains the infrastructure that allows data to flow from point A to point B reliably.

Point A: Databases, APIs, files, sensors, logs, whatever. Point B: Data warehouses, dashboards, ML models, applications.

Your job is making that flow work. Always. Without errors. At scale.

What It’s NOT

  • Not Data Science: They build predictive models. You build the pipeline that feeds those models.
  • Not Data Analysis: They extract insights from data. You make sure they have clean data to analyze.
  • Not DBA: They optimize databases. You move data between systems.

There’s overlap, especially in small companies. But the specialization is real.


The Real Day-to-Day

If you think a Data Engineer spends their day designing elegant architectures… not exactly.

What I do 80% of the time

1. Clean data

  • Dates in 15 different formats
  • Fields that should be numeric but have “N/A”, “-”, “ ”
  • Duplicates where you can’t tell whether they’re true duplicates or distinct records
  • Broken encoding (goodbye ñ, hello Ã±)
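
In pandas, that kind of cleanup usually looks something like the sketch below (the column names, order_date, amount, customer_id, loaded_at, are invented for illustration):

# Rough pandas sketch of a typical cleanup (column names are invented)
import pandas as pd

df = pd.read_csv("orders.csv")

# Dates in mixed formats -> one datetime column (unparseable values become NaT)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# "Numeric" fields polluted with "N/A", "-", " " -> real numbers (junk becomes NaN)
df["amount"] = pd.to_numeric(
    df["amount"].replace({"N/A": None, "-": None, " ": None}),
    errors="coerce",
)

# Possible duplicates: same customer and date, keep only the latest loaded row
df = df.sort_values("loaded_at").drop_duplicates(
    subset=["customer_id", "order_date"], keep="last"
)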

2. Debug pipelines

  • “Yesterday’s dashboard is empty” → vendor’s SFTP changed IP
  • “Numbers don’t match” → someone changed a column in the ERP
  • “The process took 6 hours” → a query without an index

3. Convince people

  • “But my Excel works fine” → until it doesn’t
  • “Why can’t we use Google Sheets?” → because 500MB of data
  • “This is urgent” → everything is urgent

4. Document (or regret not doing it)

  • “What does this 2019 script do?” → nobody knows, the author left

What I do 20% of the time

  • Design new pipelines
  • Evaluate tools
  • Optimize existing processes
  • Learn new things

That 20% is the fun part. But the 80% pays the bills.


The Modern Stack (2026)

Level 1: The minimum you need

  • SQL: Query and transform data in databases
  • Python: Scripts, automation, APIs
  • Git: Version control (yes, for data too)
  • Terminal/Bash: Basic automation

With this you can do 70% of the work. Seriously.

Level 2: The typical company stack

  • Orchestration: Airflow, Prefect, Dagster
  • Transformation: dbt, Spark, pandas
  • Storage: Snowflake, BigQuery, Databricks, Redshift
  • Ingestion: Fivetran, Airbyte, Singer
  • Quality: Great Expectations, dbt tests, Soda

Level 3: What large companies add

  • Data Catalog: Datahub, Amundsen, Atlan
  • Governance: Collibra, Alation
  • Monitoring: Monte Carlo, Bigeye
  • Streaming: Kafka, Kinesis, Flink

My recommendation to start

  1. Solid SQL (not basic, solid)
  2. Python with pandas and requests
  3. One orchestrator (Prefect is easier than Airflow)
  4. dbt for transformations
  5. A warehouse (BigQuery has a generous free tier)

With that you can work at 90% of companies.
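
To make the orchestrator piece concrete, here is roughly what a tiny Prefect flow looks like (Prefect 2.x style; the task bodies and the flow name are placeholders, not a real pipeline):

# Minimal Prefect flow sketch (Prefect 2.x); task bodies are placeholders
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract() -> list[dict]:
    # pull rows from an API, a database, a file...
    return [{"id": 1, "amount": 10.0}]

@task
def transform(rows: list[dict]) -> list[dict]:
    # cleaning and reshaping happen here
    return [r for r in rows if r["amount"] > 0]

@task
def load(rows: list[dict]) -> None:
    # write to your warehouse, or wherever point B is
    print(f"loaded {len(rows)} rows")

@flow(name="daily-sales")
def daily_sales():
    load(transform(extract()))

if __name__ == "__main__":
    daily_sales()  # once it runs locally, schedule it with a deployment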


ETL vs ELT: The Paradigm Shift

ETL (Extract, Transform, Load) - The Classic Approach

Source → [Transform] → Data Warehouse

You transform data BEFORE loading. Requires knowing what you need upfront.

Pros: Less data in warehouse, more control. Cons: Less flexible, more complex pipelines.

ELT (Extract, Load, Transform) - The Modern Approach

Source → Data Warehouse → [Transform]

You load everything and transform INSIDE the warehouse using SQL.

Pros: More flexible, tools like dbt make it elegant. Cons: More data stored, storage costs.
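
As a toy illustration of the ELT shape, here is the pattern with SQLite standing in for the warehouse (file names and columns are invented; in real projects the SQL part is what you would put in a dbt model):

# Toy ELT sketch: load raw data first, transform inside the "warehouse" with SQL
import sqlite3
import pandas as pd

raw = pd.read_csv("raw_orders.csv")        # hypothetical source extract
con = sqlite3.connect("warehouse.db")      # stand-in for BigQuery/Snowflake

# Load: the raw data lands untouched
raw.to_sql("raw_orders", con, if_exists="replace", index=False)

# Transform: happens inside the warehouse, in SQL
con.execute("""
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT
        id,
        UPPER(TRIM(country)) AS country,
        CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
con.commit()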

The Reality

Most modern projects use ELT because:

  • Storage is cheap
  • Modern warehouses are very fast
  • dbt makes transformations elegant
  • You can iterate without rebuilding pipelines

But ETL still makes sense for:

  • Sensitive data that shouldn’t arrive raw
  • Very large volumes where transform reduces a lot
  • Legacy systems that don’t support the modern approach

Data Quality: 80% of the Job

The prettiest pipeline is useless if the data coming out of it is wrong. In practice, most raw data arrives dirty and needs cleaning before it can be trusted.

Typical Problems

  • Completeness: 30% of emails empty
  • Uniqueness: customer duplicates
  • Consistency: “USA”, “US”, “usa”, “United States”
  • Validity: future dates in “birth date”
  • Accuracy: incorrect rounding
  • Timeliness: 3-month-old data presented as “current”

How We Attack It

1. Validation at ingestion

# Simple example with Great Expectations
# (expectations are called as methods on a validator or dataset object)
validator.expect_column_values_to_not_be_null("email")
validator.expect_column_values_to_match_regex("email", r".*@.*\..*")

2. Tests in transformations

-- dbt test
SELECT * FROM customers
WHERE created_at > CURRENT_DATE  -- Should return no records

3. Continuous monitoring

  • Alerts when metrics deviate
  • Data quality dashboards
  • Automatic anomaly detection
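
Even without a monitoring product, a crude version of “alert when metrics deviate” is just a scheduled query plus a threshold. A minimal sketch (the table, column, and threshold are invented):

# Crude volume-anomaly check: compare today's row count with the recent average
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical warehouse

today = con.execute(
    "SELECT COUNT(*) FROM orders_clean WHERE load_date = DATE('now')"
).fetchone()[0]

baseline = con.execute(
    "SELECT COUNT(*) / 7.0 FROM orders_clean "
    "WHERE load_date >= DATE('now', '-7 days') AND load_date < DATE('now')"
).fetchone()[0]

# Alert if today's volume drops more than 50% below the 7-day average
if baseline and today < 0.5 * baseline:
    print(f"ALERT: only {today} rows today vs ~{baseline:.0f}/day recently")
    # in real life: send this to Slack, email, PagerDuty...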

4. Clear ownership

  • Who is responsible for each dataset?
  • Who approves definition changes?
  • Who responds when something fails?

If You Come From Excel/Power BI

Good news: you have more foundation than you think.

What You Already Know (even if you don’t call it that)

  • VLOOKUP, INDEX/MATCH → JOINs
  • Pivot tables → GROUP BY, aggregations
  • Power Query → ETL/ELT
  • Table relationships → Dimensional modeling
  • Macros/VBA → Python scripts
  • Data connections → Data ingestion
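
For example, the VLOOKUP that pulls a customer name into an orders sheet is just a left join. In pandas (with made-up data) it looks like this:

# VLOOKUP(customer_id, customers, name) is a left join
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ana", "Luis"]})

enriched = orders.merge(customers, on="customer_id", how="left")
print(enriched)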

The Transition Path

Step 1: Deep SQL

  • Not just SELECT, but window functions, CTEs, subqueries
  • Query optimization
  • Different dialects (SQL Server, PostgreSQL, BigQuery)
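
To give a feel for what “beyond SELECT” means, here is a CTE plus a window function run through Python’s built-in sqlite3 module (window functions need a SQLite build of 3.25 or newer; the table and columns are invented):

# CTE + window function: each sale with the customer's running total
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (customer_id INT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES (1, '2026-01-01', 100), (1, '2026-01-05', 50),
                             (2, '2026-01-02', 200);
""")

query = """
WITH ordered_sales AS (
    SELECT customer_id, sale_date, amount FROM sales
)
SELECT
    customer_id,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id ORDER BY sale_date
    ) AS running_total
FROM ordered_sales
"""
for row in con.execute(query):
    print(row)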

Step 2: Basic Python

  • pandas for data manipulation
  • requests for APIs
  • Automating repetitive tasks
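
For instance, pulling records from an API and flattening them into a table is only a few lines (the URL and fields below are placeholders):

# Pull JSON from an API and flatten it into a DataFrame (endpoint is hypothetical)
import pandas as pd
import requests

resp = requests.get("https://api.example.com/orders", timeout=30)
resp.raise_for_status()

df = pd.json_normalize(resp.json())   # nested JSON -> flat columns
df.to_csv("orders_snapshot.csv", index=False)
print(df.head())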

Step 3: A real project

  • Automate something you do manually
  • Move data from A to B with a script
  • Schedule a daily execution

Step 4: The ecosystem

  • Git to version your code
  • An orchestrator to schedule jobs
  • A warehouse to store results

Power Query is literally an ETL tool. If you master it, you already understand the concepts. You just need to translate them to “grown-up” tools.


The Future: Data Engineering + AI

In 2026, AI is changing the role:

What AI already does well

  • Generate basic SQL queries
  • Write simple transformations
  • Document existing code
  • Detect data anomalies

What remains human

  • Understanding the business
  • Designing architectures that scale
  • Debugging complex problems
  • Deciding what data matters

My prediction

The junior Data Engineer who only knows how to run queries has their days numbered. The one who understands the business, designs systems, and uses AI as a tool… that one has a future.


How to Start (3-Month Plan)

Month 1: Fundamentals

  • Advanced SQL (window functions, CTEs)
  • Basic Python (pandas, files, APIs)
  • Basic Git

Month 2: Tools

  • One orchestrator (Prefect or Airflow)
  • dbt for transformations
  • A warehouse (BigQuery free tier)

Month 3: Real Project

  • Choose a real problem (your data, public data)
  • Complete pipeline: ingestion → transformation → output
  • Documentation and basic tests

At the end of 3 months you’ll have something to show in interviews.


Communities

  • Data Engineering Discord
  • r/dataengineering on Reddit
  • Locally Optimistic (Slack)

Conclusion

Data Engineering is building the infrastructure that makes data flow. 80% is cleaning data and debugging pipelines. 20% is designing elegant things.

If you come from Excel/Power BI, you already have the mindset. You just need to learn professional tools: solid SQL, Python, an orchestrator, dbt.

The field is growing, salaries are good, and AI won’t replace you if you understand the business beyond the technology.

Start with SQL. Really. Everything else comes after. As you advance, you might explore more sophisticated approaches like Data Fabric architecture for complex enterprise environments.
