VertiPaq for humans: how Power BI compresses your data
TL;DR
- VertiPaq uses 3 techniques: Value Encoding (reduce range), Dictionary Encoding (indices), RLE (sequences)
- Lower cardinality = better compression (round decimals, clean up duplicate variants)
- A single outlier can ruin compression for the entire column
- DAX Studio + VertiPaq Analyzer show you exactly which columns weigh the most
VertiPaq is Power BI’s compression engine. It’s what lets you load millions of rows into memory and have queries fly. But nobody explains how it actually works.
Today we fix that.
The problem: wasted bits
The number 22 in binary is 10110. Five bits.
But if you store it as a 32-bit integer, you have 27 leading zeros doing nothing. Wasted space.
Multiply that by millions of rows and you understand why your model weighs so much.
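You can check the bit math yourself in Python (nothing VertiPaq-specific, just arithmetic):

```python
print(bin(22))             # 0b10110 -> five significant bits
print((22).bit_length())   # 5; as a 32-bit int, the other 27 bits are leading zeros
```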
Technique 1: Value Encoding
Imagine a column with values between 194 and 216. To store 216 you need 8 bits (2^8 = 256).
The trick: subtract the minimum from all values.
- Original: 194, 200, 216 → needs 8 bits
- With offset of 194: 0, 6, 22 → needs 5 bits
Three fewer bits per row. With 10 million rows, that’s a lot of RAM.
The problem: a single outlier ruins the whole column. If you have values from 1 to 100 but someone entered 999999, every row in the column jumps from 7 bits to 20 bits because of that one value.
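A minimal Python sketch of both effects, the offset trick and the outlier problem (illustrative only, not VertiPaq's actual implementation):

```python
def bits_needed(n: int) -> int:
    """Bits required to store the largest value in a column."""
    return max(n.bit_length(), 1)

values = [194, 200, 216]
print(bits_needed(max(values)))                # 8 bits for 216
print(bits_needed(max(values) - min(values)))  # 5 bits for 22, after the 194 offset

# One outlier blows it up: 1..100 fits in 7 bits, but 999999 forces 20.
print(bits_needed(100), bits_needed(999999))   # 7 20
```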
Technique 2: Dictionary Encoding
You have a “Color” column with values: Red, Red, White, Black, Blue, Red, Blue, Black, Black.
Instead of storing repeated text, VertiPaq creates a dictionary:
| ID | Color |
|---|---|
| 0 | Red |
| 1 | White |
| 2 | Black |
| 3 | Blue |
And the column only stores: 0, 0, 1, 2, 3, 0, 3, 2, 2
Four unique values = 2 bits per row (2² = 4). Instead of storing “Black” (40+ bits as a string), you store the ID 2 in just 2 bits.
Cardinality matters: fewer unique values, better compression. A column with 4 colors compresses beautifully. A column with 1 million unique IDs doesn’t compress at all.
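Conceptually, the encoding works like this (a Python sketch of the idea, not VertiPaq's actual code):

```python
colors = ["Red", "Red", "White", "Black", "Blue", "Red", "Blue", "Black", "Black"]

dictionary: dict[str, int] = {}   # value -> ID, assigned in first-seen order
encoded = []
for color in colors:
    if color not in dictionary:
        dictionary[color] = len(dictionary)
    encoded.append(dictionary[color])

print(dictionary)  # {'Red': 0, 'White': 1, 'Black': 2, 'Blue': 3}
print(encoded)     # [0, 0, 1, 2, 3, 0, 3, 2, 2]
```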
Technique 3: Run Length Encoding (RLE)
If your data is sorted:
Q1, Q1, Q1, Q1, Q1 (310 times)
Q2, Q2, Q2, Q2, Q2 (290 times)
Instead of 600 rows, VertiPaq stores: (Q1, 310), (Q2, 290). Two entries.
The catch: RLE only works if equal values are consecutive. If the column is unsorted (Q1, Q2, Q1, Q2…), RLE compresses nothing.
VertiPaq decides internally how to sort to maximize RLE. You don’t control this directly.
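Here's a minimal sketch of both cases (again, just Python to illustrate the concept):

```python
from itertools import groupby

def rle(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

sorted_col = ["Q1"] * 310 + ["Q2"] * 290
print(rle(sorted_col))         # [('Q1', 310), ('Q2', 290)] -> 2 entries for 600 rows

unsorted_col = ["Q1", "Q2"] * 300
print(len(rle(unsorted_col)))  # 600 -> one "run" per row, RLE gains nothing
```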
What you CAN do
You don’t control the algorithm, but you control the material you feed it:
Reduce cardinality:
- Round unnecessary decimals (3.14159 → 3.14)
- Use date instead of datetime if you don’t need the time
- Clean duplicates (“Madrid”, “madrid”, “MADRID” → “Madrid”)
- Don’t import columns you don’t use
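Two of these tips in action, with made-up values just to show the effect on distinct counts:

```python
prices = [3.14159, 3.14161, 3.14160, 2.71828, 2.71830]
print(len(set(prices)))                       # 5 distinct values
print(len({round(p, 2) for p in prices}))     # 2 distinct values -> much smaller dictionary

cities = ["Madrid", "madrid", "MADRID", "Sevilla"]
print(len({c.capitalize() for c in cities}))  # 2 instead of 4
```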
Reduce range:
- Handle outliers before importing
- Consider splitting problematic columns (e.g., a datetime into separate date and time columns)
Use correct types:
- Numbers as numbers, not as text
- Numeric keys compress better than text keys
Model well:
- Dimensions compress better than repeating values in the fact table
- Star schema exists for a reason
Calculated columns vs measures
Calculated column: runs at refresh time, gets physically stored, takes up space.
Measure: only stores the formula, runs when someone looks at the visual.
That’s why “convert calculated columns to measures when you can” is an optimization rule. Same result, less storage.
Diagnosis
When your model is too heavy:
- DAX Studio’s VertiPaq Analyzer tells you exactly how much space each column takes up
- Look for high-cardinality columns that shouldn’t be
- Look for text columns that could be numeric IDs
- Check if there are columns you don’t use but imported “just in case”
Summary
VertiPaq uses three main techniques: Value Encoding (reduce range), Dictionary Encoding (indices instead of values), and RLE (compress sequences).
You don’t control the algorithm, but you control the quality of the material you give it. Less cardinality, less range, correct types, clean model.
The difference between a 2GB model and a 200MB model is usually in decisions made before writing a single line of DAX.
Bonus: VertiPaq and LLM Quantization
Working with Power BI, I wondered: can you apply these techniques to the weights of a language model?
Why it doesn’t work directly:
VertiPaq assumes sortable data with small differences, repeated values (RLE), and discrete columns. LLM weights are floats distributed roughly as N(0, σ²), almost all unique, with no natural order.
But there are analogous techniques:
| VertiPaq | ML equivalent |
|---|---|
| Dictionary encoding | Weight clustering (K-means on weights, store indices) |
| Bit-packing | Quantization (float32 → int8/int4) |
| RLE | Pruning (remove weights ~0) |
| Value encoding | Delta encoding for federated learning updates |
Weight clustering is closest to dictionary encoding:
```python
# Conceptually: store each weight as an index into a tiny "dictionary" of centroids.
original_weights = [0.0012, 0.0015, 0.0011, 0.8234, 0.8229, 0.8241]

# Clustering (k=2), centroids precomputed here to keep the sketch short:
centroids = [0.00127, 0.82347]  # the "dictionary"
indices = [min((0, 1), key=lambda k: abs(w - centroids[k])) for w in original_weights]

print(indices)  # [0, 0, 0, 1, 1, 1] → references into the dictionary
# Compression: 6 floats → 2 floats + 6 small indices
```
Quantization (int4/int8) won because it’s simpler and modern hardware has optimized instructions. Weight clustering requires a lookup step that adds latency.
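For contrast, a toy version of symmetric int8 quantization (a hedged minimal sketch, not any particular library's API):

```python
weights = [-0.42, 0.0013, 0.8234, -0.11]

scale = max(abs(w) for w in weights) / 127    # map the largest |weight| to int8's max
quantized = [round(w / scale) for w in weights]
restored = [q * scale for q in quantized]     # dequantize: close to the original, but lossy

print(quantized)  # [-65, 0, 127, -17]
print(restored)   # approximately the original weights
```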
The connection: Both worlds (BI and ML) solve the same problem: representing information with fewer bits without losing what matters. VertiPaq for fast analytical queries, quantization for inference on GPUs with limited VRAM.
Ideas travel between domains. Understanding compression in one context gives you an advantage in another.
To clean your data before it hits VertiPaq, master Power Query—the transformations you do there directly impact compression ratios. And if you’re seeing weird numbers after loading, the problem might be in your ETL, not the model itself.