Author: Lynton Mack
Know your column store indexing!
These are not databases!
Recently, I had a case from one of our Financial Institute clients where a Power BI dataset containing daily account balances was over 1GB in size and not refreshing in the service, and I was tasked to reduce its size.
I managed to shrink it to about 350 MB.
This was done by ADDING 2 fields.
Unlike click-bait, like those darned online recipes (just give me the recipe; I don’t care how your holiday in Italy in 2010 was), I will give you the results first and then tell you how I did it.
Short story – it’s about understanding and using column store indexes properly.
I took a 600 MB table and reduced it to 40 MB in testing. By the time I was done, I’d removed even more than that.
That’s a reduction in the size of over 90%!!! Or more than a tenth of the original size.
So, here’s the process I went through and why ADDING extra fields can REDUCE a table’s size.
OK, so I said I’d just tell you how I did it up front, but I can’t really explain why adding more data reduces the overall size without some basic index theory to help explain it. I’ll keep it brief, as there is plenty of good content elsewhere to learn more – which you should look up!
Data is not stored in clouds, the air or ether. It is stored on real physical things. Hard drives – with disks or solid-state – and RAM. Just because data is on an internet service does not mean it is not stored on hardware. I’m not going to delve into partition sizes, blocks and such here. Maybe another time.
Indexes contain pointers to physical locations on storage as to where to scan and find the data. The aim of a well-tuned index is to reduce the number of times the hardware needs to be scanned to return our results. Fewer scans equal quicker results.
A column store index is best suited for finding groups of records, while a row store is good for finding individual records. They can be used together, and when they are, they can really rock. I’ve used this to save slow database systems from being dumped and rebuilt when applied correctly.
Here is the world’s best way to understand row and column index theory:
In Power BI, we tend to report mainly on groups of records – a monthly total, an annual forecast, product sales by category, or gender diversity in the workplace – which is why it is driven by column store indexing. We are running measures over groups of records.
Back to the story…
Whenever I am debugging Power BI performance and size issues, I always start with DAX Studio, where I run a simple query to analyse the field sizes – takes minutes to set up and perform.
Tip: DAX Studio has a runtime version that doesn’t need to be installed. Great for when you don’t have admin permissions on the local device.
And immediately we can quickly see the culprit for the size of the data model – a single 616Mb field…
That table had just two columns.
A Key field that represented the date and account and then a numeric value for the balance. [BalanceKey] and [Balance]
When I look at the next row in the results, I can see the same column name on another table, but it is only 40Mb.
So why is this one 616 MB?
Because the index has one record every time the column’s value changes – which is one index record for EVERY one of the many millions of records, and each of these points to some physical location. I think this table had over 6 million rows.
Now, imagine what happens when I want this month’s balances… We need to scan a different location on the hardware for every single record using this index.
By adding the DATE and ACCOUNT fields to the Balance table, it provides a way of grouping those key values, and this is what reduces the size of the index because there would be a single record for a date with all the account balances for that date. All the balances for that date are also in one physical spot.
Just adding these new fields would give some improvement, but it could make things worse if the records were still ordered by their Key. ORDERING these fields keeps groups of records together in the most performant way for the column store index. I always order by the least variable to the most variable for the best compression.
Power BI is smart, but it doesn’t do a full scan of your data to determine its index tuning, just a subset. Adding ORDER BY clauses to your queries forces your data to be logically grouped TOGETHER.
Once these two fields were added and the index updated, you can see that the Balance is now only 40Mb in this table – not 600Mb.
Instead of 6 million index records, we have 2023 * [number of Accounts].
So, for two years and 10,000 accounts, that is approx. 730days * 10,000 accounts = 730,000.
Around 1/10th of the original size.
This query tuning does not just impact size; it directly impacts performance.
Small models are fast models. Reports back from the business are “What did you do? It’s so quick now!”
That’s the basic message I wanted to get across – know your indexing theory and how to apply it.
- Think in result sets (groups), not rows,
- ORDER your result sets when loading to reduce churn for best compression and performance,
- Don’t be afraid to add columns with common values to a table full of unique values – they really help with indexing,
- Do some heavy lifting in SQL if you need to.
The story continues…
Further to that above, I found and made some more changes – feel free to stop here, it’s not super relevant, but it is interesting from a modelling techniques perspective.
What follows are the corrections made to a model built by a DATABASE person. There is an assumption that transitioning from developing databases to models is easy. It’s not. It’s a whole different field with a different mindset required to deliver top-quality models. It’s OLTP versus OLAP.
Anyway, I had fixed this issue, and then I looked further and worked out that the Balance table was only there to provide the balance for the measure of Sum(Balance) in the DepositDate table.
So the original model looked like this…
The rate table also turned out to be the same – providing a summarised value to the DepositDate table.
These tables were added and linked to using their key values to return a value that would be aggregated.
Looking further, I then found Maturity and Interest tables as well… All these tables with single values, linked on keys and filtered directionally on 1 to many joins…
This is database thinking! Certainly, this is how the data could be stored for efficient updates, inserts and deletes – ROW operations. But this is not the way tabular modelling is meant to be done.
I added all of these fields to the underlying query of the DepositDate table.
I used SQL Management Studio to check the execution plan and considered and applied ROW indexes to the underlying tables.
Despite the extra joins in the query, they are mostly inner joins, and it all now runs quicker than the original ones did.
Cleaning it all up leaves us with just one table instead of 6 tables…
Not only was the difference in size astounding, but the queries populating it ran faster (with the correct indexes on the SQL tables also created along with query hints) and the model is simpler for self-service usage.
I dug around a bit more and found a number of tables in another area with 1:1 cardinality.
These should not exist in a model (without a REALLY good cause). If it is 1:1, then the granularity of the records is the same, and these should be fields on the main fact table. There is no value in having them as extra tables and they just get in the way and slow things down.
A few dimensions were also found that had simple lookup codes and names, no other attributes. Again, due to column store indexing, we can often get better compression by adding both the code and name fields from the dimension table to the fact table and then removing the dimension tables.
In my honest opinion, dimensions are for filtering when there is more than one option to filter by – so 3 or 4 fields, not just 1 or 2.
By the time I was done at the end of the day (literally, less than a single day), the database was around 30% of the original size, eight tables had been removed, and it was performing better across the board in all aspects.
What were dreary reports that people waited to refresh were now lightning fast – they thought they weren’t refreshing properly – they were that quick!
The model was streamlined and now a lot easier to consider for self-service use as well since it was simpler in layout and made sense to business users rather than data engineers.