When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries
Recently, I worked with a table in BigQuery called orders, sized at 2.98 GB. My goal was to optimize query performance and reduce the amount of data scanned for cost efficiency. Here’s how the journey unfolded:
The Initial Query
I ran the following query on the unpartitioned table:
SELECT *
FROM foodwagon.orders
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND restaurant_ratings >= 4 AND restaurant_ratings < 5;
Data scanned: 2.98 GB.
First Optimization: Partitioning
Since the query frequently filtered by order_date, I partitioned the table on order_date so BigQuery could prune partitions outside the filtered date range. After partitioning, the same query scanned 1.24 GB.
This was a decent improvement, but I wanted to push the optimization further.
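For reference, the partitioned table can be created with a statement along these lines (a sketch: the table name orders_partitioned is my placeholder, and I'm assuming day-level partitioning on order_date):

```sql
-- Hypothetical DDL for the partitioning step described above.
-- Assumes order_date is a DATE column; orders_partitioned is a placeholder name.
CREATE TABLE foodwagon.orders_partitioned
PARTITION BY order_date
AS
SELECT *
FROM foodwagon.orders;
```

With this in place, a query that filters on order_date only reads the partitions covering the requested date range instead of the full table.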
Next Step: Clustering
Given that most of my queries also filtered on restaurant_ratings, clustering seemed like the logical next step. However, restaurant_ratings is a float column with high cardinality, which isn't ideal for clustering on its own and tends to yield minimal gains.
To address this, I created a derived column, rounded_ratings, by rounding restaurant_ratings down to the nearest integer with FLOOR:
CREATE TABLE foodwagon.orders_clustered
PARTITION BY DATE_TRUNC(order_date, MONTH)
CLUSTER BY rounded_ratings
AS
SELECT *, CAST(FLOOR(restaurant_ratings) AS INT64) AS rounded_ratings
FROM foodwagon.orders;
This approach reduced the scanned data for my query to 283.03 MB, which is a 77% reduction compared to partitioning alone and a 90% reduction from the original table scan. Huge success!
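To get that 283 MB scan, the query has to filter on the clustered column itself. My reconstruction of the ratings predicate, assuming the original >= 4 AND < 5 band maps to a single floored value:

```sql
-- Hypothetical query against the clustered table; rounded_ratings = 4
-- covers the same rows as restaurant_ratings >= 4 AND restaurant_ratings < 5.
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
  AND rounded_ratings = 4;
```

An equality filter on the clustered column lets BigQuery skip clustered blocks whose rounded_ratings ranges don't match, which is where the extra reduction comes from.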
The Unexpected Outcome
However, things got interesting when I ran a slightly modified query:
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31';
This query scanned 1.4 GB, which is more than the partitioned table scan (1.24 GB). Upon inspecting the clustered table, I found its size had grown to 3.35 GB—larger than the original table.
Why Did This Happen?
Clustering Can Increase Storage Size: In this case, the CREATE TABLE ... AS SELECT added an extra INT64 column (rounded_ratings) to every row, and re-sorting rows for clustering can change how well the data compresses. Together, these grew the table from 2.98 GB to 3.35 GB.
Clustering Only Helps Queries That Filter on It: Clustering benefits queries that filter on the clustered fields. The modified query didn't reference rounded_ratings, so BigQuery couldn't prune any clustered blocks, and because SELECT * on the larger table also reads the extra column, the scan grew to 1.4 GB.
Alignment with Query Patterns Matters: Clustering pays off only when the clustered field matches how queries actually filter. High-cardinality or rarely filtered cluster keys can offset the gains, especially on large datasets.
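One way to confirm the size growth is to compare both tables through the dataset's __TABLES__ meta-view (a sketch; you may need to qualify the dataset with your project ID):

```sql
-- Compare the physical size of the original and clustered tables.
SELECT
  table_id,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM foodwagon.__TABLES__
WHERE table_id IN ('orders', 'orders_clustered');
```

This is how you'd see the 2.98 GB vs. 3.35 GB difference directly, rather than inferring it from scan sizes.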
Partitioning vs. Clustering: When to Use What?
Partitioning:
Use for low-cardinality fields like dates or categories.
Significant benefits when queries heavily filter on the partitioned column.
Adds minimal overhead to storage.
Clustering:
Works well for fields frequently used in filters or joins, especially with moderate cardinality.
Avoid clustering on high-cardinality fields (e.g., floats) without transforming the data.
Be cautious when combining clustering with partitioning; it can lead to larger table sizes if not used wisely.
Key Takeaways
Understand Query Patterns: Before deciding on partitioning or clustering, analyze how your queries actually filter. In my case, deriving rounded_ratings made clustering effective because most queries filtered on a ratings band.
Quantify Benefits: Partitioning reduced the scanned data by 58%, and clustering brought it down by an additional 77%. However, the tradeoff was increased table size and complexity.
Balance Partitioning and Clustering: Combining both can be powerful but may backfire if clusters don’t align with query patterns.
By understanding these strategies and tradeoffs, I was able to reduce costs and improve query performance significantly, despite some unexpected challenges.