Question
Should I manually add clustering and/or partitioning to my BigQuery data?
Environment
Destination: BigQuery
Answer
It is recommended that you manually partition and cluster your data into BigQuery for optimal performance of the Fivetran sync.
Partitioning your data into Google Big Query will impact the cost and performance of writing into GCP. You can partition your data using 2 main strategies: you can use a table column (TIMESTAMP
, DATE
, or INTEGER
column) or you can use the data’s time of ingestion.
What should I partition by?
If partitioning by TIMESTAMP
or data’s time of ingestion, this approach can be very useful if you want to run analytics for a specific period of time. Partitioning allows BigQuery to read and only process the rows of the specific time span. This means faster and more cost-efficient queries.
INTEGER
partitioning allows you to store a range of values in the same partition. You will have to identify the min and max values for the range size, but this could work for user id, geo coordinates, or even zip code. Please visit our Documentation for more information.
Clustering will address better read query performance as it tells BigQuery to store your data by certain fields.
What should I cluster by?
Think about what columns people will query your table and the order of the columns. Your queries commonly use filters or aggregation against multiple particular columns. Please visit our Documentation for more information.
Considerations
Clustering can be done on non-partitioned and partitioned data. There’s no limit on how many values a clustered column might have whereas there is a limitation of 4000 partitions within BigQuery. When partitioning on TIMESTAMP
field or by ingestion time both are limited with day granularity. That is why it is important to identify the correct