Improve Managing Fivetran related metadata
Which connector?:
At a minimum this pertains to fivetran_metadata and Salesforce related connections
Additional details:
These two connections (and possibly others) create multiple metadata related tables with data that accumulates quickly. The management and data retention of these tables falls completely on the customer.
At least three of these tables collect a lot of data:
- fivetran_metadata.log
- <<salesforce>>.fivetran_api_call
- <<salesforce>>.fivetran_query
A common data retention practice is to keep X number days/months/years/etc of data and these tables have columns that can support that. However they do not have indexes to assist with this, thus making DELETE statements slow and expensive. Adding an index is possible, but this is an added hassle placed upon the customer to manage.
An even easier way to manage these tables would be by partitioning by date range. When the Destination is a PostgreSQL cluster this is made very easy with the pg_partman extention. Once managed by pg_partman, adding partitions to and dropping old ones from a table according to a company's data retention policies is easy.
The first two tables - fivetran_metadata.log and - <<salesforce>>.fivetran_api_call - can support this because their existing primary key constraint contains the column - time_stamp and start respectively - that would be used as the partitioning column in PostgreSQL partitioning. You can create the parent partition table with the subsequent child partitions, PostgreSQL's partitioning kicks in, and Fivetran continues to work no problem.
The third table <<salesforce>>.fivetran_query has a column that would support this and make management easier, but it is not part of the primary key constraint and thus partitioning cannot work on it. Further, it cannot be changed because a Fivetran sync will notice the discrepancy between the target DDL and the expected DDL, drop the new primary key constraint, and try to re-create the existing primary key constraint which fails.
At a minimum, please update the start column to NOT NULL and the primary key constraint for <<salesforce>>.fivetran_query from just "id" to "id, start". The id column is already unique, so adding start will not affect the uniqueness. I have not seen an example where start is NULL so changing it to NOT NULL should not cause any problems either.
Even better, please review these metadata features and put some effort into making easier to manage across the board. At the very least, maybe make them an optional feature that is disabled by default for net new connections?
Please sign in to leave a comment.
Comments
0 comments