Connector Improvement: Connector Improvement: Feature Request: Valid time table was last attempted to by sync'd
AnsweredCurrent Gap:
We are a large Oracle shop in the Oracle RDBMS is the backend for a majority of our Enterprise ERP systems. This includes Peoplesoft, Ellucian Banner, and several other large data marts on campus. We're using Fivetran to sync these sources into Snowflake and then using dbt to transform the data for a multitude of use cases. There is a gap in the process, in that many of the tables that Fivetran is syncing are not updated or changed on a daily basis, and this presents a problem with our source_feshness tests, where our dbt builds will throw errors as _FIVETRAN_SYNCED is reporting the last time the source table changed and not the last time a sync was made.
The release of dbt-core 1.7.0 includes "Allow freshness to be determined via DBMS metadata for supported adapters (dbt-labs/dbt-core#8704)". In short, rather than checking for max(_fivetran_synced), which will fail on tables that are infrequently updated, it checks the table metadata, like this:
select table_schema as schema,
table_name as identifier,
last_altered as last_modified
from information_schema.tables
where upper(table_name) = 'PS_TERM_TBL'
At face, this is essentially the same as max(_fivetran_synced). HOWEVER, if we were to enable a hook, or if Fivetran were to build this into their product, then after a table is checked for refreshing and regardless of if rows were updated, we could call the following statement:
UPDATE PS_TERM_TBL
SET _FIVETRAN_SYNCED = current_timestamp()
WHERE 1 = 2
This would clearly not update _FIVETRAN_SYNCED, but it would update the information_schema.tables.last_altered date. In sum, this would close the broken chain and give us true freshness in dbt.
Alternate strategy: if the above strategy does not appeal to Fivetran, an alternative recommendation is for Fivetran to maintain an additional column on their custom columns called _fivetran_last_refreshed or something like that. In my opinion, this is not as ideal because it requires either 1 or all rows to be updated every time there's a refresh check, which is more expensive.
-
Official comment
Hey Todd,
I appreciate you going into this level of detail around your use case. You *could* potentially use our sync_end event as a webhook to drive the SQL you mentioned, which could then be queried by dbt source freshness. However, table-level updates are not articulated in our webhook events.
Let me know if that would help solve the issue or if it's not granular enough for your use case.
Jimmy
Please sign in to leave a comment.
Comments
1 comment