
Connector Improvement: Same sync time across tables

Mohammed Khalil User

March 29, 2022 14:20

Say I have two tables, Orders and Customers. You might sync the Customers table first and then Orders, but if a new customer places an order in between, my data warehouse will see an Order without a corresponding Customer.

So it would be good to have a cutoff for all tables to be the time the sync started, for example. This is a potential issue regardless of the sync frequency.
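To illustrate the request, here is a minimal sketch of a shared cutoff, assuming each source row carries an `updated_at` timestamp. The function names and sample rows are hypothetical and do not reflect how Fivetran connectors are implemented. Every table is filtered against the same cutoff, so a row changed after the sync started waits for the next sync instead of appearing in only one table.

```python
from datetime import datetime, timezone


def rows_up_to_cutoff(rows, cutoff):
    """Keep only rows last modified at or before the shared cutoff."""
    return [row for row in rows if row["updated_at"] <= cutoff]


def sync_all(tables, cutoff):
    """Apply the same cutoff to every table, whatever order they are read in."""
    return {name: rows_up_to_cutoff(rows, cutoff) for name, rows in tables.items()}


def at(hour, minute):
    # Helper for the hypothetical sample data below.
    return datetime(2022, 3, 29, hour, minute, tzinfo=timezone.utc)


# Hypothetical source state: customer 2 and order 11 appear after the sync started.
source = {
    "customers": [
        {"id": 1, "updated_at": at(13, 50)},
        {"id": 2, "updated_at": at(14, 5)},
    ],
    "orders": [
        {"id": 10, "customer_id": 1, "updated_at": at(13, 55)},
        {"id": 11, "customer_id": 2, "updated_at": at(14, 6)},
    ],
}

cutoff = at(14, 0)  # the moment the sync started
snapshot = sync_all(source, cutoff)
```

Order 11 and customer 2 are both held back until the next sync, so the destination never holds an order whose customer is missing. A timestamp filter like this is not a true database snapshot: it only approximates one for incremental syncs.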


Comments

5 comments

Official comment

Fraser User
- May 19, 2022 17:19
Hi Mohammed Khalil and Nicholas Moulton - are there particular data sources for which this is important to you?

The core of your request is about data consistency. Fivetran today can only _guarantee_ eventual consistency. For database connectors using a log-based replication method we _effectively_ have snapshot consistency at the end of a sync. You can take advantage of this with our new Integrated Scheduling feature, which will orchestrate downstream transformations after the sync is finished.

For API connectors, improving consistency is harder than you would think. We've found that many APIs are themselves only eventually consistent when they abstract a distributed system. We call this data integrity error "late arriving events", and the only solution is to re-sync time slices repeatedly to catch these events. Because we only charge on active primary keys, this behavior has no impact on your pricing.
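The re-sync idea can be sketched roughly as follows. This is a simplified illustration, not Fivetran's actual connector code: `next_window`, `fetch`, `upsert`, and the one-hour lookback are all assumptions. Each sync reaches back before the previous high-water mark, and records are upserted by primary key so that re-reading a slice does not create duplicates.

```python
from datetime import datetime, timedelta


def next_window(high_water_mark, now, lookback):
    """Start the next pull `lookback` before the point the last sync reached."""
    return high_water_mark - lookback, now


def fetch(visible_records, start, end):
    """Stand-in for an API call: return records whose event time is in [start, end)."""
    return [r for r in visible_records if start <= r["event_time"] < end]


def upsert(destination, records):
    """Insert or overwrite by primary key, so re-synced rows are not duplicated."""
    for record in records:
        destination[record["id"]] = record


destination = {}
lookback = timedelta(hours=1)

# First sync covers 09:00-10:00; only event "a" is visible in the API so far.
a = {"id": "a", "event_time": datetime(2022, 5, 19, 9, 30)}
first_start = datetime(2022, 5, 19, 9, 0)
first_end = datetime(2022, 5, 19, 10, 0)
upsert(destination, fetch([a], first_start, first_end))

# Event "b" happened at 09:50 but only became visible after the first sync ran.
b = {"id": "b", "event_time": datetime(2022, 5, 19, 9, 50)}

# Second sync at 11:00 re-reads the previous hour and picks up the late event.
start, end = next_window(first_end, datetime(2022, 5, 19, 11, 0), lookback)
upsert(destination, fetch([a, b], start, end))
```

After the second sync the destination holds both `a` and `b` exactly once. A longer lookback catches later arrivals at the cost of re-reading more data.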

To your suggestion - we could limit how recent a record we sync, but that comes at the cost of increased overall latency. Our general recommendation is to run the connector more frequently to reduce the inconsistency.
Nicholas Moulton User
- April 05, 2022 18:07
Our company has run into the same issue. It would be ideal to have referential integrity supported for the connectors but I can see where that would be very difficult to develop.

In place of that, having a priority list to flag specific tables as "primary" tables would be simple and, in most cases, just as effective.
Mohammed Khalil User
- May 26, 2022 20:08
Hi Fraser

Yes, running the sync more frequently would reduce the prevalence of this issue, but it will still occur if a lot of data is being added to both tables while the sync is ongoing.
Vinay Rana User
- November 30, 2022 18:03
It is hard to imagine a data ingestion pipeline that doesn't give the user the flexibility to pick the exact sync time and instead offers only some fixed interval.

Please add this feature ASAP. Take a sample use case:

I have more than 150 Instagram business accounts, and I want to pull data from all of them every hour. This data contains media stats: likes, comments, shares, saves, etc.

The way the Instagram Graph API works, it returns point-in-time data as of the moment the API call was made. I have a board member who wants to see an attribution model for all these stats every hour. Sadly, I can't do that, because Fivetran pulls data at an arbitrary minute of the hour. When I want to see the attribution model between 10:00 PM and 11:00 PM, I can't, because my data is pulled at a random time instead of in the first minute of the hour.
Nicholas Moulton User
- November 30, 2022 22:27
@...

Our use case is more around referential integrity. A simple table order sync value would work well or possibly marking tables with parent/child relationships.

In our case we have several parent/child tables like orders/customers, where it is not ideal to have a report show new orders with no idea which customers are attached. As you mentioned, you can reduce this with tighter sync windows, but given the size and activity on those tables this isn't ideal.
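One possible reading of the parent/child idea is a sync order derived from declared relationships. The sketch below is hypothetical (the `sync_order` function and the `parent_of` mapping are illustrative, not an existing Fivetran setting). It reads child tables before their parents, on the assumption that parent rows are created before the child rows that reference them and are not deleted mid-sync. Under that assumption, every order captured already has its customer in the source by the time Customers is read.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sync_order(parent_of):
    """Return table names ordered so each child is synced before its parents.

    `parent_of` maps a child table to the set of parent tables it references.
    """
    sorter = TopologicalSorter()
    for child, parents in parent_of.items():
        for parent in parents:
            # Declaring the child as a predecessor makes it come out first.
            sorter.add(parent, child)
    return list(sorter.static_order())


# Hypothetical relationships: order_items reference orders, orders reference customers.
relationships = {
    "order_items": {"orders"},
    "orders": {"customers"},
}

order = sync_order(relationships)
```

For this chain the result is `order_items`, then `orders`, then `customers`. If the source can delete parent rows during a sync, ordering alone does not guarantee referential integrity.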