Connector Improvement: Option for Using a Primary Key in File Connectors to MERGE data
Answered
File connectors currently have no way to infer the primary key from the file itself and use it to intelligently de-duplicate and merge data into the destination. As a result, our tables have many duplicate rows, and none of the workaround solutions are ideal. What we have considered:
1. Create a View to Pull Distinct IDs Using max(_fivetran_synced)
This adds latency every time the view is queried by downstream applications (e.g., a Looker report); see the SQL sketch after this list.
2. Re-Create Table Every Time New Data Is Loaded
This becomes increasingly slow and expensive as the table keeps growing.
3. Use a MERGE Statement in the Destination to Maintain a Separate "prod" Table
This is not scalable, as the MERGE statement fails whenever new columns are added to the table (also covered in the sketch after this list).
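For reference, here is a minimal SQL sketch of workarounds 1 and 3. All table and column names (analytics.raw_orders, order_id, status, amount) are illustrative, and the syntax is Snowflake/BigQuery-style, so treat it as a sketch rather than a drop-in script:

```sql
-- Workaround 1: de-duplicating view over the raw landing table,
-- keeping only the most recently synced row per order_id.
-- _fivetran_synced is the sync timestamp Fivetran adds to each row.
CREATE OR REPLACE VIEW analytics.orders_deduped AS
SELECT *
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER (
            PARTITION BY t.order_id
            ORDER BY t._fivetran_synced DESC
        ) AS row_rank
    FROM analytics.raw_orders AS t
) AS ranked
WHERE ranked.row_rank = 1;

-- Workaround 3: merge the latest rows into a separate "prod" table.
-- Every column has to be listed explicitly, which is exactly why this
-- breaks whenever a new column appears in the file.
MERGE INTO analytics.orders_prod AS target
USING analytics.orders_deduped AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    status           = source.status,
    amount           = source.amount,
    _fivetran_synced = source._fivetran_synced
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, _fivetran_synced)
    VALUES (source.order_id, source.status, source.amount, source._fivetran_synced);
```

The view defers the cost to query time, and the MERGE has to enumerate every column, which is why neither workaround holds up as the schema or data volume changes.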
Proposal: Allow users to define a primary key column in the file itself, and have Fivetran automatically use it to de-duplicate and perform UPDATEs to existing rows in the destination table.
-
Official comment
Hi John and Britt,
Thanks for your comments here! It is definitely not ideal to have duplicate rows or too many tables in your destination warehouse.
We can investigate a few possible solutions to this: I wonder if we can either improve how we infer the primary key or, as mentioned, allow users to specify it directly (possibly either in the file or via the UI).
However, minimal configuration is one of our key principles, so while we discuss this internally, we want to keep the best user experience in mind. I understand this is causing a pain point, and we will look into it to provide the best solution for our users.
Please let me know if you have further comments or questions!
-
Strongly agree with John. I understand that the lack of a primary key probably makes the connector challenging to build, but otherwise I have not found this Box connector helpful.
This would have been an awesome solution for my company, since we receive data from third parties via automated CSV drops and have had to insert it manually, which is a huge ongoing cost in resources that could be spent on something more productive than loading data. And, for better or worse, we point third-party vendors without an API (or anything similar) to Box to get us close to a somewhat automated solution.
The point is, this was so close to relieving us of a very painful and menial task, and I was surprised to see, upon connecting, hundreds of separate tables dropped into our warehouse.
-
This feature would help our scenario:
- we receive FTP uploads of a CSV file containing updates to a dataset
- each file has a unique name (prefixed with a timestamp)
- the file layout is consistent, and certain columns in the CSV are suitable for constructing a multi-column primary key
- we want the destination database updated via an upsert on that multi-column primary key.
Right now I think our only option is to construct a custom function.
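For illustration, a minimal sketch of the upsert we would want, assuming a hypothetical destination table analytics.balances and a staging table analytics.balances_staging keyed on (account_id, record_date); all names and columns are made up:

```sql
-- Hypothetical upsert of staged CSV rows into the destination on a
-- multi-column primary key (account_id, record_date).
MERGE INTO analytics.balances AS target
USING analytics.balances_staging AS source
    ON  target.account_id  = source.account_id
    AND target.record_date = source.record_date
WHEN MATCHED THEN UPDATE SET
    balance          = source.balance,
    currency         = source.currency,
    _fivetran_synced = source._fivetran_synced
WHEN NOT MATCHED THEN INSERT (account_id, record_date, balance, currency, _fivetran_synced)
    VALUES (source.account_id, source.record_date, source.balance,
            source.currency, source._fivetran_synced);
```

Maintaining something like this by hand (or in a custom function) means re-listing every column whenever the file changes, which is the part we would like Fivetran to handle for us.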
-
Running into this same issue as well. Without this, I don't think we are able to use Fivetran for our use case.