New Connector: Amazon S3 Capture deletes
AnsweredThe Amazon S3 connector (https://fivetran.com/docs/files/amazon-s3) does not replicate file deletions in the source.
We need Fivetran to replicate those file deletions. Doing "full resyncs" of the connector or manually deleting those rows from the destination table is not sustainable. Even worse, production jobs could fail if those tasks are not implemented right away.
As a suggestion, add the _FIVETRAN_DELETED flag to the destination table and set it to TRUE when the file has been removed from the S3 bucket. That would be a good behaviro when the "Modified File Merge" parameter is set to "Overwrite Rows".
-
Official comment
Hi Carlos,
Thank you for sharing, could you tell me more about the work flow you currently have?
Our S3 connector currently checks all files in the targeted drive and syncs any that have been updated more recently than the last sync - the challenge we have is identifying when a file was deleted as it is no longer present in the folder and isn't detected as having been 'updated'.
Can you think of an approach for sharing 'deletes' - what would you think if we asked you to add "DELETE_" to the front of any file you wanted to have processed as a 'delete' rather than an addition?
Best regardsAlison
-
Hi Alison,
Thanks for the question.
Before I send my suggestion, full disclaimer, I am not a developer that interacts directly with S3. I sit on the other side of the FT connector (e.g. on the Datawarehouse side, Snowflake in particular). An actual S3 developer may have other/better ideas.
I also want to mention, in case this adds valuable context, that the "file type" we are using is JSON and here are the parameters of our specific connector:
-
File Reading Behavior: Read all files as json
-
Compression Behavior: Infer file compression based on file extension
-
Error Handling: Fails a file sync if improperly formatted data detected
-
Modified File Merge: Upsert file using file name and line number
I'm not considering CSV use cases or other file types. My request is focused on the replication of JSON data when the JSON file type has been selected.
Based on our interactions with S3 so far, I would suggest that whenever the FT S3 connector stops seeing a key in the S3 bucket (e.g. when it stops seeing a JSON file), the connector communicates that situation to us (on the destination side) by setting the value of a _FIVETRAN_DELETED flag to TRUE.
Whenever that flag is true, we would ignore any value that the key has (e.g. we would ignore the contents of the _DATA field.) I realize that we could not assume that the value in FIVETRAN_DB is the last value in S3 before deletion. Now, if you want to wipe out the contents of the record, I guess that would be acceptable too (e.g. just keep the _FILE field and the _FIVETRAN_DELETED flag populated.) Or if you want to delete the actual record in FIVETRAN_DB, that could be acceptable too (that would send the message) although that is not how most FT connectors work.
Those are just a few suggestions, but:
- If I am missing something about S3 that would hinder doing this, I could stand corrected. As I said, I dont interact with the S3 system directly as a developer.
- If there is a "best practice" document that outlines how should the S3 product be used (on the source side) to avoid deletions, I would be open to reading that and sharing it with my partners on the source side
- If you have better ideas to accomplish this, that would be even better.
What I am experiencing is having to do a "full resync" whenever a file is deleted from S3 and I dont want to burden the development team to make changes to their system (they are quite busy.) Would be great for Fivetran to solve this without having to engage our developers.
-
-
We currently also have this problem.
-
Hi,
We ran into the same need for S3 file removal being reflected in Snowflake too.
In our case, here are the operations on the files (jsonl with .gz compression) in the bucket:
- load new files
- update older files
- delete rows in files
- delete files
The 2 last delete operations type make it impossible to use the S3 connector as it cannot provide us with these operations.
When files are deleted, you could use S3 event notifications so that each time a "s3:ObjectRemoved:Delete" event is detected you remove the data from this file in the destination.
When a row is deleted from a file, this is difficult to detect and as such adding a DELETE+INSERT pattern would be useful here. Meaning:
- Delete the records in the destination that have the same "_file" value as the updated file s3 key.
- Insert all records from the updated file
Best regards
Philippe
Please sign in to leave a comment.
Comments
4 comments