Fivetran connects to all of your supported data sources and loads the data from them into your destination. Each data source has one or more connectors that run as independent processes that persist for the duration of one update. A single Fivetran account, made up of multiple connectors, loads data from multiple data sources into one or more destinations.
System Architecture Overview Diagram
Connect
Fivetran connects to your data sources using our connectors. Fundamentally, there are two different types of connectors: push and pull.
Pull connectors
Fivetran’s pull connectors actively retrieve, or pull, data from a source. Fivetran connects to and downloads data from a source system at a fixed frequency. We use an SSL-encrypted connection to the source system to retrieve data using a variety of methods: database connections via ODBC/JDBC, or web service APIs via REST and SOAP. In practice, the method or combination of methods is different for every source system.
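As a sketch of the pull pattern only: the loop below pages through a source until its cursor is exhausted. The `fetch_page` function is a hypothetical stand-in for a real ODBC/JDBC query or REST/SOAP call, not actual Fivetran code.

```python
def fetch_page(cursor):
    """Hypothetical stand-in for one paginated source call
    (e.g. GET /records?cursor=...). Returns (rows, next_cursor);
    next_cursor is None on the last page."""
    pages = {None: ([{"id": 1}, {"id": 2}], "p2"),
             "p2": ([{"id": 3}], None)}
    return pages[cursor]

def pull_once():
    """One scheduled update: page through the source until exhausted."""
    rows, cursor = [], None
    while True:
        batch, cursor = fetch_page(cursor)
        rows.extend(batch)
        if cursor is None:
            return rows

# A scheduler would call pull_once() at the connector's fixed
# update frequency, e.g. every 15 minutes.
rows = pull_once()
```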
Push connectors
With push connectors, such as Webhooks or Snowplow, source systems send data to Fivetran as events. The push connector pipeline is as follows:
- When our collection service receives the events, we first buffer them in a queue.
- We store the event data as JSON in our cloud storage buckets. (For more information, see our data retention documentation.)
- During the sync, we push the data to your destination.
For more information on how sync works in our push connectors, see our Events documentation.
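The buffer-then-store flow above can be sketched roughly as follows. The in-memory queue and dict-backed bucket are stand-ins for the real collection service and cloud storage, and the event shapes are invented for illustration.

```python
import json
from collections import deque

event_queue = deque()   # stand-in for the collection service's buffer
bucket = {}             # stand-in for a cloud storage bucket

def receive_event(payload: dict) -> None:
    """Collection endpoint: buffer each incoming event first."""
    event_queue.append(payload)

def flush_to_storage(batch_id: str) -> None:
    """Drain the buffer and persist the events as JSON in the bucket."""
    events = []
    while event_queue:
        events.append(event_queue.popleft())
    bucket[batch_id] = json.dumps(events)

receive_event({"type": "page_view", "url": "/pricing"})
receive_event({"type": "click", "target": "signup"})
flush_to_storage("batch-0001")
# A later sync step would push the bucket contents to the destination.
```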
Ingest and Prepare Data
Once the connector process ingests the query results, Fivetran normalizes, cleans, sorts, and de-duplicates the data. The purpose of normalization and cleaning is to format the data optimally for the destination. (Learn more about this optimization here.)
The Fivetran philosophy is to replicate source data faithfully, applying only the transformations necessary to make it useful.
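For illustration only, a minimal preparation step might normalize column names, de-duplicate on a primary key, and sort the output; the actual rules Fivetran applies vary by source and are not shown here.

```python
def prepare(rows):
    """Normalize column names, drop duplicates by primary key
    (last write wins), and sort so the load is deterministic."""
    deduped = {}
    for row in rows:
        clean = {k.strip().lower(): v for k, v in row.items()}  # normalize keys
        deduped[clean["id"]] = clean                            # later rows win
    return sorted(deduped.values(), key=lambda r: r["id"])

raw = [{"ID": 2, "Name": "b"}, {"Id": 1, "Name": "a"}, {"id": 2, "Name": "b2"}]
prepared = prepare(raw)
```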
Fivetran uses a queue to buffer the incoming source data. When a load fails because of a transient error or destination unavailability, Fivetran’s pipeline doesn’t re-retrieve from the source the data it already holds in the queue. This limits the impact of destination outages and improves Fivetran’s reliability. When we find data left unprocessed in the storage queue because of destination load failures, we process that pending queued data first.
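A rough sketch of this recovery behavior, with a plain in-memory queue standing in for Fivetran's storage queue:

```python
from collections import deque

pending = deque()   # storage queue of batches awaiting load

def run_update(new_batches, destination_up):
    """Queue the newly ingested batches, then drain oldest-first.
    If the destination is down, nothing is lost: the batches stay
    queued and are processed before newer data on the next update."""
    pending.extend(new_batches)
    loaded = []
    while pending and destination_up:
        loaded.append(pending.popleft())
    return loaded

first = run_update(["batch-1", "batch-2"], destination_up=False)  # outage
second = run_update(["batch-3"], destination_up=True)             # recovery
```

On recovery, the pending batches load ahead of the newly ingested one, preserving order.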
During the ingestion process, we retain the buffered data, encrypted at rest with a secret ephemeral key, until we successfully load the data into the destination.
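To illustrate the ephemeral-key lifecycle only (this is not Fivetran's actual cipher), the toy XOR keystream below stands in for real authenticated encryption such as AES-GCM:

```python
import secrets

def xor_stream(data: bytes, key: bytes) -> bytes:
    """Toy XOR keystream; a stand-in for a real cipher like AES-GCM."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# The ephemeral key exists only in this process for the life of one update.
ephemeral_key = secrets.token_bytes(32)

record = b'{"id": 1, "name": "a"}'
buffered = xor_stream(record, ephemeral_key)      # what sits encrypted at rest

# ...once the destination load succeeds...
plaintext = xor_stream(buffered, ephemeral_key)   # decrypted for the load
ephemeral_key = None                              # key discarded after the update
```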
The ingestion processes run in parallel with the preparation and load processes. This strategy ensures that the destination data load process doesn’t block the source data ingestion process.
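The non-blocking relationship between the two processes can be sketched as a producer thread and a consumer thread sharing a queue (a simplification of the real pipeline):

```python
import queue
import threading

buffer = queue.Queue()
loaded = []

def ingest(batches):
    """Producer: pulls from the source and buffers, never waiting
    on the destination load."""
    for batch in batches:
        buffer.put(batch)
    buffer.put(None)                # sentinel: ingestion finished

def load():
    """Consumer: drains the buffer and writes to the destination."""
    while (batch := buffer.get()) is not None:
        loaded.append(batch)

ingestor = threading.Thread(target=ingest, args=([["r1"], ["r2"], ["r3"]],))
loader = threading.Thread(target=load)
ingestor.start(); loader.start()
ingestor.join(); loader.join()
```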
Load Data into Temporary Data Storage
Fivetran outputs the finalized records to a file in a file storage bucket. We encrypt this file with a separate ephemeral key that is known only to the process performing the write. We automatically delete this temporary file after 7 days using an expiration policy on the bucket. The bucket service depends on the destination.
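The exact form of the expiration policy depends on the bucket service. For example, a Google Cloud Storage-style lifecycle policy (the format accepted by `gsutil lifecycle set`) that deletes objects 7 days after creation could look like this:

```python
import json

# Delete any object in the bucket 7 days after its creation.
policy = {
    "lifecycle": {
        "rule": [
            {"action": {"type": "Delete"}, "condition": {"age": 7}}
        ]
    }
}
print(json.dumps(policy, indent=2))
```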
Write Data into Destination
From the temporary data storage, Fivetran copies the file into staging tables in the destination. In the process, we transmit the ephemeral encryption key for the file to the destination so it can decrypt the data as it arrives. Before we write the data into the destination, we update the schema of existing tables to accommodate the incoming batch of data. We then merge the data from the staging tables with the existing data present in the destination. Finally, we apply the deletes (if any) on the existing tables. Once we complete the write process, the connector process terminates. A system scheduler later restarts the process for the next update.