Connector Improvement: S3 connector: Honor Parquet schema types instead of inferring from values
Answered
Current behavior:
The S3 connector ignores Parquet schema type declarations and instead infers column types from observed values. For example, a column declared as VARCHAR in the Parquet schema will be created as BIGINT in the destination if the observed values happen to be numeric. This behavior was confirmed as intentional by Fivetran support (ticket #331726).
Problem:
This creates two issues:
1. Silent data corruption - Identifier fields (IINs, zip codes, account numbers) can have leading zeros. When inferred as numeric, "034567" becomes 34567. Any records with leading zeros ingested before a type promotion are permanently corrupted.
2. Non-deterministic schema - The destination type depends on which values Fivetran happens to process first, not on the source schema. The same source data can produce different destination schemas depending on row ordering.
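To make both issues concrete, here is a minimal sketch using a toy inference rule. The functions `infer_type` and `ingest` are hypothetical and only model the behavior described above. They are not Fivetran's actual implementation, and the sample values are made up.

```python
def infer_type(values):
    # Toy rule (an assumption, not Fivetran's real logic):
    # a batch of digit-only strings looks numeric.
    return "BIGINT" if all(v.isdigit() for v in values) else "VARCHAR"


def ingest(batches):
    """Ingest batches in order, widening BIGINT to VARCHAR when needed."""
    column_type = None
    stored = []
    for batch in batches:
        batch_type = infer_type(batch)
        if column_type is None:
            column_type = batch_type
        elif column_type == "BIGINT" and batch_type == "VARCHAR":
            # Type promotion: earlier rows were already stored as integers,
            # so their leading zeros cannot be recovered.
            column_type = "VARCHAR"
            stored = [str(v) for v in stored]
        if column_type == "BIGINT":
            stored.extend(int(v) for v in batch)
        else:
            stored.extend(batch)
    return column_type, stored


# Same source values, different arrival order.
numeric_first = ingest([["034567", "412345"], ["A98765"]])
text_first = ingest([["A98765"], ["034567", "412345"]])
only_numeric = ingest([["412345"]])
```

In this sketch, `numeric_first` stores "34567" (the leading zero is gone even after promotion), `text_first` keeps "034567" intact, and `only_numeric` ends up as BIGINT even though the source declares VARCHAR.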
Requested behavior:
When a source schema is available (Parquet, Avro, etc.), use the declared types. Fall back to inference only for schema-less formats (CSV, JSON without schema).
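A rough sketch of the requested rule, assuming the connector knows the file format and the declared column type. The names `SCHEMA_FORMATS` and `resolve_type` are hypothetical, as is the simplified inference fallback.

```python
# Formats that carry an embedded schema (illustrative list, not exhaustive).
SCHEMA_FORMATS = {"parquet", "avro"}


def resolve_type(file_format, declared_type, values):
    """Prefer the declared type; infer only when no schema is available."""
    if file_format in SCHEMA_FORMATS and declared_type is not None:
        return declared_type
    # Fallback for schema-less formats: toy value-based inference.
    if values and all(v.isdigit() for v in values):
        return "BIGINT"
    return "VARCHAR"


parquet_result = resolve_type("parquet", "VARCHAR", ["412345", "987654"])
csv_result = resolve_type("csv", None, ["412345", "987654"])
```

With this rule, the Parquet column stays VARCHAR regardless of the observed values, while the CSV column is still inferred as before.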
Use case:
We ingest Parquet files from a payment processor. The source declares identifier columns as VARCHAR because they can have leading zeros. The S3 connector inferred BIGINT because our data happened to not contain leading zeros yet.
Support resolved this for our specific columns, but the underlying behavior affects any VARCHAR column that contains only digits.
Official comment
Hi Dan,
Thank you for filing this request. Honouring Parquet data types in our file connectors is on our short term roadmap. We will post further updates here as we get closer to a release.
Best,
Parmeet