
More compact data for cloud functions connector than JSON

Answered


Comments

4 comments

  • Official comment

    Hi All - `has_more` is not just a workaround; it's our best practice for building reliable data connectors. It forces you to break the extraction into smaller steps and allows the connector to save progress frequently. The saved progress enables fast failure recovery. Internally, Fivetran does the same for our own connectors.

    That being said, I understand that there are data sources where one long extraction is much more efficient than many smaller ones. This is often the case for databases that sort on-disk heaps, e.g. Postgres. We use many tricks in database extraction to avoid these single long queries at all costs. We are considering improvements to Functions later in the year to make them easier to develop and to enable more throughput. We will keep your feedback in mind!

    We faced a similar problem and solved it by using the `has_more` feature of the cloud function connector.

    We then simply divide our data up and keep returning `has_more` along with response chunks until the response is complete.
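    The chunking pattern described above can be sketched roughly as follows. This is a minimal illustration, not Fivetran's reference implementation: the `my_table` name, the `fetch_page` data accessor, and the page size are all hypothetical, and the response keys (`state`, `insert`, `hasMore`) follow the general shape of the function-connector response format.

```python
def handler(state, fetch_page, page_size=10_000):
    """Return one chunk of rows plus a cursor; the caller invokes the
    function again for the next chunk as long as hasMore is true."""
    cursor = state.get("cursor", 0)
    # fetch_page is a hypothetical accessor: rows starting at `cursor`.
    rows = fetch_page(cursor, page_size)
    # A short page means we have drained the source.
    has_more = len(rows) == page_size
    return {
        "state": {"cursor": cursor + len(rows)},
        "insert": {"my_table": rows},
        "hasMore": has_more,
    }
```

    Each invocation stays small and checkpoints its cursor in `state`, which is what makes the frequent-progress-saving behavior in the official comment possible.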

    Hi, product manager here! Thanks for the clear description and smart suggestions. Another thought is that Fivetran could host the function itself to bypass the limit. This is not on our roadmap today, but that can change with more input. How are you trying to use Google Cloud when you hit this limit, and have you found any workarounds to be viable yet?

    Hi Alexander,

    There is no workaround; I've been doing what Tom suggested and using `has_more`. But given our data volume, I'd much rather batch my queries to push more data through. It's more efficient to have 3 cloud function calls return 100k rows of data each than 10 function calls return 30k rows each.


    > How are you trying to use Google Cloud when you hit this limit

    We're simply grabbing rows out of our Spanner database with a cloud function and formatting them per your docs.
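    The Spanner extraction the commenter describes is typically done with keyset (cursor) pagination. A rough sketch, with hypothetical table and column names; `run_query` stands in for whatever executes the SQL (in a real function it would wrap `database.snapshot().execute_sql(...)` from `google-cloud-spanner`):

```python
def next_page(run_query, cursor, limit):
    """Fetch up to `limit` rows whose primary key is past `cursor`,
    ordered by key, and return the rows plus the advanced cursor."""
    sql = ("SELECT id, payload FROM orders "
           "WHERE id > @cursor ORDER BY id LIMIT @limit")
    rows = run_query(sql, {"cursor": cursor, "limit": limit})
    # Advance the cursor to the last key seen; unchanged when exhausted.
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor
```

    Keying on the primary key rather than `OFFSET` keeps each page query cheap, which matters when the connector has to issue many small calls.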


    Another idea I have is allowing for parallel cloud function calls, with a partition index. Spanner, Bigtable, etc. all scale horizontally. If, in addition to `state`, you could call my cloud function with a partition index (e.g. with 8 configured partitions, letting us handle 8 concurrent calls, one for each partition), we could also juice our data throughput.

