
More compact data for cloud functions connector than JSON

Answered


Comments

4 comments

  • Official comment

    Hi All - `has_more` is not just a workaround; it's our best practice for building reliable data connectors. It forces you to break the extraction into smaller steps and allows the connector to save progress frequently. The saved progress enables fast failure recovery. Internally, Fivetran does the same for our own connectors.

    That being said, I understand that there are data sources where one long extraction is much more efficient than many smaller ones. This is often the case for databases that sort on-disk heaps, e.g. Postgres. We use many tricks in database extraction to avoid these single long queries at all costs. We are considering improvements to Functions later in the year to make them easier to develop and to enable more throughput. We will keep your feedback in mind!

    We faced a similar problem and solved it by using the `has_more` feature of the cloud function connector.

    We then simply divide our data up and keep returning `has_more` along with response chunks until the response is complete.
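    The chunking pattern described above can be sketched roughly as follows. This is a minimal illustration, not Fivetran's reference implementation: the `my_table` name, the `fetch_page` data accessor, and the page size are all hypothetical, and the response keys (`state`, `insert`, `hasMore`) follow the general shape of the function-connector response format.

```python
def handler(state, fetch_page, page_size=10_000):
    """Return one chunk of rows plus a cursor; the caller invokes the
    function again for the next chunk as long as hasMore is true."""
    cursor = state.get("cursor", 0)
    # fetch_page is a hypothetical accessor: rows starting at `cursor`.
    rows = fetch_page(cursor, page_size)
    # A short page means we have drained the source.
    has_more = len(rows) == page_size
    return {
        "state": {"cursor": cursor + len(rows)},
        "insert": {"my_table": rows},
        "hasMore": has_more,
    }
```

    Each invocation stays small and checkpoints its cursor in `state`, which is what makes the frequent-progress-saving behavior in the official comment possible.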

    Hi, product manager here! Thanks for the clear description and smart suggestions. Another thought is that Fivetran could host the function itself to bypass the limit. This is not on our roadmap today, but that can change with more input. How are you trying to use Google Cloud when you hit this limit, and have you found any workarounds to be viable yet?

    Hi Alexander,

    There is no workaround; I've been doing what Tom suggested and using `has_more`. But given our data volume, I'd much rather batch my queries to push more data through. It's more efficient to have 3 cloud function calls return 100k rows of data each than 10 function calls return 30k rows each.


    > How are you trying to use Google Cloud when you hit this limit

    We're simply grabbing rows out of our Spanner database with a cloud function and formatting them per your docs.
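    The Spanner extraction the commenter describes is typically done with keyset (cursor) pagination. A rough sketch, with hypothetical table and column names; `run_query` stands in for whatever executes the SQL (in a real function it would wrap `database.snapshot().execute_sql(...)` from `google-cloud-spanner`):

```python
def next_page(run_query, cursor, limit):
    """Fetch up to `limit` rows whose primary key is past `cursor`,
    ordered by key, and return the rows plus the advanced cursor."""
    sql = ("SELECT id, payload FROM orders "
           "WHERE id > @cursor ORDER BY id LIMIT @limit")
    rows = run_query(sql, {"cursor": cursor, "limit": limit})
    # Advance the cursor to the last key seen; unchanged when exhausted.
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor
```

    Keying on the primary key rather than `OFFSET` keeps each page query cheap, which matters when the connector has to issue many small calls.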


    Another idea I have is allowing for parallel cloud function calls, with a partition index. Spanner, Bigtable, etc. all scale horizontally. If, in addition to `state`, you could call my cloud function with a partition index (e.g. with 8 configured partitions, letting us handle 8 concurrent calls, one for each partition), we could also juice our data throughput.

