
Destination Improvement: Default Clustering in BigQuery for Common Connectors


Comments

2 comments

  • Official comment

    Hi Justin,

    Thanks for the request! Kevin from the product team here.

    As you mention, we recently added support for clustered tables in BigQuery. We don't currently create clusters automatically, as we typically do not know how users want to partition or cluster their datasets.

    In general, our primary product principles emphasize ease of use and one predictable, default choice. If we were to allow users to set clustering options from our UI, we would want to keep the interface as simple as possible, with minimal choices presented in the front end.

    I agree that most customers should cluster, but our engineering team needs to research what the default might be in this scenario. We are currently leaning towards providing documentation on best practices for BigQuery warehouse management, but not making it configurable from the BigQuery setup form.

    Let me know your feedback on this! It could definitely be valuable but I would love to hear your thoughts on how we could automate this.

    Kevin Kim, thanks for the response here. We've used a lot of platforms and have also written our own integrations, so I get that a lot of the details here aren't trivial when it comes to systems design.

    I think part of what has to be weighed is the trade-offs involved and where the points of demarcation are within the user experience—which is also closely tied to user expectation. I also agree that users should have the ability to choose their own strategies if they have the expertise to do so.

    A few thoughts on this that I hope might be helpful as an outside perspective:

    1. You mentioned "one predictable, default choice" which I wholeheartedly agree with—but UI and architecture decisions don't happen in a vacuum. If you expand that predictability to encompass the full user experience, things like warehouse costs (where the current default is unpredictable) and how competing platforms work come into play as well.

      This is one of the concerns I've seen come up several times in the #Measure Slack community among new users who demo Fivetran—cost, not just from the platform but in BigQuery. Some of that is because they're used to append-only platforms that don't incur the query I/O Fivetran does when its MERGE queries run. Others don't know enough about BigQuery to optimize for costs—they expect those things to be managed for them.

    2. For teams where engineers aren't the ones maintaining the Fivetran integrations, this also presents a workflow challenge. What does that workflow look like to control costs? I would assume a marketer sets up the integration and pauses it, a data engineer determines the optimal schema for each of tens or hundreds of tables (without seeing the data?) and sets up the appropriate partitioning and clustering, then notifies the marketer, who re-enables the integration, and then everybody waits for the backfill to finish to see whether the performance characteristics are within bounds. (I've sketched roughly what that manual pre-create step looks like below, after this list.)

      . . . except that in our onboarding we were all told not to pause or mess with connectors during the historical backfill or we may end up getting charged MAR for them. But since some large backfills could take weeks or months, it's feasible to see costs of hundreds or thousands of dollars accumulate from MERGE statements against those large tables before tables can be updated to control those costs.

      Then a new table gets added and backfilled. Or another client needs to be set up. It's a time-consuming process to manage manually, especially when you have data engineers who may not be familiar with the schemas the synced systems use.

    3. On that note, because BigQuery has very few knobs and dials to turn, it's attracted a crowd of people who either don't need to or don't know how to manage the underlying data efficiently (especially now that there are many more choices between expanded clustering support and time-unit partitioning). I've seen a lot of folks with tons of experience in other platforms do very foolish things in BigQuery.

      Now, generally, that shouldn't be Fivetran's problem—but with marketing like "Focus on analytics, not engineering", you're bound to get marketers and analysts using the platform who don't have the experience necessary to control costs, and those folks will walk away complaining about it without ever realizing there was a way to fix it. It's been said that "marketing is a promise"—so this is one of those areas where user expectation has to factor into the default system design.

      And just because I'm comfortable managing schema doesn't mean the client I might refer to Fivetran is (I can't name a single one of ours that would have the expertise on staff). If their use case is BigQuery for analysis and visualization through Data Studio, they're just plain unlikely to have needed any deep knowledge of the platform.

      So documentation of best practices is nice, but it doesn't really resolve those realities for users. And it doesn't give that predictable, turnkey solution your marketing is targeted at.
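
    To make the workflow in point 2 concrete, here's roughly what that manual pre-create step looks like with the BigQuery Python client (project, table, and column names are made up, and whether a given connector adopts a pre-created table unchanged is an assumption on my part):

    ```python
    # Sketch of the manual "pre-create before backfill" step from point 2.
    # All names here are hypothetical; adjust to whatever the connector actually syncs.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-analytics-project")  # hypothetical project

    table = bigquery.Table("my-analytics-project.salesforce.opportunity")  # hypothetical table
    table.schema = [
        bigquery.SchemaField("id", "STRING"),              # key the connector's MERGE matches on
        bigquery.SchemaField("account_id", "STRING"),
        bigquery.SchemaField("created_date", "TIMESTAMP"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ]

    # Month-level time-unit partitioning keeps even long histories well under the partition limit.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.MONTH,
        field="created_date",
    )

    # Clustering on the merge key lets each incremental MERGE prune blocks
    # instead of scanning the whole table.
    table.clustering_fields = ["id", "account_id"]

    client.create_table(table, exists_ok=True)
    ```

    Multiply that by tens or hundreds of tables per connector, per client, and you can see why I'd rather this be a connector default than a manual step.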

    At the end of the day, it seems like y'all actually have people who really know BigQuery (unlike most ETL platforms out there!). You also know the schemas being used, which the data engineer working with the data won't. I think that puts you in a position to use that expertise to improve the user experience, minimize costs and effort for your users, and maximize their satisfaction with the product.

    From an automation standpoint, because every connector has a defined schema, you already know exactly how that data is merged and the types of keys used. You probably also have a good concept of the use cases that would represent common userland queries. From there, a default clustering/partitioning strategy could be formed that improves cost and performance for the majority of users (whether enabled by default or not).

    Trade-offs could be made to prioritize compatibility over perf/cost (like when date partitioning is involved, using a time unit like Week or Month to avoid any chance of hitting the partition limit—but at least you still get the benefit of long-term storage for most data). This could be rolled out based on the popularity of connectors, or prioritize those that tend to have larger amounts of data and would benefit from it the most. Those defaults could then be documented on the ERD and connector details.

    Going forward, this could become part of the connector development and schema design process so people can benefit from it on day one. (I imagine you'd get fewer complaints from something your team designs than from people concerned about warehouse costs or who break connectors trying to get this right.)
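
    As a toy illustration of deriving a default from schema metadata you already have (the ConnectorTable shape and the heuristics below are entirely hypothetical; the point is just that the primary keys plus a created-style timestamp are usually enough to pick a sane layout):

    ```python
    # Toy sketch: derive a default clustering/partitioning layout from connector
    # schema metadata. The data shape and heuristics are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class ConnectorTable:           # hypothetical stand-in for connector schema metadata
        name: str
        primary_keys: list[str]     # keys the MERGE statement matches on
        timestamp_columns: list[str]

    def default_layout(table: ConnectorTable) -> dict:
        # Cluster on the merge keys (BigQuery allows up to four clustering columns),
        # so incremental MERGEs prune blocks instead of scanning the whole table.
        clustering = table.primary_keys[:4]

        # Prefer a stable creation-style timestamp for partitioning; MONTH granularity
        # keeps even very long histories far from the partition limit.
        partition_field = next(
            (c for c in table.timestamp_columns if "created" in c.lower()), None
        )
        return {
            "clustering_fields": clustering or None,
            "partition_field": partition_field,
            "partition_granularity": "MONTH" if partition_field else None,
        }

    # e.g. default_layout(ConnectorTable("opportunity", ["id"], ["created_date", "updated_at"]))
    # -> {'clustering_fields': ['id'], 'partition_field': 'created_date', 'partition_granularity': 'MONTH'}
    ```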

    There's a great marketing aspect here too—with the existing logging support, you should be able to quantify what it's saving users both in terms of cost and query time if you get to the point where you trial this. From my experience clustering the first couple of our tables that couldn't be partitioned previously, the results for both cost and speed would be pretty significant for normal use-cases.
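
    If you do get to the point of trialing this, the before/after should be measurable straight from query history. Something like the following sketch (assuming the region's INFORMATION_SCHEMA.JOBS_BY_PROJECT view is enabled for you and using column names from the BigQuery docs; the MERGE filter and 30-day window are just illustrative):

    ```python
    # Rough sketch: summarize bytes billed and slot time for MERGE jobs over the last
    # 30 days, as a baseline to compare against after clustering defaults are applied.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-analytics-project")  # hypothetical project

    sql = """
    SELECT
      DATE(creation_time)                   AS day,
      COUNT(*)                              AS merge_jobs,
      SUM(total_bytes_billed) / POW(10, 12) AS tb_billed,
      SUM(total_slot_ms) / 1000 / 3600      AS slot_hours
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE statement_type = 'MERGE'
      AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY day
    ORDER BY day
    """

    for row in client.query(sql).result():
        print(row.day, row.merge_jobs, round(row.tb_billed, 3), round(row.slot_hours, 1))
    ```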

    I'm pretty realistic about the considerations and complexities of features like this (I've worn the developer, data architect, and analyst hats—so I know the pain is real!) . . . but this is something I've seen SO many other platforms get wrong that I'm cheering y'all along to hopefully set a better precedent as you've done in many other areas.

    Always happy to connect further if it's of any help to you or the team there.