Boundary Slicing enhancement
Suggesting an enhancement to HVR: introduce a deterministic boundary-slicing method that can identify variable boundary ranges and produce buckets/slices containing approximately equal record counts.
Given constraints such as network capacity, hardware limits, and available parallelism, this deterministic approach should recommend both the number of slices and the depth of each slice so workloads are distributed as evenly as possible.
A practical way to do this is similar to statistics collection in SQL Server: sample the source tables with optimized SQL, build a histogram of the slicing column, and generate slicing recommendations from the sampled distribution. Today, most HVR customers prefer modulo slicing because its configuration is simple and “set-and-forget”, so they lack the confidence to try other slicing techniques. Let me know if we can set up a discussion, as I have a suggestion for how to try this out.
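As a rough illustration of the sampling idea above, here is a minimal Python sketch of equi-depth (equal record count) boundaries computed from a sample of the slicing column. It is not HVR functionality: the function names `equi_depth_boundaries` and `slice_counts` are hypothetical, and it assumes the sample has already been pulled from the source and is representative of the full table.

```python
import bisect


def equi_depth_boundaries(sample_keys, num_slices):
    """Return up to num_slices - 1 cut points from a sample of key values.

    Slice 0 holds keys < cuts[0], slice i holds cuts[i-1] <= key < cuts[i],
    and the last slice holds keys >= cuts[-1].
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    ordered = sorted(sample_keys)
    n = len(ordered)
    if n == 0:
        return []
    cuts = []
    for i in range(1, num_slices):
        cut = ordered[(i * n) // num_slices]
        # Heavy duplicates can repeat a cut point; keep cuts strictly increasing.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def slice_counts(keys, cuts):
    """Count how many keys fall into each slice defined by the cut points."""
    counts = [0] * (len(cuts) + 1)
    for key in keys:
        counts[bisect.bisect_right(cuts, key)] += 1
    return counts


if __name__ == "__main__":
    keys = list(range(1000))          # stand-in for sampled key values
    cuts = equi_depth_boundaries(keys, 4)
    print(cuts, slice_counts(keys, cuts))
```

When a single key value dominates the sample, several cut points collapse into one and fewer slices come back than requested, which is itself a useful signal that the column is skewed.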
I am requesting that the engineering team assess the feasibility of this and improve the feature's usability by giving customers insights before they choose an approach, especially as volumes in very large tables continue to grow exponentially and this is a common problem across the board. (Food for thought: 90% of the world's data was created in the last 2 years.)
-Saif
-
Thanks, Saif,
This is a thoughtful suggestion, and we’ll review it together with engineering.
I agree the underlying problem is real: choosing an effective slicing strategy for very large tables is not always straightforward, and customers often default to modulo because it is simple and predictable. That said, a generic deterministic slicing approach across sources would likely be expensive to build and maintain, since the best strategy depends heavily on source-specific factors such as data distribution, available statistics, indexing, and the cost of sampling or scanning large tables. In practice, equal record counts also do not always translate into equal execution time.
We’ll take a closer look at the feasibility and whether there is a practical way to improve this area, potentially starting with better guidance or recommendations before considering a broader automated solution.
Best regards,
Edwin
-
Hi Edwin,
Thanks for taking the time to read my suggestion and take this to engineering.
Yes, you’re right. Deterministic slicing across different data sources is not simple. Statistics quality, sampling, indexing, and data skew can all affect replication. Splitting rows evenly by count definitely doesn’t mean you’ll get equal execution times.
My goal is to make sure the right slicing strategy is proposed, and to use HVR or a similar tool to explore approaches beyond modulo slicing as a one-size-fits-all solution. I wondered whether letting slice boundaries adapt to the source could help avoid repeated scans and make ordered range access more efficient, especially on really big tables, under the right conditions. A few sampling techniques could be run in quick succession to calculate the boundary ranges for highly transactional tables.
I put together a technical write-up covering what I noticed, some tradeoffs, and a possible approach in more detail:
The idea came from observing multi-billion-row replication workloads that used modulo-based slicing: the source-side table scan overhead recurred whether the restrict-refresh condition covered one month or two years. I used a very large SQL Server table as the source, with a clustered index on 4 integer columns, where the leftmost key column was used for modulo. The target location was Snowflake.
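To make the contrast concrete, here is a small Python sketch that builds the per-slice filter predicates for modulo slicing and for range (boundary) slicing. This is generic SQL text, not HVR's slicing syntax; the column name `k1`, the cut points, and the function names are made up for illustration. The general point is that a predicate like `k1 % 8 = 3` wraps the key in an expression, so an index on `k1` usually cannot be used to seek to the matching rows, while a range predicate on the leading key column can.

```python
def modulo_predicates(column, num_slices):
    """One WHERE predicate per slice for modulo slicing (illustrative SQL text)."""
    return [f"{column} % {num_slices} = {i}" for i in range(num_slices)]


def range_predicates(column, cuts):
    """One WHERE predicate per slice for range slicing, given sorted cut points."""
    if not cuts:
        return ["1 = 1"]  # a single slice covering the whole table
    preds = [f"{column} < {cuts[0]}"]
    for lo, hi in zip(cuts, cuts[1:]):
        preds.append(f"{column} >= {lo} AND {column} < {hi}")
    preds.append(f"{column} >= {cuts[-1]}")
    return preds


if __name__ == "__main__":
    # 'k1' stands in for the leftmost clustered key column; cut points are invented.
    for pred in modulo_predicates("k1", 3):
        print("modulo:", pred)
    for pred in range_predicates("k1", [100, 200]):
        print("range: ", pred)
```

Whether the range form actually performs better depends on the source's optimizer, the index layout, and any additional restrict conditions, so it would need to be benchmarked rather than assumed.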
There might be a good chance here to look into things like:
- boundary heuristics that respond to real statistics on the data distribution/density at a given point in time
- slice refinement based on sampling techniques
- hot-zone detection (data skew or many duplicate values on the boundary column)
- or even just better guidance for customers with huge tables.
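As one possible reading of the hot-zone item above, this minimal Python sketch flags boundary-column values that are so frequent in a sample that a single value would take up a large share of one slice. The function name `hot_values` and the `threshold` parameter are hypothetical, and the 0.5 default is an arbitrary choice for illustration, not a tuned recommendation.

```python
from collections import Counter


def hot_values(sample_keys, num_slices, threshold=0.5):
    """Return {value: count} for values whose sampled count exceeds
    threshold * (ideal rows per slice), where ideal = sample size / num_slices.
    """
    n = len(sample_keys)
    if n == 0 or num_slices < 1:
        return {}
    ideal_per_slice = n / num_slices
    counts = Counter(sample_keys)
    return {
        value: count
        for value, count in counts.items()
        if count > threshold * ideal_per_slice
    }


if __name__ == "__main__":
    # Invented sample: value 7 appears 60 times out of 100 sampled rows.
    sample = [7] * 60 + list(range(100, 140))
    print(hot_values(sample, num_slices=4))
```

A value flagged this way cannot be split by a boundary on that column alone, so it would need either its own slice or a secondary column to subdivide it.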
If your team wants to dig deeper into this, I’d love to help brainstorm, run some benchmarks, or figure out adaptive slicing strategies together.
Thanks for being open to new ideas.
Best,
Saif