Community

Boundary Slicing enhancement

Answered

Comments

2 comments

    Thanks, Saif, 

    This is a thoughtful suggestion, and we’ll review it together with engineering.

    I agree the underlying problem is real: choosing an effective slicing strategy for very large tables is not always straightforward, and customers often default to modulo because it is simple and predictable. That said, a generic deterministic slicing approach across sources would likely be expensive to build and maintain, since the best strategy depends heavily on source-specific factors such as data distribution, available statistics, indexing, and the cost of sampling or scanning large tables. In practice, equal record counts also do not always translate into equal execution time.

    We’ll take a closer look at the feasibility and whether there is a practical way to improve this area, potentially starting with better guidance or recommendations before considering a broader automated solution.

    Best regards,

    Edwin

    Hi Edwin,

    Thanks for taking the time to read my suggestion and take this to engineering.

    Yes, you’re right. Deterministic slicing across different data sources is not simple. Statistics quality, sampling cost, indexing, and data skew can all affect replication performance, and splitting a table evenly by row count definitely doesn’t mean you’ll get equal execution times.

    My goal is to make sure the right slicing strategy gets proposed, and that HVR or a similar tool can look at approaches beyond modulo slicing as a one-size-fits-all solution. I wondered whether letting slice boundaries adapt to the source could help avoid repeated scans and make ordered range access more efficient, especially on really big tables and under the right conditions. A few sampling passes run in quick succession could supply the numbers needed to calculate the boundary ranges for highly transactional tables.

    I put together a technical write-up covering what I noticed, some tradeoffs, and a possible approach in more detail:

    https://medium.com/@smsgoonersarfraz/is-your-data-replication-for-large-systems-running-slow-why-does-it-breakdown-at-scale-08fa9bd789ab

    The idea came from observing multi-billion-row replication workloads that used modulo-based slicing: the source-side table scan overhead recurred whether the restrict-refresh condition covered one month or two years. The source was a very large SQL Server table with a clustered key on four integer columns, where the leftmost key column was used for the modulo. The target location was Snowflake.
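
    To make the contrast concrete, here is a minimal Python sketch of the two kinds of per-slice filter. It is illustrative only and is not HVR syntax: the function names, the column name `k1`, and the cut points are all hypothetical. A modulo filter applies an expression to the key, so the database typically cannot seek on the clustered index and each slice ends up scanning; a range filter on the leftmost key column is something an index seek can usually satisfy.

```python
def modulo_predicates(column: str, num_slices: int) -> list[str]:
    """One WHERE-clause fragment per slice, using modulo on the key.

    Assumes non-negative integer keys (hypothetical setup).
    """
    return [f"{column} % {num_slices} = {i}" for i in range(num_slices)]


def range_predicates(column: str, boundaries: list[int]) -> list[str]:
    """One WHERE-clause fragment per slice, using half-open key ranges.

    `boundaries` are sorted interior cut points; N cut points give N + 1
    slices that together cover every key value exactly once.
    """
    if not boundaries:
        return ["1 = 1"]  # single slice covering the whole table
    preds = [f"{column} < {boundaries[0]}"]
    for lo, hi in zip(boundaries, boundaries[1:]):
        preds.append(f"{column} >= {lo} AND {column} < {hi}")
    preds.append(f"{column} >= {boundaries[-1]}")
    return preds


if __name__ == "__main__":
    print(modulo_predicates("k1", 3))
    print(range_predicates("k1", [100, 200]))
```

    The half-open ranges are a deliberate choice in this sketch: no row can fall into two slices or be missed between them.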

    There might be a good chance here to look into things like:

    •  boundary heuristics that respond to actual statistics on data distribution and density at a given point in time,
    •  slice refinement based on sampling techniques,
    •  hot-zone detection (data skew or many duplicate values on the boundary column),
    •  or even just better guidance for customers with huge tables.
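
    As a rough sketch of the sampling and hot-zone ideas above, the Python below derives cut points from a sample of the boundary column and flags values that are too frequent to balance. It assumes a sample of the column has already been collected (how to sample cheaply is source-specific and not shown), and both function names are hypothetical. This is plain equal-frequency quantile slicing, not an existing HVR feature.

```python
from collections import Counter


def slice_boundaries(sample: list[int], num_slices: int) -> list[int]:
    """Interior cut points taken at equal-frequency quantiles of the sample.

    Duplicate cut points are dropped, so a heavily skewed column can
    yield fewer slices than requested.
    """
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return []
    cuts: list[int] = []
    for i in range(1, num_slices):
        value = ordered[(i * n) // num_slices]
        if not cuts or value > cuts[-1]:
            cuts.append(value)
    return cuts


def hot_values(sample: list[int], num_slices: int) -> list[int]:
    """Values whose sampled count exceeds one slice's fair share.

    Such a value cannot be split by a boundary on this column alone,
    so it marks a hot zone that needs separate handling.
    """
    fair_share = len(sample) / num_slices
    counts = Counter(sample)
    return sorted(v for v, c in counts.items() if c > fair_share)


if __name__ == "__main__":
    skewed = [7] * 60 + list(range(40))
    print(slice_boundaries(skewed, 4))
    print(hot_values(skewed, 4))
```

    When a hot value shows up, one option would be to slice further on the next column of the key, but that is an untested suggestion rather than a measured result.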

    If your team wants to dig deeper into this, I’d love to help brainstorm, run some benchmarks, or figure out adaptive slicing strategies together.

    Thanks for being open to new ideas.

    Best, 

    Saif