Connector Improvement: Limit number of records per day from a given data source -> useful in test environments (Answered)
As a data engineer, I want to build ETL pipelines in my DEV and QA environments before launching in PROD. Given the size of some of these data sources, I would like to limit the number of records extracted from a data source per day – while syncing the entire schema / changes to that schema.
For example – say I'm extracting from a MongoDB database to my data warehouse.
- Say the schema changes every 24 hours (wild scenario, but hey).
- Given these frequent schema changes, I would like to create some (internal / privately managed) tests in my QA environment which validate that destination schema matches some reference.
- As a consequence, I need at least 1 record to be extracted per day.
- Say this collection grows rapidly ( > 100M records/day).
- I would not like to spend resources extracting all records in my DEV and QA environments, but I still need at least 1 record per day.
- Currently, to achieve the desired outcome, I must sync these collections in their entirety.
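To illustrate the kind of QA check I'd run against a limited sample, here is a minimal sketch. It assumes the destination schema can be fetched as a simple column-to-type mapping; `schema_drift` is a hypothetical helper, not an existing connector API:

```python
def schema_drift(reference: dict, actual: dict) -> dict:
    """Compare a reference schema (column -> type) against the destination's
    actual schema and report columns added, removed, or retyped."""
    added = {c: t for c, t in actual.items() if c not in reference}
    removed = {c: t for c, t in reference.items() if c not in actual}
    changed = {
        c: (reference[c], actual[c])
        for c in reference.keys() & actual.keys()
        if reference[c] != actual[c]
    }
    return {"added": added, "removed": removed, "changed": changed}


# A QA test would assert that drift is empty (or matches expected changes):
reference = {"id": "int", "name": "string"}
actual = {"id": "int", "name": "text", "email": "string"}
drift = schema_drift(reference, actual)
```

A single extracted record per day is enough to run this check, which is why the full 100M-record sync is wasted effort in non-PROD environments.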
Proposed capabilities:
- Limit the number of records extracted from a given data source per period
- Limit the number of records extracted from a given table per period (can be used in conjunction with above)
- Determine sampling method of extracted records, per period: (Oldest, Newest, Random (per timestamp), Even)
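To make the sampling methods concrete, here is a rough sketch of the selection logic I have in mind. `sample_records` is a hypothetical function (not part of any existing connector); it assumes records carry a sortable `ts` timestamp field:

```python
import random

def sample_records(records, limit, method="newest", seed=None):
    """Return at most `limit` records for one sync period.

    method: "oldest", "newest", "random", or "even" (evenly spaced
    across the period by timestamp), matching the options requested above.
    """
    if limit <= 0:
        raise ValueError("limit must be at least 1")
    ordered = sorted(records, key=lambda r: r["ts"])
    if len(ordered) <= limit:
        return ordered
    if method == "oldest":
        return ordered[:limit]
    if method == "newest":
        return ordered[-limit:]
    if method == "random":
        # Seeded RNG so a QA run is reproducible.
        rng = random.Random(seed)
        return sorted(rng.sample(ordered, limit), key=lambda r: r["ts"])
    if method == "even":
        # Pick records evenly spaced across the sorted period.
        step = (len(ordered) - 1) / (limit - 1) if limit > 1 else 0
        return [ordered[round(i * step)] for i in range(limit)]
    raise ValueError(f"unknown sampling method: {method}")
```

With `limit=1` any method satisfies the "at least 1 record per day" requirement while the other 100M records stay in the source.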
That's a sensible feature idea for dev and QA environments. It's not a short-term roadmap item but I will start probing the demand. If this would benefit your team, please add your voice to the comments and upvote.
One question: is it feasible to sync from a subset of the data in the production database rather than directly from the full database?