As a data engineer, I want to build ETL pipelines in my DEV and QA environments before launching in PROD. Given the size of some of these data sources, I would like to limit the number of records extracted from a data source per day – while still syncing the entire schema and any changes to that schema.
For example – say I'm extracting from a MongoDB database to my data warehouse.
- Say the schema changes every 24 hours (wild scenario, but hey).
- Given these frequent schema changes, I would like to create some (internal / privately managed) tests in my QA environment which validate that destination schema matches some reference.
- As a consequence, I need at least 1 record to be extracted per day.
- Say this collection grows rapidly ( > 100M records/day).
- I would not like to spend resources extracting all records in my DEV and QA environments, but I still need at least 1 record per day.
- Currently, to achieve the desired outcome, I must sync these collections over in their entirety.
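The QA check described above only needs one freshly extracted record per day to compare the destination schema against a reference. A minimal sketch of that comparison, in Python – `schema_diff`, the flat column-name-to-type-name representation, and the type names themselves are all made up for illustration, not an existing connector API:

```python
def schema_diff(reference: dict, observed: dict) -> dict:
    """Compare an observed destination schema against a reference schema.

    Both arguments map column names to type names, e.g. {"_id": "string"}.
    Returns the fields that were added, removed, or retyped; an empty
    diff means the destination still matches the reference.
    """
    added = {k: observed[k] for k in observed.keys() - reference.keys()}
    removed = {k: reference[k] for k in reference.keys() - observed.keys()}
    changed = {
        k: (reference[k], observed[k])        # (expected type, actual type)
        for k in reference.keys() & observed.keys()
        if reference[k] != observed[k]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

A QA test would then derive `observed` from the single record synced that day and fail the pipeline whenever the diff is non-empty.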
What I would like to be able to do:
- Limit the number of records extracted from a given data source per period
- Limit the number of records extracted from a given table per period (can be used in conjunction with above)
- Determine the sampling method for extracted records, per period: Oldest, Newest, Random (by timestamp), or Even (evenly spaced by timestamp)
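The per-period limit and the four sampling methods above could be sketched as follows – a pure-Python illustration over in-memory records, assuming each record carries a timestamp field; the function name, `ts_key` parameter, and method names are hypothetical, not part of any existing tool:

```python
import random

def sample_records(records, limit, method="oldest", ts_key="ts", seed=None):
    """Pick at most `limit` records for one sync period.

    Supported methods (mirroring the proposal): "oldest", "newest",
    "random" (uniform draw across the period), and "even" (picks evenly
    spaced across the timestamp-ordered records).
    """
    ordered = sorted(records, key=lambda r: r[ts_key])
    if limit >= len(ordered):
        return ordered                      # limit is a cap, not a quota
    if method == "oldest":
        return ordered[:limit]              # first `limit` by timestamp
    if method == "newest":
        return ordered[-limit:]             # last `limit` by timestamp
    if method == "random":
        rng = random.Random(seed)           # seedable for reproducible QA runs
        return sorted(rng.sample(ordered, limit), key=lambda r: r[ts_key])
    if method == "even":
        step = (len(ordered) - 1) / (limit - 1) if limit > 1 else 0
        return [ordered[round(i * step)] for i in range(limit)]
    raise ValueError(f"unknown sampling method: {method}")
```

With `limit=1` this degrades to exactly the "at least 1 record per day" case, while PROD could simply pass a limit larger than the collection to sync everything.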