Destination Improvement: Optimize Redshift Load step by adhering to guidelines on # of files used per COPY
When loading data into Redshift using the COPY command, AWS recommends that data is split into multiple files:
Long running COPY commands which load fewer files than the total number of cluster nodes (5) have the largest opportunity for improvement and should be optimized first. Any COPY which loads fewer files than the number of cluster slices (64) should be optimized next.
We have a cluster with 5 nodes, but most (all?) Fivetran COPY commands use only 4 files or less. Many only use 1 file. Long running COPY commands apparently would benefit from data being split into 5 or even 64 files to properly benefit from Redshift's massively parallel processing (MPP) architecture.
Ideally Fivetran would optimize the number of files used per COPY automatically, but it could also be useful to be able to configure this on the user side.
Here is the redshift query I'm using to determine the # of files used for Fivetran COPY commands (adapted from Redshift AWS console).
SELECT query, MAX(starttime), COUNT(*) num_files,
ROUND(MAX(wq.total_exec_time/1000000.0),2) max_execution_secs,
ROUND(MIN(wq.total_exec_time/1000000.0),2) min_execution_secs,
ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
SUBSTRING(querytxt,1,240) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
AND POSITION('COPY ANALYZE' IN querytxt) = 0
AND aborted = 0 AND final_state='Completed'
AND querytxt LIKE '%Fivetran%'
GROUP BY query, querytxt
HAVING (SUM(transfer_size)/(1024*1024))/COUNT(*) >= 2
ORDER BY CASE
WHEN COUNT(*) < (SELECT max(node)+1 FROM stv_slices) THEN 1
WHEN COUNT(*) < (SELECT COUNT(*) FROM stv_slices WHERE node=0) THEN 2
ELSE 2+((COUNT(*) % (SELECT COUNT(*) FROM stv_slices))/(SELECT COUNT(*)::DECIMAL FROM stv_slices))
END, (SUM(transfer_size)/(1024.0*1024.0))/COUNT(*) DESC;
Please sign in to leave a comment.
Comments
0 comments