GaussDB(DWS)性能调优:row_number()场景性能优化
【摘要】 特殊场景的row_number()性能优化
很多业务场景里面有类似如下SQL实现获取每组的最大或者最小值的目的
SELECT
s.reconcile_method, s.trans_entity_bank_acc_id
FROM (
SELECT tms.trans_entity_bank_acc_id, tms.source AS reconcile_method
FROM (SELECT
tms.trans_entity_bank_acc_id, bank_st.source,
row_number() OVER(PARTITION BY tms.trans_entity_bank_acc_id ORDER BY bank_st.statement_date DESC) cnt
FROM dwrtretl.dwr_tr_tms_settle_f_tmp6 tms,
(SELECT bank_st.source, bank_st.bank_account_id, bank_st.statement_date
FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st
WHERE 1 = 1 AND bank_st.del_flag = 'N') bank_st
WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id
AND tms.settlement_date <= bank_st.statement_date
) tms WHERE tms.cnt = 1
)s
如上的SQL语句在执行的时候会导致操作的数据集按照PARTITION BY和ORDER BY指定的组合字段做排序,具体计划如下
排序是一个高内存、低性能的操作动作,当数据量很大时,排序动作(Sort算子)会导致极大耗时。上述SQL的执行信息如下,其中Sort算子的耗时占比为(286227.739-83911.249)/287404.461 = 70.39%
因此我们一般应当避免大数据量(超千万级)的排序。当出现因为row_number导致执行执行慢的场景时,一般的优化思路有两个
- 减小row_number操作处理的数据量。适用场景一般是row_number操作的数据源重复度比较高的场景
- 通过等价改写的方式避免排序操作。适用场景一般是输出列和row_number的PARTITION BY和ORDER BY列的子集的场景
根据实际业务SQL的满足情况,优先检视建议方案1
场景一:通过提前去重操作,降低row_number操作的数据量
原始SQL
SELECT
s.reconcile_method, s.trans_entity_bank_acc_id
FROM (
SELECT tms.trans_entity_bank_acc_id, tms.source AS reconcile_method
FROM (SELECT
tms.trans_entity_bank_acc_id, bank_st.source,
row_number() OVER(PARTITION BY tms.trans_entity_bank_acc_id ORDER BY bank_st.statement_date DESC) cnt
FROM dwrtretl.dwr_tr_tms_settle_f_tmp6 tms,
(SELECT bank_st.source, bank_st.bank_account_id, bank_st.statement_date
FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st
WHERE 1 = 1 AND bank_st.del_flag = 'N') bank_st
WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id
AND tms.settlement_date <= bank_st.statement_date
) tms WHERE tms.cnt = 1
)s
优化后SQL
SELECT
s.reconcile_method, s.trans_entity_bank_acc_id
FROM (
SELECT
trans_entity_bank_acc_id, source AS reconcile_method,
row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
FROM (
SELECT
tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id
AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
)t
)s WHERE rn = 1
优化后的执行情况
场景二:反向关联,降低row_number操作的数据量
优化后SQL
WITH t AS( /* row_number要处理的结果集 */
SELECT
tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
FROM dwnodi.dwr_fin_tr_bank_statement_fi bank_st, dwrtretl.dwr_tr_tms_settle_f_tmp6 tms
WHERE tms.trans_entity_bank_acc_id = bank_st.bank_account_id
AND tms.settlement_date <= bank_st.statement_date AND bank_st.del_flag = 'N'
GROUP BY tms.trans_entity_bank_acc_id, bank_st.source, bank_st.statement_date
),
m AS( /* 求满足rn = 1的PARTITION BY列和ORDER BY列 */
SELECT
trans_entity_bank_acc_id, /* PARTITION BY列 */
max(statement_date) AS statement_date /* ORDER BY列 */
FROM t
GROUP BY trans_entity_bank_acc_id
)
SELECT
s.reconcile_method, s.trans_entity_bank_acc_id
FROM (
SELECT
trans_entity_bank_acc_id, source AS reconcile_method,
row_number() OVER(PARTITION BY trans_entity_bank_acc_id ORDER BY statement_date DESC) AS rn
FROM ( /* 跟原始结果集反关联,获取row_number()要处理的最小结果集 */
SELECT t.*
FROM t
INNER JOIN m ON m.trans_entity_bank_acc_id = t.trans_entity_bank_acc_id
AND m.statement_date = t.statement_date
)f
)s WHERE rn = 1
优化后的执行情况
【版权声明】本文为华为云社区用户原创内容,转载时必须标注文章的来源(华为云社区)、文章链接、文章作者等基本信息, 否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)