GaussDB长事务分析
- 故障现象
长事务在数据库的表现是session持续时间长,期间可能伴随cpu、内存升高,严重可导致数据库整体响应缓慢,业务无法正常运行。
- 故障原因
可能原因有以下:
- 大量的锁竞争。
- 执行了比较耗时的SQL。
- 处理方法
步骤 1 登录CN节点,gsql连接数据库。
gsql -d postgres -p 8000 -U user -W password -r
说明:
集中式实例登录主DN节点。
步骤 2 查询正在运行的SQL语句详细信息,runtime表示SQL语句运行的时间,超过阈值时间或者当前语句执行时间超过预期,即可认为是长事务,以下统称长事务。
分布式:
execute direct on all 'select current_timestamp - query_start as runtime,datname,usename,sessionid,substr(query,0,100) from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1 order by 1 desc;';
集中式:
select current_timestamp - query_start as runtime,datname,usename,sessionid,substr(query,0,100) from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1 order by 1 desc;

如上图,查询到sessionid=52246的长事务。
说明:
1. $user和$database需根据实际情况修改。
2. extract(epoch from current_timestamp-xact_start)/60 > 1表示查询执行时间大于1min的SQL,可根据实际情况修改。
步骤 3 查看数据库中正在运行的等待事件。
分布式:
execute direct on all 'select * from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1);';
集中式:
select * from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1);

如上图,查询到sessionid=52246的长事务的等待事件。
说明:
1. $user和$database需根据实际情况修改,建议和步骤2中的条件保持一致。
2. extract(epoch from current_timestamp-xact_start)/60 > 1表示查询执行时间大于1min的SQL,可根据实际情况修改,建议和步骤2中的条件保持一致。
步骤 4 查询长事务被阻塞原因。
分布式:
execute direct on all 'select * from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1));';
集中式:
select * from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1));

如上图,sessionid=52241的会话阻塞了步骤2中sessionid=52246的长事务。
说明:
1. $user和$database需根据实际情况修改,建议和步骤2中的条件保持一致。
2. extract(epoch from current_timestamp-xact_start)/60 > 1表示查询执行时间大于1min的SQL,可根据实际情况修改,建议和步骤2中的条件保持一致。
若未查询到数据,则表明该长事务不是被阻塞其他线程阻塞,可选择结束长事务或者优化SQL。
a、结束长事务,可执行以下SQL:
分布式:
execute direct on all 'select pg_terminate_session(pid,sessionid) pid,sessionid from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1 order by 1 desc;';
集中式:
select pg_terminate_session(pid,sessionid) pid,sessionid from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1 order by 1 desc;
说明:
1. $user和$database需根据实际情况修改,建议和步骤2中的条件保持一致。
2. extract(epoch from current_timestamp-xact_start)/60 > 1表示查询执行时间大于1min的SQL,可根据实际情况修改,建议和步骤2中的条件保持一致。
须知:强制结束SQL语句为风险操作,执行前需与客户及华为技术确认。
b、优化SQL,可查询执行计划后参照单SQL性能慢--计划分析优化该语句。
explain $query;
说明:
$query为步骤2中获取的语句。
若查询到数据,则表明长事务被其他线程阻塞,继续步骤5。
分布式:
execute direct on all 'select pid,sessionid,substr(query,0,100) from pg_stat_activity where sessionid in(select sessionid from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1)));';
集中式:
select pid,sessionid,substr(query,0,100) from pg_stat_activity where sessionid in(select sessionid from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1)));

如上图,查询出阻塞长事务的SQL语句,其sessionid=52241。
若数据库整体性能劣化严重,需要紧急处理本步骤中的SQL语句,即强制结束该SQL语句:
分布式:
execute direct on all 'select pg_terminate_session(pid,sessionid) pid,sessionid from pg_stat_activity where sessionid in(select sessionid from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != ''idle'' and datname in(''$database'') and usename in (''$user'') and extract(epoch from current_timestamp-xact_start)/60 > 1)));';
集中式:
select pg_terminate_session(pid,sessionid) pid,sessionid from pg_stat_activity where sessionid in(select sessionid from pg_thread_wait_status where sessionid in(select block_sessionid from pg_thread_wait_status where sessionid in(select sessionid from pg_stat_activity where state != 'idle' and datname in('$database') and usename in ('$user') and extract(epoch from current_timestamp-xact_start)/60 > 1)));

如上图,pid=t,表示结束会话成功。
说明:
1. $user和$database需根据实际情况修改,建议和步骤2中的条件保持一致。
2. extract(epoch from current_timestamp-xact_start)/60 > 1表示查询执行时间大于1min的SQL,可根据实际情况修改,建议和步骤2中的条件保持一致。
须知:强制结束SQL语句为风险操作,执行前需与客户及华为技术确认。
若当前不紧急或者性能劣化不严重,可先执行步骤6。
步骤 6 查看步骤5中SQL语句的执行计划,参照单SQL性能慢--计划分析优化该语句。
explain $query;
说明:
$query为步骤5中获取的语句。
步骤 7 若以上均不涉及,请联系华为技术支持。
----结束
- 点赞
- 收藏
- 关注作者
评论(0)