GaussDB(DWS)应急预案(二):动态内存不足/memory is temporarily unavailable
1.1 动态内存高/memory temporarily unavailable(预计10min)
1.1.1 应急步骤
- 查看各节点动态内存占用,找到动态内存占用高的实例:
set disable_memory_protect = on; --可以防止系统表查询报错
Select a.nodename,a.memorymbytes as dynamic_used_memory,b.memorymbytes as max_dynamic_memory, dynamic_used_memory/max_dynamic_memory*100 as used_rate from pgxc_total_memory_detail a join pgxc_total_memory_detail b on a.nodename=b.nodename where a.memorytype = 'dynamic_used_memory' and b.memorytype = 'max_dynamic_memory' order by a.memorymbytes desc;
- 连接内存占用高的实例,查看空闲连接内存占用:
select c.state,pg_size_pretty(sum(totalsize)) from pv_session_memory_detail p inner join pg_stat_activity c on c.pid = split_part(p.sessid,'.',2) group by 1;
- 连接内存占用高的实例,查看各语句内存占用,拼接查杀命令,应急将内存占用高的语句查杀:
select 'select pg_terminate_backend(' || pid || ');' , state, query_id, pg_size_pretty(totalsize) total, pg_size_pretty(freesize) free, pg_size_pretty(usedsize) used, sqlstr FROM (SELECT state, query_id, pid, sum(totalsize) totalsize, sum(freesize) freesize, sum(usedsize) usedsize, substr(query,1,100) sqlstr FROM pv_session_memory_detail, pg_stat_activity WHERE pid=split_part(sessid,'.',2) GROUP BY 1,2,3,7 ORDER BY 4 DESC);
- 连接内存占用高的实例,查看各线程memory context情况,并拼接查杀命令:
select 'select pg_terminate_backend(' || split_part(sessid,'.',2) || ');', split_part(sessid,'.',2) pid,pg_size_pretty(sum(totalsize)) total_size,count(*) context_count from pv_session_memory_detail group by pid order by sum(totalsize) desc limit 100;
- 根据实时topsql查看动态内存使用最高的语句,并拼接查杀命令:
SELECT 'execute direct on(' || nodename || ') ''select pg_cancel_backend(' || pid || ')'';', username, start_time, duration, max_peak_memory, substr(query,1,100) from pgxc_wlm_session_statistics order by max_peak_memory desc;
- 连接CN,清理空闲连接
select * from pgxc_clean_free_conn;
- 若上述手段均无法恢复,应急kill内存占用高的实例快速恢复。
1.1.2 恢复确认
重新连接CN,查看各实例动态内存使用,确保used_rate降低至90%以下。
1.1.3 原因分析
- 查看各节点动态内存占用,找到动态内存使用高的CN/DN实例:
Select a.nodename,a.memorymbytes as dynamic_used_memory,b.memorymbytes as max_dynamic_memory, dynamic_used_memory/max_dynamic_memory*100 as used_rate from pgxc_total_memory_detail a join pgxc_total_memory_detail b on a.nodename=b.nodename where a.memorytype = 'dynamic_used_memory' and b.memorytype = 'max_dynamic_memory' order by a.memorymbytes desc;
- 执行如下命令,找到对应实例的节点和端口号,gsql登录该实例:
cm_ctl query -Cvp | grep xxxx
- 执行如下命令,确认当前active和idle连接的内存使用量:
select c.state,pg_size_pretty(sum(totalsize)) from pv_session_memory_detail p inner join pg_stat_activity c on c.pid = split_part(p.sessid,'.',2) group by 1;
- 查看各个状态的内存使用总量,结果示例如下:
- 如果idle状态的内存占用高,需要通过如下方式手动清理空闲连接(不影响业务):
select * from pgxc_clean_free_conn;
- 如果active状态内存占用高,需要继续查看各语句的内存占用:
select 'select pg_terminate_backend(' || pid || ');' , state, query_id, pg_size_pretty(totalsize) total, pg_size_pretty(freesize) free, pg_size_pretty(usedsize) used, sqlstr FROM (SELECT state, query_id, pid, sum(totalsize) totalsize, sum(freesize) freesize, sum(usedsize) usedsize, substr(query,1,100) sqlstr FROM pv_session_memory_detail, pg_stat_activity WHERE pid=split_part(sessid,'.',2) and usename not in ('omm', 'Ruby') GROUP BY 1,2,3,7 ORDER BY 4 DESC);
- 执行结果如下:
- 与业务确认后,应急查杀内存占用高的语句。
- 若业务上有upsert入库场景,且upsert由于主键冲突频繁触发报错,可能会导致相关线程短时间内内存冲高,可通过如下方式排查:
select 'select pg_terminate_backend(' || split_part(sessid,'.',2) || ');', split_part(sessid,'.',2) pid,pg_size_pretty(sum(totalsize)) total_size,count(*) context_count from pv_session_memory_detail group by pid order by sum(totalsize) desc limit 100;
- 同时可以查看相关的实例日志,确认是否有大量报错(xxxx替换成实际的实例编号,日志时间换成实际的日志时间):
cd $GAUSSLOG/pg_log/xxxx
zgrep 'ERROR: ' postgresql-2023-11-11* | wc –l
- 若日志中有大量报错,需要根据报错内容确认报错原因,减少相关报错(如常见的主键冲突、违反非空约束等场景)
zgrep 'ERROR: ' postgresql-2023-11-11* | more
- 出现memory is temporarily unavailable的报错时,实例日志中会打印相关信息。根据报错的实例日志,在日志中搜索“abnormal”关键字找到当时内存使用最高的语句,找到对应的thread id,再根据thread id找到对应的query id(注意,非日志中的debug_query_id)。
- 示例:如下日志打印中,根据abnormal关键字找到对应的thread id实际消耗内存为13G,该id对应的query id为76279718688746485,对应的语句为explain performance with xxx。
----debug_query_id=76279718689098154, The abnormal query thread id 140664667547392.It current used memory is 13618 MB and estimated memory is 1102 MB.It also is the query which costs the maximum memory.
----debug_query_id=76279718689098154, It is not the current session and beentry info : datid<16389>, app_name<cn_5001>, query_id<76279718688746485>, tid<140664667547392>, lwtid<173496>, parent_tid<0>, thread_level<0>, query_string<explain performance with …>.
- 对于事后分析的场景,还可以根据历史topsql中max_peak_memory字段找到问题时间点内存占用高的语句,max_peak_memory表示实际内存占用高的语句:
select * from pgxc_wlm_session_info where start_time > 'xxxx-xx-xx xx:xx:xx' and start_time < 'xxxx-xx-xx xx:xx:xx' order by max_peak_memory desc limit 100;
1.1.4 预防再次出现
- 根据最佳实践,打开动态资源管理,配置内存熔断方案,设置作业级/用户级/资源池级熔断参数
- 定时清理空闲连接
- upsert避免频繁报错问题
- 若总内存较小,需要考虑提升内存规格
- 点赞
- 收藏
- 关注作者
评论(0)