GaussDB(DWS)应急预案(二):动态内存不足/memory is temporarily unavailable

举报
Arrow0lf 发表于 2023/12/27 19:36:56 2023/12/27
【摘要】 DWS动态内存使用率高应急预案

1.1 动态内存高/memory temporarily unavailable(预计10min)

1.1.1 应急步骤

  • 查看各节点动态内存占用,找到动态内存占用高的实例:

Select a.nodename,a.memorymbytes as dynamic_used_memory,b.memorymbytes as max_dynamic_memory, dynamic_used_memory/max_dynamic_memory*100 as used_rate  from pgxc_total_memory_detail a join pgxc_total_memory_detail b on a.nodename=b.nodename  where a.memorytype = 'dynamic_used_memory' and b.memorytype = 'max_dynamic_memory' order by a.memorymbytes desc;

  • 连接内存占用高的实例,查看空闲连接内存占用:

select c.state,pg_size_pretty(sum(totalsize)) from pv_session_memory_detail p inner join pg_stat_activity c on c.pid = split_part(p.sessid,'.',2) group by 1;

  • 连接内存占用高的实例,查看各语句内存占用,拼接查杀命令,应急将内存占用高的语句查杀:

select 'select pg_terminate_backend(' || pid || ');' , state, query_id, pg_size_pretty(totalsize) total, pg_size_pretty(freesize) free, pg_size_pretty(usedsize) used, sqlstr FROM (SELECT state, query_id, pid, sum(totalsize) totalsize, sum(freesize) freesize, sum(usedsize) usedsize, substr(query,1,100) sqlstr FROM pv_session_memory_detail, pg_stat_activity WHERE pid=split_part(sessid,'.',2) GROUP BY 1,2,3,7 ORDER BY 4 DESC);

  • 连接内存占用高的实例,查看各线程memory context情况,并拼接查杀命令:

select 'select pg_terminate_backend(' || split_part(sessid,'.',2) || ');', split_part(sessid,'.',2) pid,pg_size_pretty(sum(totalsize)) total_size,count(*) context_count from pv_session_memory_detail group by pid order by sum(totalsize) desc limit 100;

  • 根据实时topsql查看动态内存使用最高的语句,并拼接查杀命令:

SELECT  'execute direct on(' || nodename || ') ''select pg_cancel_backend(' || pid || ')'';', username, start_time, duration, max_peak_memory, substr(query,1,100) from pgxc_wlm_session_statistics order by max_peak_memory desc;

  • 连接CN,清理空闲连接

select * from pgxc_clean_free_conn;

  • 若上述手段均无法恢复,应急kill内存占用高的实例快速恢复。

1.1.2 恢复确认

重新连接CN,查看各实例动态内存使用,确保used_rate降低至90%以下。

1.1.3 原因分析

  • 查看各节点动态内存占用,找到动态内存使用高的CN/DN实例:

Select a.nodename,a.memorymbytes as dynamic_used_memory,b.memorymbytes as max_dynamic_memory, dynamic_used_memory/max_dynamic_memory*100 as used_rate  from pgxc_total_memory_detail a join pgxc_total_memory_detail b on a.nodename=b.nodename  where a.memorytype = 'dynamic_used_memory' and b.memorytype = 'max_dynamic_memory' order by a.memorymbytes desc;

  • 执行如下命令,找到对应实例的节点和端口号,gsql登录该实例:

cm_ctl query -Cvp | grep xxxx

  • 执行如下命令,确认当前active和idle连接的内存使用量:

select c.state,pg_size_pretty(sum(totalsize)) from pv_session_memory_detail p inner join pg_stat_activity c on c.pid = split_part(p.sessid,'.',2) group by 1;

  • 查看各个状态的内存使用总量,结果示例如下:

  • 如果idle状态的内存占用高,需要通过如下方式手动清理空闲连接(不影响业务):

select * from pgxc_clean_free_conn;

  • 如果active状态内存占用高,需要继续查看各语句的内存占用:

select 'select pg_terminate_backend(' || pid || ');' , state, query_id, pg_size_pretty(totalsize) total, pg_size_pretty(freesize) free, pg_size_pretty(usedsize) used, sqlstr FROM (SELECT state, query_id, pid, sum(totalsize) totalsize, sum(freesize) freesize, sum(usedsize) usedsize, substr(query,1,100) sqlstr FROM pv_session_memory_detail, pg_stat_activity WHERE pid=split_part(sessid,'.',2) and usename not in ('omm', 'Ruby') GROUP BY 1,2,3,7 ORDER BY 4 DESC);

  • 执行结果如下:

  • 与业务确认后,应急查杀内存占用高的语句。
  • 若业务上有upsert入库场景,且upsert由于主键冲突频繁触发报错,可能会导致相关线程短时间内内存冲高,可通过如下方式排查:

select 'select pg_terminate_backend(' || split_part(sessid,'.',2) || ');', split_part(sessid,'.',2) pid,pg_size_pretty(sum(totalsize)) total_size,count(*) context_count from pv_session_memory_detail group by pid order by sum(totalsize) desc limit 100;

  • 同时可以查看相关的实例日志,确认是否有大量报错(xxxx替换成实际的实例编号,日志时间换成实际的日志时间):
cd $GAUSSLOG/pg_log/xxxx
zgrep 'ERROR: '  postgresql-2023-11-11* | wc –l
  • 若日志中有大量报错,需要根据报错内容确认报错原因,减少相关报错(如常见的主键冲突、违反非空约束等场景)
zgrep 'ERROR: '  postgresql-2023-11-11* | more
  • 出现memory is temporarily unavailable的报错时,实例日志中会打印相关信息。根据报错的实例日志,在日志中搜索“abnormal”关键字找到当时内存使用最高的语句,找到对应的thread id,再根据thread id找到对应的query id(注意,非日志中的debug_query_id)。
  • 示例:如下日志打印中,根据abnormal关键字找到对应的thread id实际消耗内存为13G,该id对应的query id为76279718688746485,对应的语句为explain performance with xxx。

----debug_query_id=76279718689098154, The abnormal query thread id 140664667547392.It current used memory is 13618 MB and estimated memory is 1102 MB.It also is the query which costs the maximum memory. 

----debug_query_id=76279718689098154, It is not the current session and beentry info : datid<16389>, app_name<cn_5001>, query_id<76279718688746485>, tid<140664667547392>, lwtid<173496>, parent_tid<0>, thread_level<0>, query_string<explain performance with …>.

  • 对于事后分析的场景,还可以根据历史topsql中max_peak_memory字段找到问题时间点内存占用高的语句,max_peak_memory表示实际内存占用高的语句:

select * from pgxc_wlm_session_info where start_time > 'xxxx-xx-xx xx:xx:xx' and start_time < 'xxxx-xx-xx xx:xx:xx' order by max_peak_memory desc limit 100;

 

1.1.4 预防再次出现

  1. 根据最佳实践,打开动态资源管理,配置内存熔断方案,设置作业级/用户级/资源池级熔断参数
  2. 定时清理空闲连接
  3. upsert避免频繁报错问题
  4. 若总内存较小,需要考虑提升内存规格
【版权声明】本文为华为云社区用户原创内容,转载时必须标注文章的来源(华为云社区)、文章链接、文章作者等基本信息, 否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。