GaussDB(DWS)监控工具指南(三)资源池级监控【绽放吧!GaussDB(DWS)云原生数仓】

举报
幕后小黑爪 发表于 2023/10/30 17:11:37 2023/10/30
【摘要】 资源池是数据库资源管控重要的一环,通过这篇文章您可以了解到资源池的来龙去脉,如何使用资源池,如何使用资源池监控去分析问题等。

一、资源池

        在数据库最初阶段,是没有资源概念的,给数据库输入SQL语句,数据库输出结果,在简单业务场景下,用户独占数据库是不存在资源争抢问题的。随着数据库业务增长,用户也越来越多,此时不同用户间的SQL会抢占操作系统的资源(CPU、内存、IO、网络等),如果不加限制的话就会影响整个集群的用户,造成集群不可用的情况。为了防止在这种场景发生,需要对用户业务SQL进行区分,对不同的用户需要资源分配和管控。为此,资源池应运而生,资源池作为一种逻辑媒介,连接用户和系统资源,管控每个用户的资源使用,保证集群的可用状态。

二、GaussDBDWS)中的资源池

       当管理员创建用户后,会自动绑定在默认资源池default_pool上,从网页上,可以再创建资源池,然后绑定相应的用户在对应的资源池上。此时用户下发SQL语句执行,下发的语句就会收到资源池配置参数的管控。通过资源池可以划分不同用户的资源使用情况,简化了集群的管理,它可以统一管理所有的系统计算资源。这意味着管理员只需要管理资源池,而不是每个节点上的资源。

       GaussDBDWS)资源池(租户)功能支持通过管控CPU、并发、内存、网络等手段对用户下发的业务语句进行管控,在不同的维度对用户语句进行管控。具体来说支持设置资源池参数的方式控制并发、内存、CPU利用率等能力。当用户的语句是未知新语句时,也可以通过一些异常规则来控制语句情况,比如查杀超过限制的烂SQL等,在此基础上,还支持设置黑名单的方式严格管控用户下发的语句,这些功能后续会进行说明,本文聚焦资源池先不展开。

        通过下列语句可查询资源池信息

postgres=# select * from pg_resource_pool;
     respool_name     | mem_percent | cpu_affinity |    control_group    | active_statements | max_dop | memory_limit |  parentid  | io_limits | io_priority |    nodegroup     | is_foreign | short_acc | except_rule | weight 
----------------------+-------------+--------------+---------------------+-------------------+---------+--------------+------------+-----------+-------------+------------------+------------+-----------+-------------+--------
 default_pool         |           0 |           -1 | DefaultClass:Medium |                -1 |      -1 | default      |          0 |         0 | None        | installation     | f          | t         | None        |     -1
 respool_1            |           0 |           -1 | ClassN1:wn1         |                10 |      -1 | default      |          0 |         0 | None        | logical_cluster1 | f          | t         | None        |     -1
 respool_grp_1        |          20 |           -1 | ClassG1             |                10 |      -1 | default      |          0 |         0 | None        | logical_cluster1 | f          | t         | None        |     -1
 respool_g1_job_1     |          20 |           -1 | ClassG1:wg1_1       |                10 |      -1 | default      | 2147484586 |         0 | None        | logical_cluster1 | f          | t         | None        |     -1
 respool_g1_job_2     |          20 |           -1 | ClassG1:wg1_2       |                10 |      -1 | default      | 2147484586 |         0 | None        | logical_cluster1 | f          | t         | None        |     -1
 respool_0_mempercent |           0 |           -1 | DefaultClass:Medium |                10 |      -1 | default      |          0 |         0 | None        | logical_cluster1 | f          | t         | None        |     -1
(6 rows)

          对于资源池的相关参数,用户可通过管控面进行配置,同时也支持管理员通过语句来修改资源池配置,如下所示,可修改默认资源池的快车道并发限制。其他参数同理,不过该操作有风险,建议用户在GaussDB(DWS)运维人员确认后执行。

alter resource pool default_pool with (max_dop=1);

三、资源池监控

       GaussDBDWS)为用户提供了多维度的资源监控视图,可支持从不同维度查询集群状态。


       ziyuanjiankong1.png

       GaussDBDWS)提供资源池级别的监控能力,监控SQL语句的运行情况,主要包含实时监控和历史监控,跟用户监控类似,资源池监控展示了每个资源池的运行作业数、排队作业数、内存使用、内存使用上限、 CPU使用情况、读写IO情况等,通过下列语句可进行查询获取

postgres=# select * from gs_respool_resource_info;
 nodegroup |    rpname    |       cgroup        | ref_count | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_cpu | cpu_limit | used_mem | estimate_mem | mem_limit | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed
-----------+--------------+---------------------+-----------+----------+-----------+------------+----------+-----------+------------+----------+-----------+----------+--------------+-----------+-------------+--------------+-------------+--------------+------------+-------------
 lc1       | pool_group   | ClassN              |         0 |        0 |         0 |         -1 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |    116844 |           0 |            0 |           0 |            0 |          0 |           0
 lc1       | pool_work    | ClassN:wg1          |         0 |        0 |         0 |         10 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |     23364 |           0 |            0 |           0 |            0 |          0 |           0
 lc2       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       208 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
 lc1       | resp_other   | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |        100 |        0 |       312 |        0 |            0 |    175260 |           0 |            0 |           0 |            0 |          0 |           0
 lc1       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       312 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
(5 rows)

       其中,nodegroup 代表资源池所属的逻辑集群信息。fast_runslow_run代表资源池快车道运行数、慢车道运行数,fast_waitslow_wait为快车道和慢车道的排队作业数。其中,slow_wait也包含了CCN排队的作业。fast_limitslow_limit代表了快慢车道的并发上限,cpu_limit代表了资源池配置的cpu设置的限额是多少,通过used_cpu可以看到不同资源池的CPU使用率情况。estimate_mem为资源池内用户下发作业的估算内存总和,used_mem为资源池实际使用的内存,mem_limit为资源池设置的可用内存上限。

        同样,历史资源监控也提供了历史问题定位的能力,该视图会30s采集一次,timestamp为采集的时刻。

postgres=# select * from gs_respool_resource_history;
           timestamp           | nodegroup |    rpname    |       cgroup        | ref_count | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_cpu | cpu_limit | used_mem | estimate_mem | mem_limit | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed
-------------------------------+-----------+--------------+---------------------+-----------+----------+-----------+------------+----------+-----------+------------+----------+-----------+----------+--------------+-----------+-------------+--------------+-------------+--------------+------------+-------------
 2023-10-20 20:24:14.715107+08 | lc1       | pool_group   | ClassN              |         0 |        0 |         0 |         -1 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |    116844 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:14.715107+08 | lc1       | pool_work    | ClassN:wg1          |         0 |        0 |         0 |         10 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |     23364 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:14.715107+08 | lc2       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       208 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:14.715107+08 | lc1       | resp_other   | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |        100 |        0 |       312 |        0 |            0 |    175260 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:14.715107+08 | lc1       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       312 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:44.791512+08 | lc1       | pool_group   | ClassN              |         0 |        0 |         0 |         -1 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |    116844 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:44.791512+08 | lc1       | pool_work    | ClassN:wg1          |         0 |        0 |         0 |         10 |        0 |         0 |         10 |        0 |       312 |        0 |            0 |     23364 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:44.791512+08 | lc2       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       208 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:44.791512+08 | lc1       | resp_other   | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |        100 |        0 |       312 |        0 |            0 |    175260 |           0 |            0 |           0 |            0 |          0 |           0
 2023-10-20 20:24:44.791512+08 | lc1       | default_pool | DefaultClass:Medium |         0 |        0 |         0 |         -1 |        0 |         0 |         -1 |        0 |       312 |        0 |            0 |    584220 |           0 |            0 |           0 |            0 |          0 |           0
...

四、通过资源池监控分析定位问题(待补充)

            1、当发现业务反馈语句阻塞严重,不执行作业,可查询实时资源池监控或者历史资源池监控进行分析,看是否作业堆积排队,如果fast_limit上的限制为10,,fast_run字段也为10,然后fast_wait较多,此时可尝试修改资源池参数max_dop,适当调大并发上限。

            2、业务反馈跑批业务在正常运行的情况下批量变慢,此时可观察历史资源池监控对比观察,统计劣化前后同一批作业一段时间的内存资源使用情况,也可通过当时资源池作业排队现象来定位问题。

五、更好用的监控视图

        为了提升系统可用性,GaussDB(DWS)也提供了更便捷,更易用的视图用以帮用户进行观察系统状态和定位问题。  

       在内核821版本中,用户可使用gs_query_monitorgs_user_monitorgs_respool_monitor视图进行语句级、用户级、资源池的资源监控,这些视图以GaussDB(DWS)监控工具指南系列中所讲的视图为基础,选取常用的定位字段,为现网用户提供更易用的一套实时监控脚本。

具体效果如下:

1. 作业监控

postgres=# select * from gs_query_monitor;
   usename    | nodename |    nodegroup     |    rpname    | priority |          xact_start           |          query_start          | block_time | duration | query_band |  attribute  | lane | status  | queue | used_mem | estimate_mem | used_cpu | read_speed | write_speed | send_speed | recv_speed | dn_count | stream_count |       pid       | lw
tid  |     query_id      | unique_sql_id |                      query                       
--------------+----------+------------------+--------------+----------+-------------------------------+-------------------------------+------------+----------+------------+-------------+------+---------+-------+----------+--------------+----------+------------+-------------+------------+------------+----------+--------------+-----------------+---
-----+-------------------+---------------+--------------------------------------------------
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.754207+08 | 2023-10-30 16:39:28.748855+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878865264 | 98
2280 | 72902018968076864 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.760305+08 | 2023-10-30 16:39:28.754861+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878866632 | 98
2283 | 72902018968076871 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.761491+08 | 2023-10-30 16:39:28.756124+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878865720 | 98
2281 | 72902018968076872 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.768333+08 | 2023-10-30 16:39:28.762653+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878867544 | 98
2285 | 72902018968076877 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.772288+08 | 2023-10-30 16:39:28.766933+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878868912 | 98
2288 | 72902018968076881 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.772304+08 | 2023-10-30 16:39:28.766966+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878867088 | 98
2284 | 72902018968076882 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.777958+08 | 2023-10-30 16:39:28.772572+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878868000 | 98
2286 | 72902018968076888 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.779373+08 | 2023-10-30 16:39:28.773997+08 |         59 |        0 |            | Complicated | slow | pending | CCN   |        0 |         4360 |        0 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878868456 | 98
2287 | 72902018968076889 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.753845+08 | 2023-10-30 16:39:28.748498+08 |          0 |       59 |            | Complicated | slow | running | None  |        4 |         4360 |     .289 |          0 |           0 |          0 |          0 |        0 |            0 | 139906878864808 | 98
2279 | 72902018968076862 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
 user_default | cn_5001  | logical_cluster1 | default_pool | Medium   | 2023-10-30 16:39:28.753957+08 | 2023-10-30 16:39:28.748609+08 |          0 |       59 |            | Complicated | slow | running | None  |        4 |         4360 |     .288 |          0 |           0 |         17 |         23 |        0 |            0 | 139906878866176 | 98
2282 | 72902018968076863 |    2372000271 | INSERT INTO t1 SELECT generate_series(1,100000);
(10 rows)

2. 用户监控

postgres=# select * from gs_user_monitor;
     usename      |    rpname     |    nodegroup     | session_count | active_count | global_wait | fast_run | fast_wait | slow_run | slow_wait | used_mem | estimate_mem | used_cpu | read_speed | write_speed | send_speed | recv_speed | used_space | space_limit | used_temp_space | temp_space_limit | used_spill_space | spill_space_limit 
------------------+---------------+------------------+---------------+--------------+-------------+----------+-----------+----------+-----------+----------+--------------+----------+------------+-------------+------------+------------+------------+-------------+-----------------+------------------+------------------+-------------------
 logical_cluster2 | default_pool  | logical_cluster2 |             0 |            0 |           0 |        0 |         0 |        0 |         0 |        0 |            0 |        0 |          0 |           0 |          0 |          0 |          0 |          -1 |               0 |               -1 |                0 |                -1
 user_grp_1       | respool_grp_1 | logical_cluster1 |             0 |            0 |           0 |        0 |         0 |        0 |         0 |        0 |            0 |        0 |          0 |           0 |          0 |          0 |          0 |          -1 |               0 |               -1 |                0 |                -1
 logical_cluster1 | default_pool  | logical_cluster1 |             0 |            0 |           0 |        0 |         0 |        0 |         0 |        0 |            0 |        0 |          0 |           0 |          0 |          0 |    1834424 |          -1 |               0 |               -1 |                0 |                -1
 user_normal      | respool_1     | logical_cluster1 |             0 |            0 |           0 |        0 |         0 |        0 |         0 |        0 |            0 |        0 |          0 |           0 |          0 |          0 |          0 |          -1 |               0 |               -1 |                0 |                -1
 user_default     | default_pool  | logical_cluster1 |            10 |           10 |           0 |        0 |         0 |        2 |         8 |        8 |         8720 |     .563 |          0 |          15 |          0 |          0 |     640080 |          -1 |               0 |               -1 |                0 |                -1
(5 rows)

3. 资源池监控

postgres=# select * from gs_respool_monitor;
        rpname        |    nodegroup     | cn_count | short_acc | session_count | active_count | global_wait | fast_run | fast_wait | fast_limit | slow_run | slow_wait | slow_limit | used_mem | estimate_mem | mem_limit | query_mem_limit | used_cpu | cpu_limit | read_speed | write_speed | send_speed | recv_speed 
----------------------+------------------+----------+-----------+---------------+--------------+-------------+----------+-----------+------------+----------+-----------+------------+----------+--------------+-----------+-----------------+----------+-----------+------------+-------------+------------+------------
 default_pool         | logical_cluster2 |        3 | t         |             0 |            0 |           0 |        0 |         0 |         -1 |        0 |         0 |         -1 | 0 bytes  | 0 bytes      | 11 GB     | 4376 MB         |        0 |         8 | 0 bytes/s  | 0 bytes/s   | 0 bytes/s  | 0 bytes/s
 respool_g1_job_1     | logical_cluster1 |        3 | t         |             0 |            0 |           0 |        0 |         0 |         -1 |        0 |         0 |         10 | 0 bytes  | 0 bytes      | 437 MB    | 175 MB          |        0 |         8 | 0 bytes/s  | 0 bytes/s   | 0 bytes/s  | 0 bytes/s
 respool_1            | logical_cluster1 |        3 | t         |             0 |            0 |           0 |        0 |         0 |         -1 |        0 |         0 |         10 | 0 bytes  | 0 bytes      | 11 GB     | 4376 MB         |        0 |         8 | 0 bytes/s  | 0 bytes/s   | 0 bytes/s  | 0 bytes/s
 respool_0_mempercent | logical_cluster1 |        3 | t         |             0 |            0 |           0 |        0 |         0 |         -1 |        0 |         0 |         10 | 0 bytes  | 0 bytes      | 11 GB     | 4376 MB         |        0 |         8 | 0 bytes/s  | 0 bytes/s   | 0 bytes/s  | 0 bytes/s
 respool_g1_job_2     | logical_cluster1 |        3 | t         |             0 |            0 |           0 |        0 |         0 |         -1 |        0 |         0 |         10 | 0 bytes  | 0 bytes      | 437 MB    | 175 MB          |        0 |         8 | 0 bytes/s  | 0 bytes/s   | 0 bytes/s  | 0 bytes/s
 default_pool         | logical_cluster1 |        3 | t         |            10 |           10 |           0 |        0 |         0 |         -1 |        2 |         8 |         -1 | 8192 KB  | 8720 MB      | 11 GB     | 4376 MB         |     .577 |         8 | 0 bytes/s  | 17 KB/s     | 0 bytes/s  | 0 bytes/s
(6 rows)




我正在参加【有奖征文 第27期】绽放吧!GaussDB(DWS)云原生数仓!

【有奖征文 第27期】绽放吧!GaussDB(DWS)云原生数仓,发文赢取GaussDB(DWS)开发者认证考试劵等更多好礼!-云社区-华为云 (huaweicloud.com)

【版权声明】本文为华为云社区用户原创内容,转载时必须标注文章的来源(华为云社区)、文章链接、文章作者等基本信息, 否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。