Visual Monitoring of CCE Cluster Node Pools

可以交个朋友, posted 2024/01/29 11:34:55

1 Background

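In large-scale cluster deployments, CCE clusters provide node pools to help users manage the nodes in a Kubernetes cluster. Node pools make it possible to scale nodes out and in dynamically.
Against this background, monitoring node pool status becomes especially important: you need to know when a node pool performed scaling activity and how often it oscillates. You also need to monitor the pool's overall resource allocation rate and overall resource usage, which helps you judge whether resources are allocated to workloads sensibly.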


2 Solution Overview

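  • Monitor the node count trend of each node pool with a vector-matching query over the node-level metrics exposed by kube-state-metrics. The metrics used are kube_node_labels and kube_node_info.
  • Monitor the CPU and memory allocation rate of each node pool with a vector-matching query over the kube-state-metrics (ksm) metrics kube_pod_container_resource_requests and kube_node_status_allocatable.
  • Monitor node pool resource usage by combining the ksm metric kube_node_labels with node-exporter metrics such as node_cpu_seconds_total.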

3 Walkthrough

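The walkthrough has two parts:

  • First, write and verify the PromQL queries. They involve vector matching and other non-trivial operations, so they need to be written carefully.
  • Then build a Grafana dashboard from those queries to provide the visual interface.

Prerequisite: the kube-prometheus-stack add-on is installed in the CCE cluster.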


3.1 Writing the PromQL Queries

  1. Number of nodes in each node pool
    count(kube_node_info{job="kube-state-metrics-prom"} * on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom", label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)
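    The kube_node_info series is the base of the match. The matching label is node, and the label label_cce_cloud_com_cce_nodepool, which holds the node pool name, is copied onto the kube_node_info series. Both kube_node_info and kube_node_labels carry the node label, so the two metrics can be joined on it. group_left declares the left-hand vector as the "many" side of a many-to-one match (the side with higher cardinality); the labels listed in its parentheses are copied from the right-hand side onto the result.

    Checking against the nodes listed for each node pool in CCE: the data matches.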
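To make the matching semantics concrete, here is a minimal Python sketch that emulates the join and the count on plain dictionaries. It is not how Prometheus evaluates the query; the node addresses, pool names, and the helper names join_on_node and count_by are all made up for illustration.

```python
from collections import Counter

POOL_LABEL = "label_cce_cloud_com_cce_nodepool"

# Hypothetical label sets, one dict per series (both metrics have the value 1).
kube_node_info = [
    {"node": "192.168.0.11"},
    {"node": "192.168.0.12"},
    {"node": "192.168.0.13"},
    {"node": "192.168.0.14"},
]
kube_node_labels = [
    {"node": "192.168.0.11", POOL_LABEL: "pool-a"},
    {"node": "192.168.0.12", POOL_LABEL: "pool-a"},
    {"node": "192.168.0.13", POOL_LABEL: "pool-b"},
    {"node": "192.168.0.14", POOL_LABEL: ""},  # node that belongs to no pool
]


def join_on_node(left, right, copied_label):
    """Emulate `left * on(node) group_left(copied_label) right{copied_label!=""}`."""
    # The right side is the "one" side: at most one series per node.
    one_side = {s["node"]: s[copied_label] for s in right if s.get(copied_label)}
    # Left series without a match are dropped; matched ones gain the copied label.
    return [
        {**series, copied_label: one_side[series["node"]]}
        for series in left
        if series["node"] in one_side
    ]


def count_by(series_list, label):
    """Emulate `count(...) by (label)`."""
    return dict(Counter(series[label] for series in series_list))


joined = join_on_node(kube_node_info, kube_node_labels, POOL_LABEL)
nodes_per_pool = count_by(joined, POOL_LABEL)
print(nodes_per_pool)  # {'pool-a': 2, 'pool-b': 1}
```

The node with an empty pool label disappears from the result, which mirrors the effect of the `label_cce_cloud_com_cce_nodepool!=""` selector in the query.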


  2. Node pool CPU allocation rate
    sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job="kube-state-metrics-prom",resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)*100
    A node's CPU allocation rate is usually computed by summing the CPU requests of all containers on the node and dividing by the node's allocatable CPU.
    For a node pool, the formula has to be broken down step by step:
    a. First take the CPU request of every container and aggregate with sum to get the CPU request of each pod:
    sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""}) by(label_cce_cloud_com_cce_nodepool,pod,namespace)

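    b. The pods returned by the query above may include pods in the Failed or Succeeded phase, whose resources have already been released. A further vector match against kube_pod_status_phase therefore keeps only pods that are not in those phases (that is, Pending, Running, or Unknown). Grouping the result by node pool gives the total CPU requested by pods in each node pool: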
    sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace)) * on(pod,namespace) group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)

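    c. Next, query how much each node pool can allocate. Take kube_node_status_allocatable, attach the node pool label, and sum by node pool to get the total allocatable CPU of all nodes in the pool: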
    sum(kube_node_status_allocatable{job="kube-state-metrics-prom",resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)

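    d. Finally, divide the requested amount from step b by the allocatable amount from step c and multiply by 100 to get the node pool's allocation rate as a percentage.

    Compared with the monitoring data CCE shows for the node pool: identical.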
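The arithmetic behind steps a to d can be sketched in a few lines of Python. This is only an illustration under assumed data: the nodes, pods, and numbers are invented, and allocation_percent is a hypothetical helper, not part of any Prometheus or CCE API.

```python
from collections import defaultdict

# Hypothetical inputs standing in for the four metrics used by the query.
node_pool = {"node-1": "pool-a", "node-2": "pool-a", "node-3": "pool-b"}
allocatable_cores = {"node-1": 4.0, "node-2": 4.0, "node-3": 8.0}
# One row per container: (namespace, pod, node, requested cores).
container_requests = [
    ("default", "web-1", "node-1", 1.0),
    ("default", "web-1", "node-1", 0.5),
    ("default", "web-2", "node-2", 0.5),
    ("batch", "job-1", "node-2", 2.0),
    ("default", "db-1", "node-3", 4.0),
]
pod_phase = {
    ("default", "web-1"): "Running",
    ("default", "web-2"): "Pending",
    ("batch", "job-1"): "Succeeded",
    ("default", "db-1"): "Running",
}


def allocation_percent(requests, phases, allocatable, pools):
    """Requested / allocatable per node pool, as a percentage."""
    requested = defaultdict(float)
    for namespace, pod, node, cores in requests:
        pool = pools.get(node)
        phase = phases.get((namespace, pod))
        # Steps a and b: skip nodes outside any pool and pods that finished.
        if pool is None or phase is None or phase in ("Failed", "Succeeded"):
            continue
        requested[pool] += cores
    capacity = defaultdict(float)
    for node, cores in allocatable.items():  # step c
        if node in pools:
            capacity[pools[node]] += cores
    # Step d: only pools present on both sides produce a result, as in PromQL.
    return {
        pool: 100 * requested[pool] / capacity[pool]
        for pool in capacity
        if pool in requested
    }


rates = allocation_percent(container_requests, pod_phase, allocatable_cores, node_pool)
print(rates)  # {'pool-a': 25.0, 'pool-b': 50.0}
```

In this sample the Pending pod web-2 still counts toward pool-a because the phase filter only excludes Failed and Succeeded, while the finished job-1 does not.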


  3. Node pool CPU usage
    (100-avg by (label_cce_cloud_com_cce_nodepool)((irate (node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))*on(node)group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""}*100))
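    CPU usage is the time spent in every CPU state other than idle, divided by total CPU time. The query uses irate to get the per-core fraction of time spent in idle mode, averages it over the node pool with avg, and subtracts the result (as a percentage) from 100 to get the usage.
    irate is more responsive than rate because it looks only at the last two samples in the range, so short spikes are not smoothed away (the "long tail" problem of rate). The trade-off is a noisier graph.
    node_cpu_seconds_total does not carry a node pool label, so the label is obtained by vector matching against kube_node_labels.

    The result is broadly consistent with the node pool monitoring data in the CCE cluster.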
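As a rough illustration of what the usage query computes, the following Python sketch derives an instantaneous idle rate from the last two samples of each per-core counter and turns the pool average into a usage percentage. The sample values are invented, the irate helper here is a simplified stand-in that ignores counter resets, and pool_cpu_usage_percent is a hypothetical name.

```python
from collections import defaultdict

node_pool = {"node-1": "pool-a", "node-2": "pool-b"}

# Hypothetical idle-mode counters: (node, cpu) -> [(timestamp s, idle seconds)].
idle_samples = {
    ("node-1", "0"): [(0.0, 100.0), (10.0, 109.0), (20.0, 114.0)],
    ("node-1", "1"): [(0.0, 200.0), (10.0, 209.0), (20.0, 216.5)],
    ("node-2", "0"): [(0.0, 50.0), (10.0, 51.0), (20.0, 53.5)],
}


def irate(points):
    """Per-second rate from the last two samples only (no counter-reset handling)."""
    (t_prev, v_prev), (t_last, v_last) = points[-2], points[-1]
    return (v_last - v_prev) / (t_last - t_prev)


def pool_cpu_usage_percent(samples, pools):
    """100 minus the average idle percentage over every core in each pool."""
    idle_by_pool = defaultdict(list)
    for (node, _cpu), points in samples.items():
        pool = pools.get(node)
        if pool:  # cores on nodes outside any pool are ignored
            idle_by_pool[pool].append(irate(points))
    return {
        pool: 100 - 100 * sum(values) / len(values)
        for pool, values in idle_by_pool.items()
    }


usage = pool_cpu_usage_percent(idle_samples, node_pool)
print(usage)  # {'pool-a': 37.5, 'pool-b': 75.0}
```

Note that the earlier samples at t=0 play no part in the result, which is exactly why irate reacts quickly and also why it can look jumpy on a long time range.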


  4. Node pool memory allocation rate
    sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job="kube-state-metrics-prom", resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)*100
    The principle is the same as for the CPU allocation rate query above.

    The data is broadly consistent with CCE's node pool monitoring.


  5. Node pool memory usage
    100-100*(sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemAvailable_bytes{job="node-exporter"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})/sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemTotal_bytes{job="node-exporter"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""}))
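    Available memory on a node: node_memory_MemAvailable_bytes, the memory available from an application's point of view.
    Total memory on a node: node_memory_MemTotal_bytes.
    To lift these to the node pool level, join them with kube_node_labels by vector matching and aggregate by node pool.

    The monitoring data is broadly consistent.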
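One property of the memory query is worth spelling out: it divides the summed available memory by the summed total memory, so large nodes weigh more than small ones. The Python sketch below, using invented node sizes and hypothetical helper names, contrasts that with a plain average of per-node percentages.

```python
from collections import defaultdict

GIB = 2**30

# Hypothetical nodes: pool-a mixes a small busy node with a large quiet one.
node_pool = {"node-1": "pool-a", "node-2": "pool-a", "node-3": "pool-b"}
mem_total = {"node-1": 8 * GIB, "node-2": 24 * GIB, "node-3": 16 * GIB}
mem_available = {"node-1": 2 * GIB, "node-2": 18 * GIB, "node-3": 4 * GIB}


def pool_memory_usage_percent(available, total, pools):
    """100 - 100 * sum(available) / sum(total) per pool, as in the query."""
    avail_sum = defaultdict(int)
    total_sum = defaultdict(int)
    for node, pool in pools.items():
        if node in available and node in total:
            avail_sum[pool] += available[node]
            total_sum[pool] += total[node]
    return {
        pool: 100 - 100 * avail_sum[pool] / total_sum[pool] for pool in total_sum
    }


def mean_of_node_percentages(available, total, pools):
    """Alternative for comparison: unweighted mean of per-node usage."""
    per_pool = defaultdict(list)
    for node, pool in pools.items():
        per_pool[pool].append(100 - 100 * available[node] / total[node])
    return {pool: sum(v) / len(v) for pool, v in per_pool.items()}


weighted = pool_memory_usage_percent(mem_available, mem_total, node_pool)
unweighted = mean_of_node_percentages(mem_available, mem_total, node_pool)
print(weighted)    # {'pool-a': 37.5, 'pool-b': 75.0}
print(unweighted)  # {'pool-a': 50.0, 'pool-b': 75.0}
```

For pool-a the two numbers differ (37.5 versus 50.0) because node-2 holds three quarters of the pool's memory. The CPU usage query in item 3 uses avg instead, so it behaves like an unweighted mean over cores.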


3.2 Visualizing Node Pool Data in Grafana

  1. The finished Grafana dashboard looks like this:
    [Screenshot: Grafana node pool dashboard]

  2. Besides open-source Grafana, the CCE console also ships an out-of-the-box node pool view.
    [Screenshot: node pool view in the CCE console]

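  3. The dashboard file is provided below for reference:
    Note: when the Grafana version or the data source changes, the dashboard is not guaranteed to be fully compatible and may need minor adjustments.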

    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": "-- Grafana --",
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "type": "dashboard"
          }
        ]
      },
      "editable": true,
      "gnetId": null,
      "graphTooltip": 0,
      "id": 15,
      "iteration": 1706496831686,
      "links": [],
      "panels": [
        {
          "datasource": "prometheus",
          "description": "Average CPU allocation rate across all nodes in the node pool",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "graph": false,
                  "legend": false,
                  "tooltip": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": true
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "short"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 7,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 2,
          "options": {
            "graph": {},
            "legend": {
              "calcs": [
                "last"
              ],
              "displayMode": "list",
              "placement": "bottom"
            },
            "tooltipOptions": {
              "mode": "single"
            }
          },
          "pluginVersion": "7.5.17",
          "targets": [
            {
              "exemplar": true,
              "expr": "sum((sum(kube_pod_container_resource_requests{job=\"kube-state-metrics-prom\", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job=\"kube-state-metrics-prom\",phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job=\"kube-state-metrics-prom\", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)*100",
              "interval": "",
              "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}",
              "queryType": "randomWalk",
              "refId": "A"
            }
          ],
          "timeFrom": null,
          "timeShift": null,
          "title": "Node pool CPU allocation rate",
          "type": "timeseries"
        },
        {
          "datasource": "prometheus",
          "description": "Average CPU usage of the node pool",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "graph": false,
                  "legend": false,
                  "tooltip": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": true
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "short"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 7,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 4,
          "options": {
            "graph": {},
            "legend": {
              "calcs": [
                "last"
              ],
              "displayMode": "list",
              "placement": "bottom"
            },
            "tooltipOptions": {
              "mode": "single"
            }
          },
          "pluginVersion": "7.5.17",
          "targets": [
            {
              "exemplar": true,
              "expr": "(100-avg by (label_cce_cloud_com_cce_nodepool)((irate (node_cpu_seconds_total{job=\"node-exporter\", mode=\"idle\"}[5m]))*on(node)group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"}*100))",
              "interval": "",
              "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}",
              "queryType": "randomWalk",
              "refId": "A"
            }
          ],
          "timeFrom": null,
          "timeShift": null,
          "title": "Node pool average CPU usage",
          "type": "timeseries"
        },
        {
          "datasource": "prometheus",
          "description": "Node pool memory allocation rate",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "graph": false,
                  "legend": false,
                  "tooltip": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": true
              },
              "decimals": 2,
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "short"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 7,
            "w": 12,
            "x": 0,
            "y": 7
          },
          "id": 6,
          "options": {
            "graph": {},
            "legend": {
              "calcs": [
                "last"
              ],
              "displayMode": "list",
              "placement": "bottom"
            },
            "tooltipOptions": {
              "mode": "single"
            }
          },
          "pluginVersion": "7.5.17",
          "targets": [
            {
              "exemplar": true,
              "expr": "sum((sum(kube_pod_container_resource_requests{job=\"kube-state-metrics-prom\",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job=\"kube-state-metrics-prom\", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job=\"kube-state-metrics-prom\",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)*100\r\n",
              "interval": "",
              "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}",
              "queryType": "randomWalk",
              "refId": "A"
            }
          ],
          "timeFrom": null,
          "timeShift": null,
          "title": "Node pool memory allocation rate",
          "type": "timeseries"
        },
        {
          "datasource": "prometheus",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "graph": false,
                  "legend": false,
                  "tooltip": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": true
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "short"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 7,
            "w": 12,
            "x": 12,
            "y": 7
          },
          "id": 8,
          "options": {
            "graph": {},
            "legend": {
              "calcs": [
                "last"
              ],
              "displayMode": "list",
              "placement": "bottom"
            },
            "tooltipOptions": {
              "mode": "single"
            }
          },
          "pluginVersion": "7.5.17",
          "targets": [
            {
              "exemplar": true,
              "expr": "100-100*(sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemAvailable_bytes{job=\"node-exporter\"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})/sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemTotal_bytes{job=\"node-exporter\"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"}))",
              "interval": "",
              "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}",
              "queryType": "randomWalk",
              "refId": "A"
            }
          ],
          "title": "Node pool average memory usage",
          "type": "timeseries"
        },
        {
          "datasource": "prometheus",
          "description": "Change in the number of nodes in the node pool",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisLabel": "",
                "axisPlacement": "auto",
                "axisSoftMax": 4,
                "axisSoftMin": 0,
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 10,
                "gradientMode": "none",
                "hideFrom": {
                  "graph": false,
                  "legend": false,
                  "tooltip": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "never",
                "spanNulls": true
              },
              "decimals": 0,
              "mappings": [],
              "min": 0,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "short"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 7,
            "w": 12,
            "x": 0,
            "y": 14
          },
          "id": 10,
          "options": {
            "graph": {},
            "legend": {
              "calcs": [
                "last"
              ],
              "displayMode": "list",
              "placement": "bottom"
            },
            "tooltipOptions": {
              "mode": "single"
            }
          },
          "pluginVersion": "7.5.17",
          "targets": [
            {
              "exemplar": true,
              "expr": "count(kube_node_info{job=\"kube-state-metrics-prom\"}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)",
              "interval": "",
              "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}",
              "queryType": "randomWalk",
              "refId": "A"
            }
          ],
          "title": "Node pool node count trend",
          "type": "timeseries"
        }
      ],
      "refresh": false,
      "schemaVersion": 27,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "allValue": null,
            "current": {
              "selected": true,
              "tags": [],
              "text": [
                "All"
              ],
              "value": [
                "$__all"
              ]
            },
            "datasource": "prometheus",
            "definition": "label_values(kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool!=\"\"},label_cce_cloud_com_cce_nodepool)",
            "description": "Select node pool",
            "error": null,
            "hide": 0,
            "includeAll": true,
            "label": "Node pool",
            "multi": true,
            "name": "nodepool",
            "options": [
              {
                "selected": true,
                "text": "All",
                "value": "$__all"
              },
              {
                "selected": false,
                "text": "for-group-one",
                "value": "for-group-one"
              },
              {
                "selected": false,
                "text": "hujiamin-use",
                "value": "hujiamin-use"
              }
            ],
            "query": {
              "query": "label_values(kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool!=\"\"},label_cce_cloud_com_cce_nodepool)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 0,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "tagValuesQuery": "",
            "tags": [],
            "tagsQuery": "",
            "type": "query",
            "useTags": false
          }
        ]
      },
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "timepicker": {},
      "timezone": "",
      "title": "Node pool view",
      "uid": "NyIBVctIz",
      "version": 3
    }
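If your Prometheus data source is not named "prometheus", one of the minor adjustments mentioned in the note can be scripted. The sketch below is only a starting point under assumptions: it handles dashboards like this one, where each datasource field is a plain name string (newer Grafana exports may store an object instead), and both retarget_dashboard and the name "my-prometheus" are hypothetical.

```python
import copy
import json


def retarget_dashboard(dashboard, old_name, new_name):
    """Return a copy whose string `datasource` fields equal to old_name are renamed."""
    patched = copy.deepcopy(dashboard)

    def walk(node):
        if isinstance(node, dict):
            if node.get("datasource") == old_name:
                node["datasource"] = new_name
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(patched)
    # Clear the numeric id so an import does not collide with an existing dashboard.
    patched["id"] = None
    return patched


# Trimmed stand-in for the dashboard JSON above; load the real file with json.load.
sample = {
    "id": 15,
    "annotations": {"list": [{"datasource": "-- Grafana --"}]},
    "panels": [{"datasource": "prometheus", "title": "panel"}],
    "templating": {"list": [{"datasource": "prometheus", "name": "nodepool"}]},
}

patched = retarget_dashboard(sample, "prometheus", "my-prometheus")
print(json.dumps(patched, ensure_ascii=False, indent=2))
```

The built-in "-- Grafana --" annotation source is left alone because only exact matches of the old name are replaced, and the input dictionary itself is not modified.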
    