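Visual Monitoring of CCE Cluster Node Pools
1 Background
In large-scale cluster deployments, CCE clusters provide a node pool feature to help users manage the nodes in a Kubernetes cluster. Node pools allow nodes to be scaled out and in dynamically.
Against this background, monitoring node pool state becomes especially important: you want to know when a node pool scales and how often it oscillates between scaling out and scaling in. You also need to monitor the pool's overall resource allocation rate and resource usage rate, which makes it easier to judge whether resources are allocated to workloads sensibly.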
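2 Solution Overview
- Track the number of nodes per node pool with vector-matching queries over the node-level metrics exposed by kube-state-metrics (ksm). The metrics used are kube_node_labels and kube_node_info.
- Monitor node pool CPU and memory allocation rates with vector-matching queries over the ksm metrics kube_pod_container_resource_requests and kube_node_status_allocatable.
- Monitor node pool resource usage by combining the ksm metric kube_node_labels with node-exporter metrics such as node_cpu_seconds_total.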
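3 Walkthrough
The walkthrough has two parts:
- First, write and verify the PromQL queries. They involve vector matching and other non-trivial operations, so they need to be written with care.
- Then build a Grafana dashboard from those queries to provide a visual interface.
Prerequisite: the kube-prometheus-stack add-on is installed in the CCE cluster.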
3.1 Writing the PromQL Queries
- Number of nodes in a node pool
count(kube_node_info{job="kube-state-metrics-prom"} * on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom", label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)
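The query takes the kube_node_info series as the base and matches them against kube_node_labels on the node label, which both metrics carry. group_left(label_cce_cloud_com_cce_nodepool) copies the label_cce_cloud_com_cce_nodepool label, which holds the node pool name, from kube_node_labels onto the matched kube_node_info series. group_left declares the left-hand vector as the "many" side of the match, meaning several left-hand series may match one right-hand series; it does not mean the left side has more labels. Counting the result by label_cce_cloud_com_cce_nodepool then gives the number of nodes in each pool.
Compared with the node list of the CCE node pool: the numbers match.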
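To make the matching semantics concrete, here is a minimal Python sketch that emulates `* on(node) group_left(...)` followed by `count ... by (...)` on in-memory data. The sample series, node names, and pool names are hypothetical, and the helper functions are illustrative only; they are not part of Prometheus.

```python
from collections import Counter

POOL = "label_cce_cloud_com_cce_nodepool"

# Hypothetical instant vectors: each series is (labels, value).
kube_node_info = [
    ({"node": "node-a", "kernel_version": "5.10"}, 1),
    ({"node": "node-b", "kernel_version": "5.10"}, 1),
    ({"node": "node-c", "kernel_version": "5.10"}, 1),
    ({"node": "node-d", "kernel_version": "5.10"}, 1),
]
kube_node_labels = [
    ({"node": "node-a", POOL: "pool-1"}, 1),
    ({"node": "node-b", POOL: "pool-1"}, 1),
    ({"node": "node-c", POOL: "pool-2"}, 1),
    ({"node": "node-d", POOL: ""}, 1),  # node that belongs to no pool
]


def mul_on_group_left(left, right, on, copy_label):
    """Emulate `left * on(<on>) group_left(<copy_label>) right`."""
    # The right side is the "one" side: at most one series per `on` value.
    index = {labels[on]: (labels, value) for labels, value in right}
    result = []
    for labels, value in left:
        match = index.get(labels[on])
        if match is None:
            continue  # series without a match are dropped
        right_labels, right_value = match
        merged = dict(labels)
        merged[copy_label] = right_labels[copy_label]
        result.append((merged, value * right_value))
    return result


def count_by(vector, label):
    """Emulate `count(vector) by (<label>)`."""
    return dict(Counter(labels[label] for labels, _ in vector))


# Selector `label_cce_cloud_com_cce_nodepool != ""` on the right-hand side.
pooled = [series for series in kube_node_labels if series[0][POOL] != ""]
joined = mul_on_group_left(kube_node_info, pooled, "node", POOL)
nodes_per_pool = count_by(joined, POOL)
print(nodes_per_pool)
```

Because both metrics have the value 1, the multiplication only serves to attach the pool label; the count is what carries the result.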
- Node pool CPU allocation rate
sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job="kube-state-metrics-prom",resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)*100
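A node's CPU allocation rate is generally the sum of the CPU requests of all containers on the node divided by the node's allocatable CPU.
For a node pool, the calculation is best broken down step by step:
a. First, take the CPU request of each container and aggregate with sum to get the CPU request of each pod, labelled with its node pool:
sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""}) by(label_cce_cloud_com_cce_nodepool,pod,namespace)
b. The pods returned above may include pods in the Failed or Succeeded phase, whose resources have already been released. A second vector match therefore keeps only pods that are not in those phases (in practice mostly Running pods). Grouping by node pool then gives the total CPU requested by pods in each pool: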
sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace)) * on(pod,namespace) group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)
c. Next, query the allocatable resources of the nodes in each node pool with the kube_node_status_allocatable metric, and sum by node pool to get the total allocatable CPU of each pool:
sum(kube_node_status_allocatable{job="kube-state-metrics-prom",resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)
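d. Finally, divide the requested amount from step b by the allocatable amount from step c and multiply by 100 to get the allocation rate of the node pool as a percentage.
Compared with the monitoring data shown for the CCE node pool: identical.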
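Steps a to d can be sketched in plain Python to check the arithmetic by hand. All pod names, node names, and numbers below are hypothetical sample data; the sketch mirrors the logic of the query rather than calling Prometheus.

```python
from collections import defaultdict

# Hypothetical sample data.
container_requests = [  # (namespace, pod, node, requested CPU cores)
    ("default", "web-1", "node-a", 0.5),
    ("default", "web-1", "node-a", 0.25),  # second container of the same pod
    ("default", "job-1", "node-a", 1.0),
    ("kube-system", "dns-1", "node-b", 0.25),
]
pod_phase = {
    ("default", "web-1"): "Running",
    ("default", "job-1"): "Succeeded",  # finished, resources released
    ("kube-system", "dns-1"): "Running",
}
node_pool = {"node-a": "pool-1", "node-b": "pool-2"}
node_allocatable = {"node-a": 4.0, "node-b": 2.0}  # allocatable CPU cores

# a. Sum container requests per pod and attach the node pool of the node.
pod_requests = defaultdict(float)
for namespace, pod, node, cores in container_requests:
    pool = node_pool.get(node)
    if pool:  # nodes outside any pool are dropped, like the != "" selector
        pod_requests[(pool, namespace, pod)] += cores

# b. Keep pods that have a phase other than Failed/Succeeded, then sum per pool.
requested = defaultdict(float)
for (pool, namespace, pod), cores in pod_requests.items():
    phase = pod_phase.get((namespace, pod))
    if phase is None or phase in ("Failed", "Succeeded"):
        continue
    requested[pool] += cores

# c. Sum allocatable CPU per pool.
allocatable = defaultdict(float)
for node, cores in node_allocatable.items():
    allocatable[node_pool[node]] += cores

# d. Allocation rate in percent.
allocation_rate = {
    pool: 100 * requested[pool] / allocatable[pool] for pool in allocatable
}
print(allocation_rate)
```

In this sample, job-1 requests a full core but is excluded in step b, which is why pool-1 ends up well below the value a naive sum of requests would give.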
- Node pool CPU usage
(100-avg by (label_cce_cloud_com_cce_nodepool)((irate (node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))*on(node)group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""}*100))
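CPU usage is the time spent in all CPU states other than idle, divided by total CPU time. Here irate over the idle counter gives the fraction of time each core spent idle, avg averages that fraction over all cores in the pool, and subtracting the resulting percentage from 100 gives the usage.
irate is computed from the last two samples in the range, so it reacts faster to short spikes than rate, which averages over the whole window and can smooth spikes away (the long-tail problem).
The node_cpu_seconds_total metric carries no node pool label, so it has to be matched against kube_node_labels on the node label to attach one.
The result is broadly consistent with the node pool monitoring data in the CCE cluster.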
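The calculation can be followed on a few hand-made samples. The sketch below is a simplified Python emulation under stated assumptions: the counter values, the 16-second scrape interval, and the node names are hypothetical, and the `irate` helper ignores counter resets, which the real PromQL function handles.

```python
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples.

    Simplified stand-in for PromQL irate(); counter resets are ignored.
    """
    (t0, v0), (t1, v1) = samples[-2:]
    return (v1 - v0) / (t1 - t0)


# Hypothetical idle-time counters, keyed by (node, cpu).
idle_seconds = {
    ("node-a", "0"): [(0, 100.0), (16, 110.0), (32, 122.0)],
    ("node-a", "1"): [(0, 50.0), (16, 58.0), (32, 62.0)],
    ("node-b", "0"): [(0, 10.0), (16, 20.0), (32, 34.0)],
}
node_pool = {"node-a": "pool-1", "node-b": "pool-2"}

# Attach the pool to every per-core series, then average the idle fraction.
idle_fractions = defaultdict(list)
for (node, cpu), samples in idle_seconds.items():
    idle_fractions[node_pool[node]].append(irate(samples))

cpu_usage = {
    pool: 100 - 100 * sum(values) / len(values)
    for pool, values in idle_fractions.items()
}
print(cpu_usage)
```

Only the last two samples of each series influence the result, which is exactly why irate is sensitive to what happened in the most recent scrape interval.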
- Node pool memory allocation rate
sum((sum(kube_pod_container_resource_requests{job="kube-state-metrics-prom",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job="kube-state-metrics-prom", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job="kube-state-metrics-prom", resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})by(label_cce_cloud_com_cce_nodepool)*100
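The principle is the same as for the CPU allocation rate query above, with resource='memory' and unit='byte' in place of the CPU selectors.
The data is broadly consistent with the CCE monitoring data.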
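Since the CPU and memory allocation queries differ only in the resource and unit selectors, one way to avoid maintaining two long strings is to generate them. The helper below is a hypothetical convenience, not something the article or Prometheus provides; it emits a query equivalent to the ones above with normalized spacing and quoting.

```python
POOL = "label_cce_cloud_com_cce_nodepool"
KSM_JOB = "kube-state-metrics-prom"  # job name used throughout this article


def allocation_rate_query(resource: str, unit: str) -> str:
    """Build the node pool allocation-rate PromQL for one resource type."""
    node_labels = f'kube_node_labels{{job="{KSM_JOB}",{POOL}!=""}}'
    join = f"* on(node) group_left({POOL}) {node_labels}"
    selector = f'{{job="{KSM_JOB}",resource="{resource}",unit="{unit}"}}'

    # Step a: per-pod requests labelled with the node pool.
    per_pod = (
        f"sum(kube_pod_container_resource_requests{selector} {join}) "
        f"by ({POOL},pod,namespace)"
    )
    # Step b: keep pods that are not Failed or Succeeded, sum per pool.
    active = (
        f'(sum(kube_pod_status_phase{{job="{KSM_JOB}",'
        f'phase!~"Failed|Succeeded"}}) by (pod,namespace) == 1)'
    )
    requested = (
        f"sum(({per_pod}) * on(pod,namespace) group_right({POOL}) {active}) "
        f"by ({POOL})"
    )
    # Step c: allocatable amount per pool.
    allocatable = (
        f"sum(kube_node_status_allocatable{selector} {join}) by ({POOL})"
    )
    # Step d: percentage.
    return f"{requested} / {allocatable} * 100"


cpu_query = allocation_rate_query("cpu", "core")
memory_query = allocation_rate_query("memory", "byte")
print(memory_query)
```

The generated strings can be pasted into the Prometheus expression browser to confirm they return the same series as the hand-written queries.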
- Node pool memory usage
100-100*(sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemAvailable_bytes{job="node-exporter"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job="kube-state-metrics-prom",label_cce_cloud_com_cce_nodepool!=""})/sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemTotal_bytes{job="node-exporter"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{label_cce_cloud_com_cce_nodepool!=""}))
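Available memory on a node: node_memory_MemAvailable_bytes, the memory available as seen from the application's point of view.
Total memory on a node: node_memory_MemTotal_bytes.
At the node pool level, both metrics are matched against kube_node_labels, summed per pool, and turned into a usage percentage as 100 - 100 * available / total.
The monitoring data is broadly consistent.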
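A small worked example shows why the query sums available and total memory per pool before dividing, rather than averaging per-node percentages. The node names and sizes are hypothetical.

```python
from collections import defaultdict

GIB = 1024 ** 3

# Hypothetical per-node values, as node-exporter would report them in bytes.
node_pool = {"node-a": "pool-1", "node-b": "pool-1", "node-c": "pool-2"}
mem_available = {"node-a": 4 * GIB, "node-b": 2 * GIB, "node-c": 12 * GIB}
mem_total = {"node-a": 8 * GIB, "node-b": 8 * GIB, "node-c": 16 * GIB}

# sum by (pool) of available and total memory.
available_by_pool = defaultdict(int)
total_by_pool = defaultdict(int)
for node, pool in node_pool.items():
    available_by_pool[pool] += mem_available[node]
    total_by_pool[pool] += mem_total[node]

# 100 - 100 * available / total, per pool.
memory_usage = {
    pool: 100 - 100 * available_by_pool[pool] / total_by_pool[pool]
    for pool in total_by_pool
}
print(memory_usage)
```

Dividing the sums weights each node by its memory size, so a pool with mixed node flavors still reports the share of its total memory that is in use.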
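3.2 Connecting Grafana to Visualize Node Pool Information
- The Grafana dashboard looks as follows.
- Besides open-source Grafana, the CCE console also integrates an out-of-the-box node pool view.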
- Reference dashboard file:
Note: when the Grafana version or the data source changes, the dashboard may not be fully compatible and may need minor adjustments.
{ "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "editable": true, "gnetId": null, "graphTooltip": 0, "id": 15, "iteration": 1706496831686, "links": [], "panels": [ { "datasource": "prometheus", "description": "Average CPU allocation rate across all nodes in the node pool", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "graph": false, "legend": false, "tooltip": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 7, "w": 12, "x": 0, "y": 0 }, "id": 2, "options": { "graph": {}, "legend": { "calcs": [ "last" ], "displayMode": "list", "placement": "bottom" }, "tooltipOptions": { "mode": "single" } }, "pluginVersion": "7.5.17", "targets": [ { "exemplar": true, "expr": "sum((sum(kube_pod_container_resource_requests{job=\"kube-state-metrics-prom\", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job=\"kube-state-metrics-prom\",phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job=\"kube-state-metrics-prom\", resource='cpu',unit='core'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)*100", "interval": "", "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}", "queryType": "randomWalk", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Node pool CPU allocation rate", "type": "timeseries" }, { "datasource": "prometheus", "description": "Average node pool CPU usage", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "graph": false, "legend": false, "tooltip": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 7, "w": 12, "x": 12, "y": 0 }, "id": 4, "options": { "graph": {}, "legend": { "calcs": [ "last" ], "displayMode": "list", "placement": "bottom" }, "tooltipOptions": { "mode": "single" } }, "pluginVersion": "7.5.17", "targets": [ { "exemplar": true, "expr": "(100-avg by (label_cce_cloud_com_cce_nodepool)((irate (node_cpu_seconds_total{job=\"node-exporter\", mode=\"idle\"}[5m]))*on(node)group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"}*100))", "interval": "", "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}", "queryType": "randomWalk", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Average node pool CPU usage", "type": "timeseries" }, { "datasource": "prometheus", "description": "Node pool memory allocation rate", "fieldConfig": 
{ "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "graph": false, "legend": false, "tooltip": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true }, "decimals": 2, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 7, "w": 12, "x": 0, "y": 7 }, "id": 6, "options": { "graph": {}, "legend": { "calcs": [ "last" ], "displayMode": "list", "placement": "bottom" }, "tooltipOptions": { "mode": "single" } }, "pluginVersion": "7.5.17", "targets": [ { "exemplar": true, "expr": "sum((sum(kube_pod_container_resource_requests{job=\"kube-state-metrics-prom\",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool,pod,namespace))*on(pod,namespace)group_right(label_cce_cloud_com_cce_nodepool)(sum(kube_pod_status_phase{job=\"kube-state-metrics-prom\", phase!~'Failed|Succeeded'})by(pod,namespace)==1))by(label_cce_cloud_com_cce_nodepool)/sum(kube_node_status_allocatable{job=\"kube-state-metrics-prom\",resource='memory',unit='byte'}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)*100", "interval": "", "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}", "queryType": "randomWalk", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "Node pool memory allocation rate", "type": "timeseries" }, { "datasource": "prometheus", 
"fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "graph": false, "legend": false, "tooltip": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 7, "w": 12, "x": 12, "y": 7 }, "id": 8, "options": { "graph": {}, "legend": { "calcs": [ "last" ], "displayMode": "list", "placement": "bottom" }, "tooltipOptions": { "mode": "single" } }, "pluginVersion": "7.5.17", "targets": [ { "exemplar": true, "expr": "100-100*(sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemAvailable_bytes{job=\"node-exporter\"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})/sum by (label_cce_cloud_com_cce_nodepool)(node_memory_MemTotal_bytes{job=\"node-exporter\"}*on(node) group_left(label_cce_cloud_com_cce_nodepool) kube_node_labels{label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"}))", "interval": "", "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}", "queryType": "randomWalk", "refId": "A" } ], "title": "Average node pool memory usage", "type": "timeseries" }, { "datasource": "prometheus", "description": "Changes in the number of nodes in the node pool", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "axisSoftMax": 4, "axisSoftMin": 0, "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "graph": false, "legend": false, "tooltip": false }, "lineInterpolation": "linear", 
"lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true }, "decimals": 0, "mappings": [], "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 7, "w": 12, "x": 0, "y": 14 }, "id": 10, "options": { "graph": {}, "legend": { "calcs": [ "last" ], "displayMode": "list", "placement": "bottom" }, "tooltipOptions": { "mode": "single" } }, "pluginVersion": "7.5.17", "targets": [ { "exemplar": true, "expr": "count(kube_node_info{job=\"kube-state-metrics-prom\"}*on(node)group_left(label_cce_cloud_com_cce_nodepool)kube_node_labels{job=\"kube-state-metrics-prom\", label_cce_cloud_com_cce_nodepool=~\"$nodepool\",label_cce_cloud_com_cce_nodepool!=\"\"})by(label_cce_cloud_com_cce_nodepool)", "interval": "", "legendFormat": "Node pool: {{label_cce_cloud_com_cce_nodepool}}", "queryType": "randomWalk", "refId": "A" } ], "title": "Node pool node count trend", "type": "timeseries" } ], "refresh": false, "schemaVersion": 27, "style": "dark", "tags": [], "templating": { "list": [ { "allValue": null, "current": { "selected": true, "tags": [], "text": [ "All" ], "value": [ "$__all" ] }, "datasource": "prometheus", "definition": "label_values(kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool!=\"\"},label_cce_cloud_com_cce_nodepool)", "description": "Select node pool", "error": null, "hide": 0, "includeAll": true, "label": "Node pool", "multi": true, "name": "nodepool", "options": [ { "selected": true, "text": "All", "value": "$__all" }, { "selected": false, "text": "for-group-one", "value": "for-group-one" }, { "selected": false, "text": "hujiamin-use", "value": "hujiamin-use" } ], "query": { "query": "label_values(kube_node_labels{job=\"kube-state-metrics-prom\",label_cce_cloud_com_cce_nodepool!=\"\"},label_cce_cloud_com_cce_nodepool)", "refId": "StandardVariableQuery" }, "refresh": 0, "regex": "", 
"skipUrlSync": false, "sort": 0, "tagValuesQuery": "", "tags": [], "tagsQuery": "", "type": "query", "useTags": false } ] }, "time": { "from": "now-1h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Node pool view", "uid": "NyIBVctIz", "version": 3 }
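Before importing an adjusted copy of the dashboard, it can help to check that every panel query still filters on the `$nodepool` template variable. The sketch below is one possible check, not part of Grafana; the `panel_report` helper is hypothetical and the sample is a heavily trimmed stand-in for the real file.

```python
import json

POOL_VAR = "$nodepool"  # template variable defined in the dashboard


def panel_report(dashboard):
    """Return (panel title, expr uses $nodepool) for every query target."""
    report = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            uses_var = POOL_VAR in target.get("expr", "")
            report.append((panel.get("title", ""), uses_var))
    return report


# Hypothetical, heavily trimmed stand-in for the dashboard file.
sample = {
    "title": "Node pool view",
    "panels": [
        {
            "title": "Node count",
            "targets": [
                {
                    "expr": 'count(kube_node_labels{label_cce_cloud_com_cce_nodepool=~"$nodepool"}) '
                    "by (label_cce_cloud_com_cce_nodepool)"
                }
            ],
        },
        {"title": "Unfiltered", "targets": [{"expr": "count(kube_node_info)"}]},
    ],
}

# Round-trip through JSON text, as if the dashboard had been read from a file.
dashboard = json.loads(json.dumps(sample))
report = panel_report(dashboard)
for title, uses_var in report:
    print(f"{title}: {'ok' if uses_var else 'does not use ' + POOL_VAR}")
```

For the real file, replace the sample with `json.load` on the saved dashboard; a parse error at this stage also catches JSON that was damaged while copying.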