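Prometheus Series: Installing and Using node_exporter
1. Overview
Prometheus collects most of its monitoring data through exporters, for example:
node_exporter collects host-level metrics;
jmx_exporter collects runtime metrics from Java applications;
mysqld_exporter collects MySQL metrics;
redis_exporter collects Redis metrics;
blackbox_exporter probes endpoints over HTTP (GET and POST), HTTPS/TLS, DNS, TCP and ICMP;
snmp_exporter collects metrics from network devices;
pushgateway accepts pushed metrics, which makes collection possible across network boundaries that Prometheus cannot scrape through.
Of these, node_exporter's collectors (in particular the textfile collector) and pushgateway can be used to expose custom metrics.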
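As a concrete illustration of a custom metric, here is a minimal sketch that publishes a gauge through node_exporter's textfile collector. It assumes node_exporter is started with --collector.textfile.directory pointing at the directory you pass in; the helper name write_textfile_metric and the metric name demo_backup_age_seconds are made up for this example.

```shell
#!/bin/sh
# Sketch: publish one custom gauge via node_exporter's textfile collector.
# Assumption: node_exporter runs with --collector.textfile.directory=<DIR>.
# write_textfile_metric is a hypothetical helper, not part of node_exporter.

# Usage: write_textfile_metric DIR NAME HELP VALUE
write_textfile_metric() {
    dir=$1; name=$2; help=$3; value=$4
    # Temporary name does not end in .prom, so the collector ignores it.
    tmp="$dir/$name.prom.$$"
    {
        printf '# HELP %s %s\n' "$name" "$help"
        printf '# TYPE %s gauge\n' "$name"
        printf '%s %s\n' "$name" "$value"
    } > "$tmp" || return 1
    # Rename into place so a scrape never sees a half-written file.
    mv "$tmp" "$dir/$name.prom"
}
```

Called from cron, for instance as write_textfile_metric /path/to/textfile_dir demo_backup_age_seconds "Seconds since the last backup" 42, the value should then show up on the exporter's /metrics page at the next scrape.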
2. Installing node_exporter
# Download node_exporter
cd /usr/local/src/
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
# Extract node_exporter
mkdir -pv /usr/local/prometheus/
tar xzf node_exporter-1.1.2.linux-amd64.tar.gz -C /usr/local/prometheus/
ln -s /usr/local/prometheus/node_exporter-1.1.2.linux-amd64 /usr/local/prometheus/node_exporter
# Create the systemd unit file (use either this or the supervisor setup below, not both).
cat > /usr/lib/systemd/system/node_exporter.service << "EOF"
[Unit]
Description=Prometheus node_exporter
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/prometheus/node_exporter/node_exporter --web.listen-address=0.0.0.0:9100
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
# Manage node_exporter with supervisor (use either this or the systemd unit above, not both).
cat > /etc/supervisord.d/node_exporter.ini << "EOF"
[program:node_exporter]
command=/usr/local/prometheus/node_exporter/node_exporter --web.listen-address=0.0.0.0:9100 ; the program (relative uses PATH, can take args)
numprocs=1 ; number of processes copies to start (def 1)
directory=/usr/local/prometheus/node_exporter ; directory to cwd to before exec (def no cwd)
autostart=true ; start at supervisord start (default: true)
autorestart=true ; restart at unexpected quit (default: true)
startsecs=30 ; number of secs prog must stay running (def. 1)
startretries=3 ; max # of serial start failures (default 3)
exitcodes=0,2 ; 'expected' exit codes for process (default 0,2)
stopsignal=QUIT ; signal used to kill process (default TERM)
stopwaitsecs=10 ; max num secs to wait b4 SIGKILL (default 10)
user=root ; setuid to this UNIX account to run the program
redirect_stderr=true ; redirect proc stderr to stdout (default false)
stdout_logfile=/usr/local/prometheus/node_exporter/node_exporter.stdout.log ; stdout log path, NONE for none; default AUTO
stdout_logfile_maxbytes=64MB ; max # logfile bytes b4 rotation (default 50MB)
stdout_logfile_backups=4 ; # of stdout logfile backups (default 10)
stdout_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
stdout_events_enabled=false ; emit events on stdout writes (default false)
stopasgroup=true
killasgroup=true
EOF
# Start the service
# Start with systemd
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter
# Start with supervisor
supervisorctl update
supervisorctl status
supervisorctl start node_exporter
supervisorctl restart node_exporter
# Check that it started
ss -untlp |grep 9100
ps -ef |grep node_exporter
If the service fails to start: with systemd, check the error with journalctl -xe; with supervisor, check the log file.
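Beyond checking the port and the process, it can help to confirm that the exporter actually serves metrics. The following is a small sketch, assuming curl and awk are available. metric_value is a hypothetical helper name, and node_load1 is used only because it is a simple unlabeled gauge from the default loadavg collector.

```shell
#!/bin/sh
# Sketch: pull one sample out of Prometheus text exposition output.
# metric_value is a hypothetical helper; it only handles metrics without labels.

# Usage: <exposition text on stdin> | metric_value NAME
metric_value() {
    # Comment lines start with "#", so their first field never equals NAME.
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example against a running exporter (not executed here):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```

An empty result suggests either that the exporter is not answering on that address or that the collector providing the metric is disabled.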
3. Configuring prometheus-server to scrape node metrics
Edit the prometheus-server configuration file /usr/local/prometheus/prometheus.yml and add the node scrape job at the end.
vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
...... some configuration omitted here ......
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # monitor ecs node_exporter
  - job_name: "node_status"
    file_sd_configs:
      - refresh_interval: 30s
        files:
          - ./filesd/node/*.yml
One thing worth mentioning here: to restrict which collectors a scrape returns, add a params block like the following to the job:
  # Scrape the node_exporter running on the Prometheus host
  - job_name: 'prometheus_server'
    static_configs:
      - targets: ['localhost:9100']
    params:
      collect[]:
        - cpu
        - meminfo
        - diskstats
        - netdev
        - netstat
        - filefd
        - filesystem
        - xfs
        - loadavg
        - sockstat
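This only filters what is returned for that particular scrape request, however. It does not change which collectors are enabled in node_exporter itself, so any other client that scrapes /metrics without the filter still triggers every enabled collector. To control this at the source, start node_exporter with --no-collector.<name> to disable a collector, or with --collector.<name> to enable an additional one.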
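To make the flag form concrete, here is a rough sketch that assembles a node_exporter command line with a few collectors switched off. The binary path and listen address are the ones used earlier in this post; build_node_exporter_cmd is a hypothetical helper, and xfs and netstat are just example collector names.

```shell
#!/bin/sh
# Sketch: build a node_exporter command line that disables selected collectors.
# build_node_exporter_cmd is a hypothetical helper for illustration only.

# Usage: build_node_exporter_cmd [COLLECTOR...]
build_node_exporter_cmd() {
    cmd="/usr/local/prometheus/node_exporter/node_exporter --web.listen-address=0.0.0.0:9100"
    for c in "$@"; do
        # One --no-collector.<name> flag per collector to switch off.
        cmd="$cmd --no-collector.$c"
    done
    printf '%s\n' "$cmd"
}

# Example: print the command with the xfs and netstat collectors disabled.
build_node_exporter_cmd xfs netstat
```

The printed line can be pasted into ExecStart= in the systemd unit or command= in the supervisor file, whichever setup you chose.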
An example target file, one of those matched by ./filesd/node/*.yml:
- labels:
    InstanceId: i-xxxxxxxxxxxxxaf
    Name: ansible
    PrivateIpAddress: 10.xx.xx.xx
    State: running
    category: ops
    drtype: ''
    env: prod
    group: ''
    lifecycle: long
    module: ansible
    node_name: ansible
    project: common
    provider: awscloud
    resource: ecs
    software: ansible
  targets:
    - 10.xx.xx.xx:9100
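Target files like this are easy to generate from a script. Here is a minimal sketch that writes one file per node with only two of the labels shown above. write_target_file is a hypothetical helper, and the node name, address and env value in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: generate a file_sd target file for one node.
# write_target_file is a hypothetical helper; extend the labels as needed.

# Usage: write_target_file DIR NODE_NAME ADDRESS ENV
write_target_file() {
    dir=$1; node=$2; addr=$3; env=$4
    # The temporary name does not match *.yml, so Prometheus skips it.
    tmp="$dir/.$node.yml.tmp"
    cat > "$tmp" <<EOF || return 1
- labels:
    node_name: $node
    env: $env
  targets:
    - $addr
EOF
    # Rename into place so Prometheus never reads a partial file.
    mv "$tmp" "$dir/$node.yml"
}
```

For example, write_target_file ./filesd/node web01 10.0.0.5:9100 prod. Prometheus re-reads file_sd files when they change and at every refresh_interval, so adding a node this way should not require a restart.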
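Once the configuration is in place, prometheus-server has to be restarted or reloaded. There are three ways to do this:
supervisorctl restart prometheus-server
systemctl restart prometheus-server
curl -X POST http://127.0.0.1:9090/-/reload (requires prometheus-server to be started with --web.enable-lifecycle)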
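A broken prometheus.yml can keep the server from coming back up after a restart, so it may be worth validating the file first. This is a sketch only, assuming promtool (shipped in the Prometheus release tarball) is on PATH and that --web.enable-lifecycle is enabled. reload_if_valid is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: validate the config with promtool, then ask Prometheus to reload.
# Assumptions: promtool and curl are on PATH; --web.enable-lifecycle is set.
# reload_if_valid is a hypothetical helper for illustration only.

# Usage: reload_if_valid CONFIG [BASE_URL]
reload_if_valid() {
    config=$1
    url=${2:-http://127.0.0.1:9090}
    promtool check config "$config" || {
        echo "config check failed, not reloading" >&2
        return 1
    }
    # -f makes curl return non-zero on an HTTP error status.
    curl -fsS -X POST "$url/-/reload"
}
```

Usage would be along the lines of reload_if_valid /usr/local/prometheus/prometheus.yml.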
After the restart, the monitored nodes can be seen under Status --> Targets in the Prometheus web UI.
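4. Displaying the monitoring data in Grafana
In Grafana, click the + icon and choose Import to load a dashboard JSON. You will need to adjust a few things yourself, such as the data source, and fill in the dashboard variables to match your own labels.
Look for a suitable dashboard on the Grafana website. The one I use was borrowed from a colleague, so it is not published here.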
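5. Configuring alerting rules
Alerting rule files are referenced from the prometheus-server configuration file:
vim /usr/local/prometheus/prometheus.yml
...... preceding configuration omitted ......
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - ./rules/*yml
...... remaining configuration omitted ......
Then edit the rule file under the rules/ directory:
vim ./rules/hhh.yml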
groups:
  - name: NodeStatsAlert
    # Alert levels: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
    rules:
      # Memory
      - alert: MemUsage
        expr: round(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) < 20 and round(node_memory_MemAvailable_bytes/1024/1024) < 1024
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Memory usage is above 80% and less than 1024 MB is available!"
          description: "Available memory on server {{ $labels.node_name }} is {{ $value }}%. Check as soon as possible whether a process was OOM-killed!"
      #- alert: MemUsage
      #  expr: round(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) < 5 and round(node_memory_MemAvailable_bytes/1024/1024) < 512
      #  for: 5m
      #  labels:
      #    severity: critical
      #    level: 3
      #  annotations:
      #    summary: "Memory usage is above 95% and less than 512 MB is available"
      #    description: "Available memory on server {{ $labels.node_name }} is {{ $value }}%. Check as soon as possible whether a process was OOM-killed!"
      # Network
      - alert: NetworkFlowInOverLoad
        expr: sum without() (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 1000
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Receiving more than 1 GB per second; current rate is {{ $value }} MB/s"
          description: "Server {{ $labels.node_name }} is receiving more than 1 GB per second; current rate is {{ $value }} MB/s"
      - alert: NetworkFlowOutOverLoad
        expr: sum without() (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 1000
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Sending more than 1 GB per second; current rate is {{ $value }} MB/s"
          description: "Server {{ $labels.node_name }} is sending more than 1 GB per second; current rate is {{ $value }} MB/s"
      # # Disk fill prediction
      # - alert: DiskWillFillIn4Hours
      #   expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs", node_name!~"db-backup.*"}[1h], 4 * 3600) < 0
      #   for: 5m
      #   labels:
      #     severity: warning
      #     level: 2
      #   annotations:
      #     summary: "Disk may fill up within the next 4 hours"
      #     description: "The disk on server {{ $labels.node_name }} may fill up within the next 4 hours"
      # Filesystem
      - alert: DiskUsage
        expr: round(node_filesystem_free_bytes/node_filesystem_size_bytes{fstype!~"tmpfs|rootfs"} * 100) <10 and round(node_filesystem_free_bytes{fstype!~"tmpfs|rootfs"}/1024/1024/1024) < 100
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Filesystem usage is above 90% and less than 100 GB is free; {{ $value }}% remains"
          description: "Filesystem {{ $labels.mountpoint }} on server {{ $labels.node_name }} is nearly full; {{ $value }}% remains"
      # Filesystem
      #- alert: DiskUsage
      #  expr: round(node_filesystem_free_bytes/node_filesystem_size_bytes{fstype!~"tmpfs|rootfs"} * 100) <5 or round(node_filesystem_free_bytes{fstype!~"tmpfs|rootfs"}/1024/1024/1024) < 10
      #  for: 5m
      #  labels:
      #    severity: critical
      #    level: 3
      #  annotations:
      #    summary: "Filesystem usage is above 95% or less than 10 GB is free; {{ $value }}% remains"
      #    description: "Filesystem {{ $labels.mountpoint }} on server {{ $labels.node_name }} is nearly full; {{ $value }}% remains"
      # Inodes (the expression yields the percentage of free inodes, so usage above 80% means less than 20% free)
      - alert: InodesUsage
        expr: round(node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint ="/rootfs"} * 100) < 20
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Inode usage is above 80%; {{ $value }}% of inodes are free"
          description: "Inode usage on filesystem {{ $labels.mountpoint }} of server {{ $labels.node_name }} is high; {{ $value }}% of inodes are free"
      #- alert: InodesUsage
      #  expr: round(node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint ="/rootfs"} * 100) < 5
      #  for: 5m
      #  labels:
      #    severity: critical
      #    level: 3
      #  annotations:
      #    summary: "Inode usage is above 95%; {{ $value }}% of inodes are free"
      #    description: "Inode usage on filesystem {{ $labels.mountpoint }} of server {{ $labels.node_name }} is high; {{ $value }}% of inodes are free"
      # Write latency (seconds per write, multiplied by 1000 to get milliseconds)
      - alert: DiskWriteLatency
        expr: round(rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) * 1000) > 10
        for: 3m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Disk write latency is above 10 ms; current latency is {{ $value }} ms"
          description: "Disk write latency on server {{ $labels.node_name }} is above 10 ms; current latency is {{ $value }} ms"
      ## Write latency
      #- alert: DiskWriteLatency
      #  expr: round(rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) * 1000) > 20
      #  for: 3m
      #  labels:
      #    severity: critical
      #    level: 3
      #  annotations:
      #    summary: "Disk write latency is above 20 ms; current latency is {{ $value }} ms"
      #    description: "Disk write latency on server {{ $labels.node_name }} is above 20 ms; current latency is {{ $value }} ms"
      ## CPU usage
      #- alert: CPUUsage
      #  expr: 100 - round(avg without(cpu) (irate(node_cpu_seconds_total{software!~"mysqlbak|flink|skywalking",mode="idle"}[5m])) * 100) > 80
      #  for: 5m
      #  labels:
      #    severity: warning
      #    level: 2
      #  annotations:
      #    summary: "CPU usage is above 80%; current usage is {{ $value }}%"
      #    description: "CPU usage on server {{ $labels.node_name }} is too high; current usage is {{ $value }}%"
      # CPU usage (averaged over the cores of each host; host labels such as node_name are kept)
      - alert: CPUUsage
        expr: 100 - round(avg without(cpu) (irate(node_cpu_seconds_total{software!~"mysqlbak|flink|skywalking",mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: critical
          level: 3
        annotations:
          summary: "CPU usage is above 90%; current usage is {{ $value }}%"
          description: "CPU usage on server {{ $labels.node_name }} is too high; current usage is {{ $value }}%"
      # LoadHigh (node_load1 is a gauge, so it is averaged with avg_over_time rather than rate)
      - alert: LoadHigh
        expr: avg_over_time(node_load1[5m]) > 3
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "CPU load is above 3; current load is {{ $value }}"
          description: "CPU load on server {{ $labels.node_name }} is too high; current load is {{ $value }}"
      # LoadHigh
      - alert: LoadHigh
        expr: avg_over_time(node_load1[5m]) > 5
        for: 5m
        labels:
          severity: critical
          level: 3
        annotations:
          summary: "CPU load is above 5; current load is {{ $value }}"
          description: "CPU load on server {{ $labels.node_name }} is too high; current load is {{ $value }}"
      # Server reboot
      - alert: Restarted
        expr: node_time_seconds - node_boot_time_seconds < 600
        for: 5m
        labels:
          severity: warning
          level: 2
        annotations:
          summary: "Server rebooted!"
          description: "Server {{ $labels.node_name }} has just rebooted. Check that services came back normally; for databases, check high availability and data consistency!"
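      - alert: TooManyOpenFiles
        expr: node_filefd_allocated > 15000
        for: 5m
        labels:
          severity: warning
          Level: P1
          level: 2
        annotations:
          summary: "Too many open file handles; current count is {{ $value }}"
          description: "Too many open file handles; current count is {{ $value }}. This is usually caused by a program that does not close its file handles properly. A very high value can make the host refuse service, so investigate as soon as possible!"
      - alert: TooManyOpenFiles
        expr: node_filefd_allocated > 20000
        for: 5m
        labels:
          severity: critical
          Level: P0
          level: 3
        annotations:
          summary: "Too many open file handles; current count is {{ $value }}"
          description: "Too many open file handles; current count is {{ $value }}. This is usually caused by a program that does not close its file handles properly. A very high value can make the host refuse service, so investigate as soon as possible!"
      - alert: ProcessHigh
        expr: node_processes_pids > 600
        for: 5m
        labels:
          severity: warning
          Level: P0
          level: 2
        annotations:
          summary: "Too many processes; current count is {{ $value }}"
          description: "Too many processes; current count is {{ $value }}. A program bug may be forking too many processes, which can make the server misbehave, so investigate as soon as possible!"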
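Do not use the file above as-is; adapt the thresholds and labels to your own environment.
To verify, you can reboot a machine and see whether an alert fires. Alerts can be viewed in the Prometheus web UI and also in Alertmanager.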
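Before waiting for a real alert, it is probably worth checking that the rule files at least parse. Here is a sketch assuming promtool from the Prometheus release tarball is on PATH. check_rule_files is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: syntax-check rule files with promtool before reloading Prometheus.
# Assumption: promtool is on PATH. check_rule_files is a hypothetical helper.

# Usage: check_rule_files FILE...
check_rule_files() {
    for f in "$@"; do
        # Stop at the first file that fails validation.
        promtool check rules "$f" || {
            echo "rule check failed: $f" >&2
            return 1
        }
    done
}
```

For example, check_rule_files ./rules/*.yml. Once the rules are loaded, pending and firing alerts can also be fetched from the HTTP API with curl -s http://127.0.0.1:9090/api/v1/alerts.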