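Prometheus Series: Alerting Rules and Alerting with Alertmanager
[Abstract] Prometheus: alerting rules and alerting with Alertmanager
1. Overview
Let's get Alertmanager set up.
2. Installing, configuring, and starting Alertmanager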
# Install alertmanager
cd /usr/local/src/
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar xzf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/prometheus/
ln -s /usr/local/prometheus/alertmanager-0.21.0.linux-amd64 /usr/local/prometheus/alertmanager
cd /usr/local/prometheus/alertmanager
# Write the configuration file
# (">" overwrites the sample alertmanager.yml shipped in the tarball; appending with ">>" would leave duplicate top-level keys)
cat > alertmanager.yml << "EOF"
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'ops_notify'
  routes:
  - receiver: ops_notify
    group_wait: 10s
    match_re:
      alertname: 'NodeStatsAlert'
receivers:
- name: 'ops_notify'
  webhook_configs:
  - url: 'http://localhost:5000'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
EOF
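Before starting the service, it can be worth validating the file that was just written. The sketch below assumes the install path used in this article and relies on `amtool`, which ships in the same release tarball as `alertmanager`. The `AM_DIR` variable and the `check_am_config` helper are illustrative names, not part of Alertmanager.

```shell
# AM_DIR is an assumption: the symlinked install directory from this article.
AM_DIR="${AM_DIR:-/usr/local/prometheus/alertmanager}"

# check_am_config AMTOOL CONFIG
# Runs "amtool check-config" and returns its exit status,
# or 2 when the amtool binary is missing.
check_am_config() {
    if [ ! -x "$1" ]; then
        echo "amtool not found at $1" >&2
        return 2
    fi
    "$1" check-config "$2"
}

check_am_config "$AM_DIR/amtool" "$AM_DIR/alertmanager.yml" \
    || echo "configuration check did not pass" >&2
```

A non-zero result here usually points at an indentation or key-name problem in `alertmanager.yml`, which is cheaper to catch now than after a failed start.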
## Choose either one of the following two startup methods
# Start with systemctl
# (the unit assumes that a prometheus user and group already exist)
cat > /usr/lib/systemd/system/alertmanager.service <<"EOF"
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager    # or: stop / restart
# Start with supervisor
cat > /etc/supervisor.d/alertmanager.ini <<"EOF"
[program:alertmanager]
command=/usr/local/prometheus/alertmanager/alertmanager --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml ; the program (relative uses PATH, can take args)
numprocs=1 ; number of processes copies to start (def 1)
directory=/usr/local/prometheus/alertmanager/ ; directory to cwd to before exec (def no cwd)
autostart=true ; start at supervisord start (default: true)
autorestart=true ; restart at unexpected quit (default: true)
startsecs=30 ; number of secs prog must stay running (def. 1)
startretries=3 ; max # of serial start failures (default 3)
exitcodes=0,2 ; 'expected' exit codes for process (default 0,2)
stopsignal=QUIT ; signal used to kill process (default TERM)
stopwaitsecs=10 ; max num secs to wait b4 SIGKILL (default 10)
user=root ; setuid to this UNIX account to run the program
redirect_stderr=true ; redirect proc stderr to stdout (default false)
stdout_logfile=/usr/local/prometheus/alertmanager/alertmanager.stdout.log ; stdout log path, NONE for none; default AUTO
stdout_logfile_maxbytes=64MB ; max # logfile bytes b4 rotation (default 50MB)
stdout_logfile_backups=4 ; # of stdout logfile backups (default 10)
stdout_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
stdout_events_enabled=false ; emit events on stdout writes (default false)
stopasgroup=true
killasgroup=true
EOF
supervisorctl update
supervisorctl start alertmanager    # or: stop / restart
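# Check that the process started successfully
ps -ef | grep alertmanager
# 9093 is the default web/API port, 9094 the default cluster port
ss -luntp | grep -E '9093|9094'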
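Beyond checking the process and the listening ports, the HTTP health endpoint can be queried. This is a minimal sketch that assumes Alertmanager is listening on its default web address and that `curl` is installed; `AM_URL` and `am_healthy` are illustrative names.

```shell
# AM_URL is an assumption: Alertmanager's default web listen address.
AM_URL="${AM_URL:-http://localhost:9093}"

# am_healthy BASE_URL
# Succeeds only when BASE_URL/-/healthy answers with HTTP 200.
am_healthy() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$1/-/healthy")
    [ "$code" = "200" ]
}

if am_healthy "$AM_URL"; then
    echo "alertmanager is healthy at $AM_URL"
else
    echo "alertmanager did not answer at $AM_URL"
fi
```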
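If you need further Alertmanager settings such as additional grouping, add them to the configuration above; different groups can also be routed to different webhooks. This example uses a single alerting webhook only. Email notification is not covered here; look up a guide for it separately.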
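When routes and receivers are added, it helps to check which receiver a given label set would reach. The sketch below uses `amtool config routes test` for that, under the assumption that the install path from this article is in use; `route_for` is an illustrative helper name, and the label values are examples.

```shell
# AM_DIR is an assumption: the symlinked install directory from this article.
AM_DIR="${AM_DIR:-/usr/local/prometheus/alertmanager}"

# route_for AMTOOL CONFIG label=value...
# Asks amtool which receiver(s) the given label set is routed to.
# Returns 2 when the amtool binary is missing.
route_for() {
    amtool_bin=$1
    cfg=$2
    shift 2
    if [ ! -x "$amtool_bin" ]; then
        echo "amtool not found at $amtool_bin" >&2
        return 2
    fi
    "$amtool_bin" config routes test --config.file="$cfg" "$@"
}

route_for "$AM_DIR/amtool" "$AM_DIR/alertmanager.yml" \
    alertname=NodeStatsAlert instance=localhost:9100 \
    || echo "route test could not be run" >&2
```

With the single-receiver configuration shown earlier, every label set is expected to resolve to `ops_notify`; the check becomes more useful once several receivers exist.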
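3. Testing alerts
blackbox_exporter probes the node_exporter port. The alert rule fires once the target has been unreachable for longer than the `for` duration, which is 5m in the rule below (set `for: 1m` for a one-minute threshold). The rule is as follows: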
rules:
# node_exporter status
- alert: NodeExporterDown
  expr: up{job="node_status"} != 1
  for: 5m
  labels:
    severity: warning
    level: 2
  annotations:
    summary: "Probe of port 9100 failed"
    description: "Probe of port 9100 on server {{ $labels.node_name }} failed; check as soon as possible whether node_exporter is malfunctioning!"
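The excerpt above starts at the `rules:` key; in a rule file loaded by Prometheus, that key sits inside an entry of a top-level `groups:` list. As a sketch, the following writes a complete rule file to a temporary directory and validates it with `promtool` when that tool is on the PATH. The file name and the group name are hypothetical.

```shell
# RULE_FILE is a throwaway path; the file name is hypothetical.
RULE_FILE="$(mktemp -d)/node_status.rules.yml"

# Quoted "EOF" keeps the shell from expanding $labels in the template.
cat > "$RULE_FILE" <<"EOF"
groups:
  - name: node_status            # hypothetical group name
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_status"} != 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Probe of port 9100 failed on {{ $labels.instance }}"
EOF

# promtool ships with the Prometheus server tarball; skip the check if absent.
if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$RULE_FILE"
else
    echo "promtool not on PATH; wrote $RULE_FILE without validating it"
fi
```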
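To test, kill the node_exporter process. Then open the Alerts page of the Prometheus server web UI and check whether the alert reaches the Firing state. Once it does, open the Alerts page of the Alertmanager web UI and check whether the alert has arrived. If it has, Alertmanager is receiving alerts correctly; if no notification message is delivered after that, the problem lies between Alertmanager and the notification component. See the next chapter.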
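To narrow down where a missing notification gets lost, an alert can also be pushed to Alertmanager directly, bypassing Prometheus. The sketch below posts one alert to the v2 API with `curl`; it assumes the default web address, and the alert name, labels, and helper name `build_test_alert` are made up for illustration.

```shell
# AM_URL is an assumption: Alertmanager's default web listen address.
AM_URL="${AM_URL:-http://localhost:9093}"

# build_test_alert ALERTNAME INSTANCE
# Prints a one-element JSON array in the shape accepted by POST /api/v2/alerts.
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

payload=$(build_test_alert "ManualTest" "localhost:9100")
echo "$payload"

# Send the alert; report on stderr if Alertmanager cannot be reached.
curl -s --max-time 3 -X POST -H 'Content-Type: application/json' \
    -d "$payload" "$AM_URL/api/v2/alerts" \
    || echo "could not reach $AM_URL" >&2
```

If this alert shows up in the Alertmanager web UI but nothing reaches the webhook at `http://localhost:5000`, the receiver side is the place to look. Because no end time is set, the test alert should resolve on its own after `resolve_timeout`.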
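[Copyright notice] This article is original content by a Huawei Cloud community user. When reposting, you must state the source (Huawei Cloud community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence; once verified, the community will immediately remove the infringing content. Report email: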
cloudbbs@huaweicloud.com