1. AlertManager Deployment
1.1 Preface
We built our monitoring system with prometheus-operator, which defines AlertManager as a CRD resource.
1.2 Modifying the AlertManager Configuration
prometheus-operator mounts the configuration into AlertManager through a Secret resource object, so we need to modify that Secret.
alertmanager-secret.yaml
apiVersion: v1
data: {}
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "templates":
    - "/etc/alertmanager/config/wechat.tmpl"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "Default"
      "wechat_configs":
      - "send_resolved": true
        "agent_id": "1000002"
        "corp_id": "wwf********1dd3fde"
        "api_secret": "dF8qn****************6M8Ng"
        "to_user": "@all"
    - "name": "Watchdog"
    - "name": "Critical"
      "wechat_configs":
      - "send_resolved": true
        "agent_id": "1000002"
        "corp_id": "wwf********1dd3fde"
        "api_secret": "dF8qn****************6M8Ng"
        "to_user": "@all"
    "route":
      "group_by":
      - "namespace"
      - "alertname"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "match":
          "alertname": "Watchdog"
        "receiver": "Watchdog"
      - "match":
          "severity": "critical"
        "receiver": "Critical"
  wechat.tmpl: |-
    # alert message template (see section 2.1)
type: Opaque
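After editing, applying the Secret is enough; the config-reloader sidecar in the alertmanager pod picks up the new configuration without a restart. Assuming the manifest above is saved as alertmanager-secret.yaml:

kubectl apply -f alertmanager-secret.yaml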
2. AlertManager Configuration
global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
route:
  [ receiver: <string> ]
  [ group_by: '[' <labelname>, ... ']' ]
  [ continue: <boolean> | default = false ] # whether to keep evaluating sibling routes after this route matches (see the sketch after this block)
  match:
    [ <labelname>: <labelvalue>, ... ] # match alert labels exactly
  match_re:
    [ <labelname>: <regex>, ... ] # match alert labels with regular expressions
  [ group_wait: <duration> | default = 30s ] # alerts arriving for the same group within this window are sent together
  [ group_interval: <duration> | default = 5m ] # once a notification has been sent for a group, how long to wait before notifying about alerts newly added to that group (default 5m); only tune this when you are sure the alerts in a group affect the same service, since a poor grouping can delay alerts
  [ repeat_interval: <duration> | default = 4h ] # how long to wait before re-sending a notification that was already sent
  routes:
    [ - <route> ... ]
receivers:
- name: <string>
  webhook_configs:
  wechat_configs:
templates:
- '/etc/alertmanager/config/*.tmpl'
inhibit_rules: # inhibition keeps users from receiving a flood of alerts that are all consequences of one alert that is already firing
- source_match:
    [ <labelname>: <labelvalue>, ... ]
  source_match_re:
    [ <labelname>: <regex>, ... ]
  target_match:
    [ <labelname>: <labelvalue>, ... ]
  target_match_re:
    [ <labelname>: <regex>, ... ]
  equal:
  - namespace
  - alertname
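To make continue and match_re concrete, here is a minimal sketch of a route tree; the service label values and the database-oncall receiver are hypothetical, used for illustration only:

route:
  receiver: Default
  routes:
  - match_re:
      service: mysql|postgres   # hypothetical label values
    receiver: database-oncall   # hypothetical receiver
    continue: true              # keep evaluating the sibling route below as well
  - match:
      severity: critical
    receiver: Critical

With continue: true, a critical database alert is delivered to both database-oncall and Critical; with the default continue: false, matching stops at the first route.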
2.1 Alert Template
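The original template body is not reproduced here; the following is a minimal sketch of what wechat.tmpl could look like. The template name wechat.message is an assumption, and the fields come from the standard Alertmanager template data model (.Alerts, .Status, .Labels, .Annotations), not from the author's actual template:

{{ define "wechat.message" }}
{{- range .Alerts }}
[{{ .Status }}] {{ .Labels.alertname }}
severity: {{ .Labels.severity }}
summary: {{ .Annotations.summary }}
description: {{ .Annotations.description }}
starts: {{ .StartsAt }}
{{- end }}
{{ end }}

To use such a template, each wechat_configs entry would reference it with message: '{{ template "wechat.message" . }}'.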
2.2 Inhibition Example
inhibit_rules:
- source_match:
    severity: critical
  target_match_re:
    severity: warning|info
  equal:
  - namespace
  - alertname
Explanation: when an alert with severity: critical is received, alerts at severity: warning|info that carry the same namespace and alertname values are inhibited.
3. Prometheus Rules
- name: addRules
  rules:
  - alert: MemoryThrottlingHigh
    annotations:
      description: 'namespace {{ $labels.namespace }}, container {{ $labels.container }} in pod {{ $labels.pod }}: memory usage is above 80% (current value: {{ $value }})'
      summary: Memory usage is too high
    expr: |
      sum(node_namespace_pod_container:container_memory_working_set_bytes{container!=""}) by (container, pod, namespace) / sum(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace) * 100 > 80
    for: 2m
    labels:
      severity: info
alert: the name of the alerting rule
expr: the PromQL query used to evaluate the alerting rule
for: the pending duration; the alert is only sent once the trigger condition has held for this long, and alerts raised during the wait stay in the pending state
labels: custom labels; lets the user attach an extra list of labels to the alert
annotations: a second set of labels that are not part of the alert's identity; they typically carry extra information used when displaying the alert
The for field also affects how quickly an alert reaches us: the alert only fires after the condition has persisted for the given time, and newly raised alerts stay pending in the meantime. Its main purpose is noise reduction, since many metrics, such as response times, naturally jitter.
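Because the cluster runs prometheus-operator, a rule group like the one above is normally delivered as a PrometheusRule CRD rather than a raw rule file. A minimal sketch, assuming the kube-prometheus defaults: the resource name is hypothetical, and the labels must match the ruleSelector of your Prometheus resource (prometheus: k8s, role: alert-rules by default):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: add-rules            # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s          # must match the Prometheus ruleSelector
    role: alert-rules
spec:
  groups:
  - name: addRules
    rules:
    - alert: MemoryThrottlingHigh
      annotations:
        summary: Memory usage is too high
      expr: |
        sum(node_namespace_pod_container:container_memory_working_set_bytes{container!=""}) by (container, pod, namespace) / sum(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace) * 100 > 80
      for: 2m
      labels:
        severity: info

Once applied, the operator regenerates the rule files and Prometheus reloads them automatically.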