Monitoring – 人生其实如草

Setting up Prometheus

Run the following commands to download Prometheus, unpack it, and create its data directory:

mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xvf prometheus-2.35.0.linux-amd64.tar.gz && mv prometheus-2.35.0.linux-amd64 prometheus
mkdir -p /opt/monitor/prometheus/data

Start Prometheus with supervisor, using the program configuration below. To install supervisor, see the post "supervisor安装" (Installing supervisor):

[program:prometheus]
process_name=%(program_name)s
command=/opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml --storage.tsdb.path=/opt/monitor/prometheus/data --storage.tsdb.retention.time=60d --log.level=info --web.listen-address="192.168.19.69:9090"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/prometheus.log

The contents of "/opt/monitor/prometheus/prometheus.yml" are as follows:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]
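
Before starting the service, it can help to validate the file. The release tarball ships a promtool binary next to prometheus, and its "check config" subcommand parses the configuration and reports errors. A minimal sketch, assuming the install path used above; the function name check_prometheus_config and the PROM_DIR variable are made up for this example:

```shell
#!/usr/bin/env bash
# PROM_DIR is an assumption: the directory the tarball was unpacked into above.
PROM_DIR="${PROM_DIR:-/opt/monitor/prometheus}"

# Hypothetical helper: validate prometheus.yml with the bundled promtool.
check_prometheus_config() {
  local promtool="$PROM_DIR/promtool"
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 1
  fi
  "$promtool" check config "$PROM_DIR/prometheus.yml"
}

# Usage: check_prometheus_config && echo "config OK"
```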

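Run "supervisorctl update" to load the newly added configuration, then run "supervisorctl status" to check that Prometheus started successfully.
Then open "http://192.168.19.69:9090/" in a browser to confirm that the web UI is reachable.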
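
The same check can be done from the command line. Prometheus exposes a "/-/healthy" endpoint that answers with HTTP 200 once the server is up. The sketch below polls a URL a few times with curl; the helper name wait_for_healthy and its retry defaults are assumptions made for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health endpoint until it answers or attempts run out.
# $1 = full URL, $2 = number of attempts (default 5), one second apart.
wait_for_healthy() {
  local url="$1" tries="${2:-5}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors such as 503.
    if curl -fsS --max-time 3 -o /dev/null "$url" 2>/dev/null; then
      echo "healthy: $url"
      return 0
    fi
    if [ "$i" -lt "$tries" ]; then sleep 1; fi
    i=$((i + 1))
  done
  echo "not healthy after $tries attempts: $url" >&2
  return 1
}

# Usage: wait_for_healthy "http://192.168.19.69:9090/-/healthy"
```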

Setting up node_exporter

Download and unpack node_exporter with the following commands:

mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvf node_exporter-1.3.1.linux-amd64.tar.gz && mv node_exporter-1.3.1.linux-amd64 node_exporter

Start node_exporter with supervisor, using the program configuration below. To install supervisor, see the post "supervisor安装" (Installing supervisor):

[program:node_exporter]
process_name=%(program_name)s
command=/opt/monitor/node_exporter/node_exporter  --web.listen-address="192.168.19.69:9111"  --web.config="/opt/monitor/node_exporter/config.yaml"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/node_exporter.log

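The contents of "/opt/monitor/node_exporter/config.yaml" are shown below. The password is generated with htpasswd: the value stored under basic_auth_users must be the bcrypt hash that htpasswd prints, not the plain-text password. A username and password are optional; to run without them, drop the --web.config option. For how to generate the password, see "https://www.cnblogs.com/xjzyy/p/15602929.html".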
```
basic_auth_users:
  prometheus: your_password # replace with the bcrypt hash printed by htpasswd, not the plain-text password
```
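
One way to produce that hash is sketched here, assuming the htpasswd tool (from the apache2-utils or httpd-tools package) is installed. The helper name bcrypt_hash is made up for this example. Note that passing the password on the command line with -b makes it visible in the process list, so on a shared machine the interactive form without -b may be preferable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the bcrypt hash of a password using htpasswd.
# -n prints to stdout, -b takes the password from the command line,
# -B selects bcrypt, -C 10 sets the bcrypt cost.
# The empty user name yields ":<hash>", so tr strips the colon and newlines.
bcrypt_hash() {
  htpasswd -nbBC 10 "" "$1" | tr -d ':\n'
}

# Usage: bcrypt_hash 'your_password'
# Paste the printed value after "prometheus:" in config.yaml.
```
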
Run "supervisorctl update" to load the newly added configuration, then run "supervisorctl status" to check that node_exporter started successfully.
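
To confirm that basic authentication works, the metrics endpoint can be requested with curl. This is a sketch only; the helper name metrics_status is an assumption, and the password passed to it is the plain-text one, not the hash:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code returned by /metrics.
# $1 = base URL, $2 = username, $3 = plain-text password.
metrics_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 -u "$2:$3" "$1/metrics"
}

# Usage: metrics_status "http://192.168.19.69:9111" prometheus 'your_password'
# 200 means the credentials were accepted; 401 means they were rejected.
```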

Edit "/opt/monitor/prometheus/prometheus.yml" so that it reads as follows, adding a scrape job for node_exporter:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]

  - job_name: '69_node_exporter'
    scrape_interval: 5s
    scheme: http
    basic_auth:
      username: prometheus
      password: your_password # the plain-text password configured for node_exporter
    static_configs:
    - targets: ['192.168.19.69:9111']
      labels:
        instance: 192.168.19.69

Run "supervisorctl restart prometheus" to restart Prometheus so that the new configuration takes effect.
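
After the restart, the built-in "up" metric shows whether the new target is being scraped. A minimal sketch using the Prometheus HTTP API endpoint "/api/v1/query"; the helper name prom_query is made up for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an instant PromQL query through the HTTP API.
# $1 = Prometheus base URL, $2 = PromQL expression.
# -G sends the URL-encoded data as a query string in a GET request.
prom_query() {
  curl -sG --max-time 5 --data-urlencode "query=$2" "$1/api/v1/query"
}

# Usage: prom_query "http://192.168.19.69:9090" 'up{job="69_node_exporter"}'
# In the JSON response, a sample value of "1" means the last scrape succeeded.
```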

Installing Grafana

Install Grafana with the following commands:

mkdir -p /opt/monitor && cd /opt/monitor
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.0.linux-amd64.tar.gz
tar xvf grafana-enterprise-8.5.0.linux-amd64.tar.gz && mv grafana-8.5.0/ grafana
cd /opt/monitor/grafana/conf && cp sample.ini grafana.ini

Start Grafana with supervisor, using the program configuration below. To install supervisor, see the post "supervisor安装" (Installing supervisor):

[program:grafana]
process_name=%(program_name)s
directory=/opt/monitor/grafana/bin
command=/opt/monitor/grafana/bin/grafana-server -config /opt/monitor/grafana/conf/grafana.ini
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/grafana.log

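Run "supervisorctl update" to load the newly added configuration, then run "supervisorctl status" to check that Grafana started successfully.
Log in at "http://192.168.19.69:3000/login"; the default username and password are both admin. Then open the settings and add a data source: choose Prometheus as the type and set the URL to "http://192.168.19.69:9090".
Click "+", then "Import", enter dashboard ID "1860" (Node Exporter Full), select the Prometheus data source, and import.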
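
The data source can also be created without the web UI, through Grafana's HTTP API endpoint "/api/datasources". The sketch below is an illustration rather than part of the original setup; both helper names are assumptions, and the credentials are the default admin account mentioned above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for Grafana's data source API.
# $1 = Prometheus URL.
datasource_payload() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

# Hypothetical helper: create the data source.
# $1 = Grafana base URL, $2 = user:password, $3 = Prometheus URL.
add_prometheus_datasource() {
  curl -s --max-time 5 -u "$2" -H 'Content-Type: application/json' \
    -X POST -d "$(datasource_payload "$3")" "$1/api/datasources"
}

# Usage:
# add_prometheus_datasource "http://192.168.19.69:3000" admin:admin "http://192.168.19.69:9090"
```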

Installing Alertmanager

Install Alertmanager with the following commands:

mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xvf alertmanager-0.24.0.linux-amd64.tar.gz && mv alertmanager-0.24.0.linux-amd64/ alertmanager
mkdir -p /opt/monitor/alertmanager/template

Start Alertmanager with supervisor, using the program configuration below. To install supervisor, see the post "supervisor安装" (Installing supervisor):

[program:alertmanager]
process_name=%(program_name)s
command=/opt/monitor/alertmanager/alertmanager --config.file="/opt/monitor/alertmanager/alertmanager.yml"   --web.listen-address="192.168.19.69:9993"  --cluster.listen-address="192.168.19.69:9994"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/alertmanager.log

The contents of "/opt/monitor/alertmanager/alertmanager.yml" are as follows:

global:
  resolve_timeout: 5m

templates:
  - '/opt/monitor/alertmanager/template/test.tmpl'

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 2m
  receiver: 'wechat'

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'wechat'  # WeCom (WeChat Work) alert settings; you need to register an application first
  wechat_configs:
    - send_resolved: true
      agent_id: 'your_agent'
      to_user: 'your_name'
      api_secret: 'your_api_secret'
      corp_id: 'your_corp_id'

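The contents of "/opt/monitor/alertmanager/template/test.tmpl" are as follows; this template defines the format of the alert message. The 28800e9 added to each timestamp is 8 hours expressed in nanoseconds, which shifts the UTC time to UTC+8.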

{{ define "wechat.default.message" }}
{{ range $i, $alert :=.Alerts }}
=======  Monitoring alert  =========
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Summary: {{ $alert.Annotations.summary }}
Host: {{ $alert.Labels.instance }}
Details: {{ $alert.Annotations.description }}
Trigger value: {{ $alert.Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
==========  end  ==========
{{ end }} 
{{ end }}
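
Both files can be checked before the service is started. The Alertmanager tarball includes amtool, whose "check-config" subcommand validates the configuration file and the templates it references. A minimal sketch, assuming the install path used above; the function name check_alertmanager_config and the AM_DIR variable are made up for this example:

```shell
#!/usr/bin/env bash
# AM_DIR is an assumption: the directory the tarball was unpacked into above.
AM_DIR="${AM_DIR:-/opt/monitor/alertmanager}"

# Hypothetical helper: validate alertmanager.yml with the bundled amtool.
check_alertmanager_config() {
  local amtool="$AM_DIR/amtool"
  if [ ! -x "$amtool" ]; then
    echo "amtool not found at $amtool" >&2
    return 1
  fi
  "$amtool" check-config "$AM_DIR/alertmanager.yml"
}

# Usage: check_alertmanager_config && echo "config OK"
```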

Run "supervisorctl update" to load the newly added configuration, then run "supervisorctl status" to check that Alertmanager started successfully.
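
To test the notification path without waiting for a real alert, a hand-made alert can be posted to Alertmanager's v2 API at "/api/v2/alerts". This is only a sketch; the helper names and the label and annotation values are invented for the test:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a one-alert JSON array for the v2 API.
# $1 = alertname, $2 = instance label.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"error","instance":"%s"},"annotations":{"summary":"test alert","description":"sent manually with curl"}}]\n' "$1" "$2"
}

# Hypothetical helper: post the test alert.
# $1 = Alertmanager base URL, $2 = alertname, $3 = instance label.
send_test_alert() {
  curl -s --max-time 5 -H 'Content-Type: application/json' \
    -X POST -d "$(test_alert_payload "$2" "$3")" "$1/api/v2/alerts"
}

# Usage: send_test_alert "http://192.168.19.69:9993" TestAlert 192.168.19.69
```

If the receiver is configured correctly, a message in the template's format should arrive shortly after the group_wait period. Because the alert carries no end time, Alertmanager treats it as resolved once resolve_timeout has passed without a resend.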

Edit "/opt/monitor/prometheus/prometheus.yml" and add the Alertmanager target and the rule files, as shown below:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.19.69:9993']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]

  - job_name: '69_node_exporter'
    scrape_interval: 5s
    scheme: http
    basic_auth:
      username: prometheus
      password: your_password # the plain-text password configured for node_exporter
    static_configs:
    - targets: ['192.168.19.69:9111']
      labels:
        instance: 192.168.19.69

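Run "mkdir -p /opt/monitor/prometheus/rules" and create three files in that directory: "dist.yml", "mem.yml", and "unreachable.yml". The file names are arbitrary; the files define the alerting rules. Their contents are as follows: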

Contents of dist.yml:

groups:
- name: root_dist_error
  rules:
  - alert: "disk alert"
    expr: 100 - (node_filesystem_avail_bytes{device="rootfs",fstype="rootfs",mountpoint="/"} / node_filesystem_size_bytes{device="rootfs",fstype="rootfs",mountpoint="/"}) * 100 > 80
    for: 60s
    labels:
      severity: error 
      team: testteam
    annotations:
      summary: "root disk used is large"
      description: "root filesystem usage is above 80%"
      value: "{{ humanize $value }}%"

Contents of mem.yml:

groups:
  - name: error_mem
    rules:
    - alert: "memory error"
      expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 85
      for: 20s
      labels:
        severity: error 
        team: testteam
      annotations:
        summary: "Memory Usage is busy"
        description: "memory usage is larger than 85%"
        value: "{{ humanize $value }}%"

Contents of unreachable.yml:

groups: 
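  - name: InstanceDown # name of this group of related alerts, here for instance availability
    rules:
    - alert: InstanceDown
      expr: up == 0 # every scraped target gets an up metric set by Prometheus: 0 means the scrape failed, 1 means the target is alive
      for: 20m # how long the condition must hold: the alert fires only if up == 0 persists for 20 minutes
      labels: # alert labels
        severity: error # severity level is error
      annotations: # annotation fields
        summary: "Instance {{ $labels.instance }} is down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 minutes."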
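
A syntax error in any rule file prevents Prometheus from loading the configuration, so it is worth checking the files first. promtool's "check rules" subcommand does this. A minimal sketch, assuming the paths used above; the function name check_rules and the PROM_DIR variable are made up for this example:

```shell
#!/usr/bin/env bash
# PROM_DIR is an assumption: the directory the Prometheus tarball was unpacked into.
PROM_DIR="${PROM_DIR:-/opt/monitor/prometheus}"

# Hypothetical helper: run "promtool check rules" on every rule file
# and return non-zero if any file fails the check.
check_rules() {
  local f status=0
  for f in "$PROM_DIR"/rules/*.yml; do
    [ -e "$f" ] || continue   # the glob matched nothing
    "$PROM_DIR/promtool" check rules "$f" || status=1
  done
  return "$status"
}

# Usage: check_rules && echo "all rule files OK"
```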

Then run "supervisorctl restart prometheus" to restart Prometheus so that the changes take effect.
