Setting up a Prometheus monitoring stack

Prometheus setup

  1. Run the following commands to download and unpack Prometheus and create its data directory:
    mkdir -p /opt/monitor && cd /opt/monitor
    wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
    tar xvf prometheus-2.35.0.linux-amd64.tar.gz && mv prometheus-2.35.0.linux-amd64 prometheus
    mkdir -p /opt/monitor/prometheus/data
    
  2. Start Prometheus under supervisor using the program configuration below. For installing supervisor itself, see the "supervisor installation" article.
    [program:prometheus]
    process_name=%(program_name)s
    command=/opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml --storage.tsdb.path=/opt/monitor/prometheus/data --storage.tsdb.retention.time=60d --log.level=info --web.listen-address="192.168.19.69:9090"
    autostart=true
    autorestart=true
    user=root
    redirect_stderr=true
    stdout_logfile=/var/log/supervisor/prometheus.log
    
  3. The file "/opt/monitor/prometheus/prometheus.yml" referenced above has the following contents:
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["192.168.19.69:9090"]
    
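  4. Run "supervisorctl update" to load the newly added program configuration, then run "supervisorctl status" to check that Prometheus started successfully.
  5. Open "http://192.168.19.69:9090/" in a browser to confirm that the web UI is reachable.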
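
The browser check in step 5 can also be scripted. The following is a minimal sketch, assuming curl is installed; it probes the /-/healthy and /-/ready endpoints that Prometheus serves. The function name check_prometheus and the PROM_URL variable are illustrative, not part of the original setup.

```shell
# Base URL of the Prometheus server; this default mirrors the
# --web.listen-address used in the supervisor configuration above.
PROM_URL="${PROM_URL:-http://192.168.19.69:9090}"

# Probe the health and readiness endpoints of a Prometheus server.
# $1: base URL, e.g. http://192.168.19.69:9090
check_prometheus() {
    for endpoint in /-/healthy /-/ready; do
        # -f: exit non-zero on HTTP errors; -s: no progress output.
        if curl -fs --max-time 5 "${1}${endpoint}" >/dev/null; then
            echo "OK   ${endpoint}"
        else
            echo "FAIL ${endpoint}"
            return 1
        fi
    done
}

# Usage (performs real HTTP requests):
#   check_prometheus "$PROM_URL"
```

A non-zero exit status means one of the endpoints did not answer with a success code, which is convenient in a deployment script.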

node_exporter setup

  1. Download and unpack node_exporter with the following commands:
    mkdir -p /opt/monitor && cd /opt/monitor
    wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
    tar xvf node_exporter-1.3.1.linux-amd64.tar.gz && mv node_exporter-1.3.1.linux-amd64 node_exporter
    
  2. Start node_exporter under supervisor using the program configuration below. For installing supervisor itself, see the "supervisor installation" article.
    [program:node_exporter]
    process_name=%(program_name)s
    command=/opt/monitor/node_exporter/node_exporter  --web.listen-address="192.168.19.69:9111"  --web.config="/opt/monitor/node_exporter/config.yaml"
    autostart=true
    autorestart=true
    user=root
    redirect_stderr=true
    stdout_logfile=/var/log/supervisor/node_exporter.log
    
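  3. The file "/opt/monitor/node_exporter/config.yaml" referenced above has the following contents. Here the password was generated with htpasswd, so the value stored in this file must be the bcrypt hash, not the plaintext password. Authentication is optional and the user name and password can be left out entirely. For generating the password, see "https://www.cnblogs.com/xjzyy/p/15602929.html".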
    basic_auth_users:
      prometheus: your_password
    
  4. Run "supervisorctl update" to load the newly added program configuration, then run "supervisorctl status" to check that node_exporter started successfully.
  5. Edit "/opt/monitor/prometheus/prometheus.yml" as follows to add a scrape job for node_exporter:
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["192.168.19.69:9090"]
    
      - job_name: '69_node_exporter'
        scrape_interval: 5s
        scheme: http
        basic_auth:
          username: prometheus
          password: your_password # the plaintext password that was configured for node_exporter
        static_configs:
        - targets: ['192.168.19.69:9111']
          labels:
            instance: 192.168.19.69
    
  6. Run "supervisorctl restart prometheus" to restart Prometheus so the new configuration takes effect.
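
The basic-auth part of the steps above can be sketched in shell. This is an illustration under assumptions, not the procedure from the linked article: it assumes the Apache htpasswd tool and curl are installed, and the function names (bcrypt_hash, write_web_config, check_exporter_auth) are made up for this example.

```shell
# Print a bcrypt hash of the password given in $1.
# htpasswd flags: -n print to stdout, -b take the password from the
# argument list, -B use bcrypt, -C 10 set the bcrypt cost.
# The user name is left empty, so htpasswd prints ":<hash>"; tr strips
# the leading colon and the trailing newlines.
bcrypt_hash() {
    htpasswd -nbBC 10 "" "$1" | tr -d ':\n'
}

# Write a web config file with one user and its bcrypt hash.
# $1: user name, $2: bcrypt hash, $3: target file
write_web_config() {
    user="$1"; hash="$2"; file="$3"
    printf 'basic_auth_users:\n  %s: %s\n' "$user" "$hash" > "$file"
}

# Succeed only if /metrics rejects anonymous requests and accepts
# the given credentials.
# $1: exporter base URL, $2: credentials in the form user:password
check_exporter_auth() {
    url="$1"; creds="$2"
    if curl -fs --max-time 5 "$url/metrics" >/dev/null; then
        echo "WARNING: metrics are readable without credentials"
        return 1
    fi
    curl -fs --max-time 5 -u "$creds" "$url/metrics" >/dev/null
}

# Usage (paths and credentials are placeholders):
#   write_web_config prometheus "$(bcrypt_hash 'your_password')" \
#       /opt/monitor/node_exporter/config.yaml
#   check_exporter_auth http://192.168.19.69:9111 prometheus:your_password
```

Note that passing the password as a command-line argument exposes it to other users through the process list for a moment; on a shared host, running htpasswd interactively is the safer choice.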

Grafana installation

  1. Install Grafana with the following commands:
    mkdir -p /opt/monitor && cd /opt/monitor
    wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.0.linux-amd64.tar.gz
    tar xvf grafana-enterprise-8.5.0.linux-amd64.tar.gz && mv grafana-8.5.0/ grafana
    cd /opt/monitor/grafana/conf && cp sample.ini grafana.ini
    
  2. Start Grafana under supervisor using the program configuration below. For installing supervisor itself, see the "supervisor installation" article.
    [program:grafana]
    process_name=%(program_name)s
    directory=/opt/monitor/grafana/bin
    command=/opt/monitor/grafana/bin/grafana-server -config /opt/monitor/grafana/conf/grafana.ini
    autostart=true
    autorestart=true
    user=root
    redirect_stderr=true
    stdout_logfile=/var/log/supervisor/grafana.log
    
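  3. Run "supervisorctl update" to load the newly added program configuration, then run "supervisorctl status" to check that Grafana started successfully.
  4. Log in at "http://192.168.19.69:3000/login". The default user name and password are both admin. Then open the settings and add a data source: choose Prometheus as the type and use "http://192.168.19.69:9090" as the URL.
  5. Click "+", then "Import", enter dashboard ID "1860", select the "prometheus" data source, and import the dashboard.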
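
As an alternative to clicking through the UI in step 4, the data source can be created through Grafana's HTTP API (POST /api/datasources). The sketch below assumes curl is available and that the default admin credentials have not been changed yet; the function names and the data source name "Prometheus" are illustrative choices.

```shell
# Build the JSON body for a Prometheus data source.
# $1: URL of the Prometheus server (assumed to contain no characters
#     that need JSON escaping)
datasource_payload() {
    printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}' "$1"
}

# Create the data source through the Grafana HTTP API.
# $1: Grafana base URL, $2: Prometheus URL, $3: credentials (user:password)
add_prometheus_datasource() {
    grafana_url="$1"; prom_url="$2"; creds="$3"
    curl -fsS -u "$creds" \
        -H 'Content-Type: application/json' \
        -X POST "${grafana_url}/api/datasources" \
        -d "$(datasource_payload "$prom_url")"
}

# Usage (performs a real HTTP request):
#   add_prometheus_datasource http://192.168.19.69:3000 \
#       http://192.168.19.69:9090 admin:admin
```

Importing dashboard 1860 still has to be done in the UI as described in step 5; this sketch covers only the data source.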

Alertmanager installation

  1. Install Alertmanager with the following commands:
    mkdir -p /opt/monitor && cd /opt/monitor
    wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
    tar xvf alertmanager-0.24.0.linux-amd64.tar.gz && mv alertmanager-0.24.0.linux-amd64/ alertmanager
    mkdir -p /opt/monitor/alertmanager/template
    
  2. Start Alertmanager under supervisor using the program configuration below. For installing supervisor itself, see the "supervisor installation" article.
    [program:alertmanager]
    process_name=%(program_name)s
    command=/opt/monitor/alertmanager/alertmanager --config.file="/opt/monitor/alertmanager/alertmanager.yml"   --web.listen-address="192.168.19.69:9993"  --cluster.listen-address="192.168.19.69:9994"
    autostart=true
    autorestart=true
    user=root
    redirect_stderr=true
    stdout_logfile=/var/log/supervisor/alertmanager.log
    
  3. The file "/opt/monitor/alertmanager/alertmanager.yml" referenced above has the following contents:

    global:
      resolve_timeout: 5m
    
    templates:
      - '/opt/monitor/alertmanager/template/test.tmpl'
    
    route:
      group_by: ['alertname']
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 2m
      receiver: 'wechat'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: 'wechat'  # WeCom (WeChat Work) alert settings; you must first register an application yourself
      wechat_configs:
        - send_resolved: true
          agent_id: 'your_agent'
          to_user: 'your_name'
          api_secret: 'your_api_secret'
          corp_id: 'your_corp_id'
    
  4. The file "/opt/monitor/alertmanager/template/test.tmpl" referenced above has the following contents; it defines the format of the alert message:
    {{ define "wechat.default.message" }}
    {{ range $i, $alert :=.Alerts }}
    =======  Monitoring alert  =========
    Status: {{ .Status }}
    Severity: {{ $alert.Labels.severity }}
    Alert name: {{ $alert.Labels.alertname }}
    Application: {{ $alert.Annotations.summary }}
    Host: {{ $alert.Labels.instance }}
    Details: {{ $alert.Annotations.description }}
    Trigger value: {{ $alert.Annotations.value }}
    Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    ==========  end  ==========
    {{ end }}
    {{ end }}
    
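  5. Run "supervisorctl update" to load the newly added program configuration, then run "supervisorctl status" to check that Alertmanager started successfully.
  6. Edit "/opt/monitor/prometheus/prometheus.yml" to add the Alertmanager target and the rule files, as shown below: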
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['192.168.19.69:9993']
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "rules/*.yml"
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["192.168.19.69:9090"]
    
      - job_name: '69_node_exporter'
        scrape_interval: 5s
        scheme: http
        basic_auth:
          username: prometheus
          password: your_password # the plaintext password that was configured for node_exporter
        static_configs:
        - targets: ['192.168.19.69:9111']
          labels:
            instance: 192.168.19.69
    
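  7. Run "mkdir -p /opt/monitor/prometheus/rules", then create three files in that directory: dist.yml, mem.yml and unreachable.yml. The file names are arbitrary; the files define the alerting rules. Their contents are as follows:
    • Contents of dist.yml: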
      groups:
      - name: root_dist_error
        rules:
        - alert: "Disk alert"
          expr: 100 - (node_filesystem_avail_bytes{device="rootfs",fstype="rootfs",mountpoint="/"} / node_filesystem_size_bytes{device="rootfs",fstype="rootfs",mountpoint="/"}) * 100 > 80
          for: 60s
          labels:
            severity: error 
            team: testteam
          annotations:
            summary: "root disk used is large"
            description: "Root filesystem usage is above 80%"
            value: "{{ humanize $value }}%"
      
    • Contents of mem.yml:
      groups:
        - name: error_mem
          rules:
          - alert: "memory error"
            expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 85
            for: 20s
            labels:
              severity: error 
              team: testteam
            annotations:
              summary: "Memory Usage is busy"
              description: "Memory usage is above 85%"
              value: "{{ humanize $value }}%"
      
    • Contents of unreachable.yml:
      groups: 
        - name: InstanceDown # group name; a group holds alerts of the same kind
          rules:
          - alert: InstanceDown
            expr: up == 0 # every scrape target gets an `up` series: 0 means the scrape failed, 1 means the target is up
            for: 20m # how long the condition must hold: the alert fires only if up == 0 for 20 minutes straight
            labels: # labels attached to the alert
              severity: error # severity level: error
            annotations: # descriptive annotations
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 minutes."
      
  8. Then run "supervisorctl restart prometheus" to restart Prometheus so the changes take effect.
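
Before restarting in step 8, it can help to validate the edited files so that a typo does not take the server down. The sketch below is one possible way to do that, assuming the directory layout used throughout this guide and the promtool and amtool binaries that ship in the Prometheus and Alertmanager tarballs. The function name validate_and_restart and the MONITOR_DIR variable are illustrative.

```shell
# Base directory that holds the prometheus/ and alertmanager/ folders.
MONITOR_DIR="${MONITOR_DIR:-/opt/monitor}"

# Validate the Alertmanager and Prometheus configuration, then restart
# Prometheus only if both checks pass.
# $1: base directory, e.g. /opt/monitor
validate_and_restart() {
    base="$1"
    # amtool check-config parses alertmanager.yml and its templates.
    "$base/alertmanager/amtool" check-config \
        "$base/alertmanager/alertmanager.yml" || return 1
    # promtool check config parses prometheus.yml and also loads the
    # rule files matched by its rule_files entries.
    "$base/prometheus/promtool" check config \
        "$base/prometheus/prometheus.yml" || return 1
    supervisorctl restart prometheus
}

# Usage:
#   validate_and_restart "$MONITOR_DIR"
```

If either check fails, the function returns a non-zero status and leaves the running Prometheus process untouched.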