Prometheus setup
- Run the following commands to download Prometheus, unpack it, and create the data directory:
mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar xvf prometheus-2.35.0.linux-amd64.tar.gz && mv prometheus-2.35.0.linux-amd64 prometheus
mkdir -p /opt/monitor/prometheus/data
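- (Optional) As a quick sanity check that the tarball was unpacked correctly, both binaries should print their versions:
/opt/monitor/prometheus/prometheus --version
/opt/monitor/prometheus/promtool --version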
- Start Prometheus with supervisor. The supervisor program configuration is shown below (see the "supervisor 安装" notes for installing supervisor); a note on where to save this snippet follows it:
[program:prometheus]
process_name=%(program_name)s
command=/opt/monitor/prometheus/prometheus --config.file=/opt/monitor/prometheus/prometheus.yml --storage.tsdb.path=/opt/monitor/prometheus/data --storage.tsdb.retention=60d --log.level=info --web.listen-address="192.168.19.69:9090"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/prometheus.log
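- Where to save this [program:prometheus] section depends on how supervisor was installed. As a sketch, assuming supervisord's [include] section picks up /etc/supervisor/conf.d/*.conf (a common Debian/Ubuntu default; adjust the path to match your own install):
# hypothetical path - check the [include] section of your supervisord.conf first
vim /etc/supervisor/conf.d/prometheus.conf   # paste the [program:prometheus] section above into this file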
- The contents of "/opt/monitor/prometheus/prometheus.yml" are as follows:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]
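- (Optional) The configuration can be validated before starting Prometheus with promtool, which ships in the same tarball:
/opt/monitor/prometheus/promtool check config /opt/monitor/prometheus/prometheus.yml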
- Run "supervisorctl update" to load the newly added program configuration, then "supervisorctl status" to check whether prometheus started successfully.
- Open "http://192.168.19.69:9090/" in a browser to confirm the web UI is reachable.
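- (Optional) The same check can be scripted; Prometheus exposes simple health and readiness endpoints:
curl http://192.168.19.69:9090/-/healthy
curl http://192.168.19.69:9090/-/ready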
node_exporter setup
- Download and unpack node_exporter with the following commands:
mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvf node_exporter-1.3.1.linux-amd64.tar.gz && mv node_exporter-1.3.1.linux-amd64 node_exporter
- Start node_exporter with supervisor. The supervisor program configuration is as follows (see the "supervisor 安装" notes for installing supervisor):
[program:node_exporter]
process_name=%(program_name)s
command=/opt/monitor/node_exporter/node_exporter --web.listen-address="192.168.19.69:9111" --web.config="/opt/monitor/node_exporter/config.yaml"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/node_exporter.log
- The contents of "/opt/monitor/node_exporter/config.yaml" are shown below. Here htpasswd is used to generate the password hash (basic auth can also be omitted entirely); see "https://www.cnblogs.com/xjzyy/p/15602929.html" for generating the password, and the sketch after the snippet.
basic_auth_users:
  prometheus: your_password
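- The value stored for the prometheus user is expected to be a bcrypt hash rather than the plain-text password (node_exporter's --web.config follows the exporter-toolkit format). A sketch of generating one with htpasswd (from the apache2-utils/httpd-tools package); the plain-text password you type here is the one Prometheus will later use for basic_auth:
htpasswd -nBC 10 "" | tr -d ':\n'
# paste the resulting $2y$10$... hash as the value of "prometheus:" in config.yaml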
- Run "supervisorctl update" to load the new program configuration, then "supervisorctl status" to check whether node_exporter started successfully.
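- (Optional) Confirm the exporter answers with metrics when the correct credentials are supplied (use the plain-text password here, not the hash):
curl -u prometheus:your_password http://192.168.19.69:9111/metrics | head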
- Edit "/opt/monitor/prometheus/prometheus.yml" to add a scrape job for node_exporter, so the file becomes:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]

  - job_name: '69_node_exporter'
    scrape_interval: 5s
    scheme: http
    basic_auth:
      username: prometheus
      password: your_password # the password set for node_exporter (the plain-text one, not the hash)
    static_configs:
      - targets: ['192.168.19.69:9111']
        labels:
          instance: 192.168.19.69
- Run "supervisorctl restart prometheus" to restart Prometheus so the new configuration takes effect.
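- (Optional) After the restart, the new target should show up as healthy; this can be checked from the HTTP API as well as from the Targets page of the web UI:
curl 'http://192.168.19.69:9090/api/v1/query?query=up'
# both the prometheus and 69_node_exporter targets should report a value of 1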
Grafana installation
- Install Grafana with the following commands:
mkdir -p /opt/monitor && cd /opt/monitor
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.0.linux-amd64.tar.gz
tar xvf grafana-enterprise-8.5.0.linux-amd64.tar.gz && mv grafana-8.5.0/ grafana
cd /opt/monitor/grafana/conf && cp sample.ini grafana.ini
- Start Grafana with supervisor. The supervisor program configuration is as follows (see the "supervisor 安装" notes for installing supervisor):
[program:grafana]
process_name=%(program_name)s
directory=/opt/monitor/grafana/bin
command=/opt/monitor/grafana/bin/grafana-server -config /opt/monitor/grafana/conf/grafana.ini
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/grafana.log
- Run "supervisorctl update" to load the new program configuration, then "supervisorctl status" to check whether grafana started successfully.
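- (Optional) Grafana also exposes a health endpoint that can be used to confirm it is up:
curl http://192.168.19.69:3000/api/health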
- Log in at "http://192.168.19.69:3000/login" (the default username and password are both admin). Then open the settings, add a data source, choose Prometheus, and set the URL to "http://192.168.19.69:9090".
- Click "+" → "Import", enter dashboard ID "1860" (Node Exporter Full), select the "prometheus" data source, and import it.
Alertmanager installation
- Install Alertmanager with the following commands:
mkdir -p /opt/monitor && cd /opt/monitor
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xvf alertmanager-0.24.0.linux-amd64.tar.gz && mv alertmanager-0.24.0.linux-amd64/ alertmanager
mkdir -p /opt/monitor/alertmanager/template
- Start Alertmanager with supervisor. The supervisor program configuration is as follows (see the "supervisor 安装" notes for installing supervisor):
[program:alertmanager]
process_name=%(program_name)s
command=/opt/monitor/alertmanager/alertmanager --config.file="/opt/monitor/alertmanager/alertmanager.yml" --web.listen-address="192.168.19.69:9993" --cluster.listen-address="192.168.19.69:9994"
autostart=true
autorestart=true
user=root
redirect_stderr=true
stdout_logfile=/var/log/supervisor/alertmanager.log
- The contents of "/opt/monitor/alertmanager/alertmanager.yml" are as follows:
global:
  resolve_timeout: 5m
templates:
  - '/opt/monitor/alertmanager/template/test.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 2m
  receiver: 'wechat'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'wechat'
    # The section below configures WeChat Work (企业微信) alerting; you need to apply for an app in your WeChat Work console first
    wechat_configs:
      - send_resolved: true
        agent_id: 'your_agent'
        to_user: 'your_name'
        api_secret: 'your_api_secret'
        corp_id: 'your_corp_id'
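- (Optional) amtool, shipped in the same tarball, can validate the configuration (it also checks that the referenced template parses):
/opt/monitor/alertmanager/amtool check-config /opt/monitor/alertmanager/alertmanager.yml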
- The contents of "/opt/monitor/alertmanager/template/test.tmpl" are as follows; it defines the format of the alert messages:
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
======= Monitoring alert =========
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Application: {{ $alert.Annotations.summary }}
Host: {{ $alert.Labels.instance }}
Details: {{ $alert.Annotations.description }}
Triggering value: {{ $alert.Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========== end ==========
{{ end }}
{{ end }}
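- (Optional) To confirm the WeChat route and template work end to end, a throwaway alert can be pushed by hand with amtool (the label values here are arbitrary):
/opt/monitor/alertmanager/amtool --alertmanager.url=http://192.168.19.69:9993 alert add alertname=TestAlert severity=error instance=192.168.19.69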
- Run "supervisorctl update" to load the new program configuration, then "supervisorctl status" to check whether alertmanager started successfully.
- Edit "/opt/monitor/prometheus/prometheus.yml" again to add the Alertmanager target and the rule files, so the file becomes:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.19.69:9993']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.19.69:9090"]

  - job_name: '69_node_exporter'
    scrape_interval: 5s
    scheme: http
    basic_auth:
      username: prometheus
      password: your_password # the password set for node_exporter (the plain-text one, not the hash)
    static_configs:
      - targets: ['192.168.19.69:9111']
        labels:
          instance: 192.168.19.69
- Run "mkdir -p /opt/monitor/prometheus/rules" and create three files in that directory: dist.yml, mem.yml and unreachable.yml (the names and contents are up to you). They define the alerting rules; their contents are listed below, followed by a validation sketch.
- The contents of dist.yml:
groups:
  - name: root_dist_error
    rules:
      - alert: "硬盘报警"
        expr: 100 - (node_filesystem_avail_bytes{device="rootfs",fstype="rootfs",mountpoint="/"} / node_filesystem_size_bytes{device="rootfs",fstype="rootfs",mountpoint="/"}) * 100 > 80
        for: 60s
        labels:
          severity: error
          team: testteam
        annotations:
          summary: "root disk usage is high"
          description: "Root filesystem usage is above 80%"
          value: "{{ humanize $value }}%"
- The contents of mem.yml:
groups:
  - name: error_mem
    rules:
      - alert: "memory error"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 85
        for: 20s
        labels:
          severity: error
          team: testteam
        annotations:
          summary: "Memory usage is high"
          description: "Memory usage is above 85%"
          value: "{{ humanize $value }}%"
- The contents of unreachable.yml:
groups:
  - name: InstanceDown # group name for this set of related alerts monitoring node availability
    rules:
      - alert: InstanceDown
        expr: up == 0 # every scraped instance exposes an "up" metric: 1 means the scrape succeeded, 0 means it failed
        for: 20m # the condition must hold for this long before the alert fires
        labels: # alert severity label
          severity: error
        annotations: # human-readable annotations
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 minutes."
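- (Optional) The rule files can be validated with promtool before they are loaded:
/opt/monitor/prometheus/promtool check rules /opt/monitor/prometheus/rules/*.yml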
- Finally, run "supervisorctl restart prometheus" to restart Prometheus so the changes take effect.
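- (Optional) After the restart, the loaded rules and any currently firing alerts can be inspected through the HTTP API (or on the Alerts page of the web UI):
curl 'http://192.168.19.69:9090/api/v1/rules'
curl 'http://192.168.19.69:9090/api/v1/alerts'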