Prometheus is an open-source systems monitoring and alerting toolkit, widely used as the monitoring stack for Kubernetes clusters.

1. Prometheus Components and Architecture

The Prometheus ecosystem consists of multiple components, many of which are optional:

  • Prometheus Server: scrapes and stores time series data.
  • Client Library: client libraries that generate metrics for the service being monitored and expose them to the Prometheus server; when the server comes to pull, the current values of the metrics are returned directly.
  • Push Gateway: mainly used for short-lived jobs. Because such jobs run only briefly, they may disappear before Prometheus comes to pull, so they can push their metrics to the Push Gateway, which Prometheus then scrapes (see the curl sketch after this list). This approach is intended for service-level metrics; for machine-level metrics, use the node exporter.
  • Exporters: expose the metrics of existing third-party services to Prometheus.
  • Alertmanager: receives alerts from the Prometheus server, deduplicates and groups them, routes them to the matching receiver, and sends out the notification. Common receivers include email, PagerDuty, OpsGenie, and webhooks.
  • Various other tools.
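
As a concrete illustration of the push path, a short-lived job can hand its metrics to the Pushgateway with a plain HTTP request; Prometheus then scrapes the gateway like any other target. A minimal sketch, where the host pushgateway.example.org and the job name some_job are placeholders:

    # Push a one-off metric for job "some_job"; the request body uses the plain text exposition format
    $ echo "some_metric 3.14" | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job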

Architecture diagram

(image: Prometheus architecture diagram)

Workflow:

  1. The Prometheus server periodically pulls metrics from the configured jobs or exporters, receives metrics pushed in via the Pushgateway, or pulls metrics from other Prometheus servers.
  2. The Prometheus server stores the collected metrics locally and evaluates the configured alert.rules, recording new time series or pushing alerts to Alertmanager.
  3. Alertmanager processes the alerts it receives according to its configuration and sends out notifications.
  4. The collected data can be visualized in a graphical interface, or queried directly over HTTP (see the sketch after this list).
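
Besides the built-in web UI, the stored series can be queried over Prometheus's HTTP API, which is convenient for scripting. A minimal sketch, assuming the server installed in the next section is listening on localhost:9090:

    # Evaluate the instant query `up` (1 = last scrape of the target succeeded, 0 = it failed)
    $ curl -s 'http://localhost:9090/api/v1/query?query=up'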

2. Installing Prometheus

  • Download the Prometheus and node_exporter release tarballs and unpack them

    $ wget https://github.com/prometheus/prometheus/releases/download/v2.10.0/prometheus-2.10.0.linux-amd64.tar.gz
    (python36env) [root@localhost src]# tar xf prometheus-2.10.0.linux-amd64.tar.gz
    $ wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz
    (python36env) [root@localhost src]# tar xf node_exporter-0.18.0.linux-amd64.tar.gz
  • Start Prometheus

    By default, simply running the prometheus binary inside the unpacked directory starts the service:

    (python36env) [root@localhost src]# cd prometheus-2.10.0.linux-amd64
    (python36env) [root@localhost prometheus-2.10.0.linux-amd64]# ./prometheus
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:286 msg="no time or size retention was set so using the default time retention" duration=15d
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:322 msg="Starting Prometheus" version="(version=2.10.0, branch=HEAD, revision=d20e84d0fb64aff2f62a977adc8cfb656da4e286)"
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:323 build_context="(go=go1.12.5, user=root@a49185acd9b0, date=20190525-12:28:13)"
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:324 host_details="(Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 localhost.localdomain (none))"
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:325 fd_limits="(soft=1024, hard=4096)"
    level=info ts=2019-05-28T03:32:08.501Z caller=main.go:326 vm_limits="(soft=unlimited, hard=unlimited)"
    level=info ts=2019-05-28T03:32:08.503Z caller=main.go:645 msg="Starting TSDB ..."
    level=info ts=2019-05-28T03:32:08.579Z caller=web.go:417 component=web msg="Start listening for connections" address=0.0.0.0:9090
    level=info ts=2019-05-28T03:32:08.584Z caller=main.go:660 fs_type=XFS_SUPER_MAGIC
    level=info ts=2019-05-28T03:32:08.584Z caller=main.go:661 msg="TSDB started"
    level=info ts=2019-05-28T03:32:08.584Z caller=main.go:730 msg="Loading configuration file" filename=prometheus.yml
    level=info ts=2019-05-28T03:32:08.587Z caller=main.go:758 msg="Completed loading of configuration file" filename=prometheus.yml
    level=info ts=2019-05-28T03:32:08.588Z caller=main.go:614 msg="Server is ready to receive web requests."

    The web UI is now available at http://localhost:9090/

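    To confirm from the shell that the server is up, Prometheus also exposes a plain-text health endpoint:

    # Returns "Prometheus is Healthy." once the server is serving requests
    $ curl http://localhost:9090/-/healthy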

    By default, Prometheus reads the default configuration file shipped with the tarball:

    prometheus.yml

    # my global config
    global: # global settings; individual jobs can override them
      scrape_interval:     15s # how often to scrape targets; set here to 15s
      evaluation_interval: 15s # how often to evaluate rules; set here to 15s
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files: # files holding alerting rules
      # - "first_rules.yml"
      # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    # Configures the scrape endpoints: the targets to scrape and the parameters used to scrape them
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus' # must be globally unique; this job scrapes Prometheus's own metrics

        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.

        static_configs: # statically configured targets
        - targets: ['localhost:9090']
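
    After editing, the configuration can be sanity-checked with the promtool binary shipped in the same tarball:

    # Validate prometheus.yml before (re)starting the server
    $ ./promtool check config prometheus.yml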

    Run as a daemon

    $ daemonize -c /data/prometheus/ /data/prometheus/up.sh
    $ cat /data/prometheus/up.sh
    /data/prometheus/prometheus --web.listen-address="0.0.0.0:9090" --web.read-timeout=5m --web.max-connections=10 --storage.tsdb.retention=15d --storage.tsdb.path="data/" --query.max-concurrency=20 --query.timeout=2m

    Flag descriptions

    --web.read-timeout=5m         Maximum duration before timing out reads of a request and closing idle connections. Caps how long a connection may sit idle, preventing too many idle connections from tying up resources.

    --web.max-connections=512     Maximum number of simultaneous connections.

    --storage.tsdb.retention=15d  How long to retain samples in storage. Once Prometheus starts collecting, data accumulates in memory and on disk, so this setting matters: too long and neither disk nor memory can keep up; too short and historical data is gone when you need it. 15 days is a sensible production setting.

    --storage.tsdb.path="data/"   Base path for metrics storage. Also important: do not leave it pointing at some arbitrary location before running, or it can fill up the / root filesystem.

    --query.timeout=2m            Maximum time a query may take before being aborted.

    --query.max-concurrency=20    Maximum number of queries executed concurrently.
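
    On a systemd-based host, a unit file is a more robust alternative to daemonize, since systemd handles restarts and logging. A minimal sketch, assuming the same /data/prometheus/ layout as above:

    # Write a minimal unit file (paths and flags follow the example above)
    $ cat > /etc/systemd/system/prometheus.service <<'EOF'
    [Unit]
    Description=Prometheus server
    After=network.target

    [Service]
    ExecStart=/data/prometheus/prometheus \
        --web.listen-address="0.0.0.0:9090" \
        --storage.tsdb.retention=15d \
        --storage.tsdb.path=/data/prometheus/data/
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF
    $ systemctl daemon-reload
    $ systemctl start prometheus
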
  • Start node_exporter

    Likewise, node_exporter needs no configuration and can be run directly:

    (python36env) [root@localhost node_exporter-0.18.0.linux-amd64]# nohup ./node_exporter &
    (python36env) [root@localhost node_exporter-0.18.0.linux-amd64]# cat nohup.out
    time="2019-05-28T11:53:51+08:00" level=info msg="Starting node_exporter (version=0.18.0, branch=HEAD, revision=f97f01c46cfde2ff97b5539b7964f3044c04947b)" source="node_exporter.go:156"
    time="2019-05-28T11:53:51+08:00" level=info msg="Build context (go=go1.12.5, user=root@77cb1854c0b0, date=20190509-23:12:18)" source="node_exporter.go:157"
    time="2019-05-28T11:53:51+08:00" level=info msg="Enabled collectors:" source="node_exporter.go:97"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - arp" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - bcache" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - bonding" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - conntrack" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - cpu" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - cpufreq" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - diskstats" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - edac" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - entropy" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - filefd" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - filesystem" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - hwmon" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - infiniband" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - ipvs" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - loadavg" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - mdadm" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - meminfo" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - netclass" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - netdev" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - netstat" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - nfs" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - nfsd" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - pressure" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - sockstat" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - stat" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - textfile" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - time" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - timex" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - uname" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - vmstat" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - xfs" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg=" - zfs" source="node_exporter.go:104"
    time="2019-05-28T11:53:51+08:00" level=info msg="Listening on :9100" source="node_exporter.go:170"

    Visit http://localhost:9100/metrics to see the exported data:

    (python36env) [root@localhost node_exporter-0.18.0.linux-amd64]# curl http://localhost:9100/metrics
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 0
    go_gc_duration_seconds{quantile="0.25"} 0
    go_gc_duration_seconds{quantile="0.5"} 0
    go_gc_duration_seconds{quantile="0.75"} 0
    go_gc_duration_seconds{quantile="1"} 0
    go_gc_duration_seconds_sum 0
    go_gc_duration_seconds_count 0
    # HELP go_goroutines Number of goroutines that currently exist.

    The output is long; only the first lines are shown here.
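
    Because the endpoint returns plain text, individual metric families are easy to pick out with standard shell tools, e.g. the per-CPU time counters:

    # Show a few of the node_cpu_seconds_total series exported by node_exporter
    $ curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 4
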
  • Configure Prometheus to collect node_exporter data (pull mode)

    vim prometheus.yml

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'

        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.

        static_configs:
        - targets: ['localhost:9090']

      - job_name: 'node' # must be globally unique

        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.

        static_configs:
        - targets: ['localhost:9100']
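
    Prometheus re-reads prometheus.yml when it receives SIGHUP, so the new job can be picked up without a restart, and the up series confirms the target is being scraped:

    # Reload the configuration, then check that up{job="node"} reports 1
    $ kill -HUP $(pidof prometheus)
    $ curl -s 'http://localhost:9090/api/v1/query?query=up{job="node"}'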

    Visit http://localhost:9090/ to verify.


3. Installing Prometheus with Docker

$ docker image pull prom/prometheus

Start the Prometheus container, binding the service to port 9090 on the host:

(python36env) [root@localhost ~]# docker run -d -p 9090:9090 \
> -v /usr/local/src/prometheus-2.10.0.linux-amd64/prometheus.yml:/etc/prometheus/prometheus.yml \
> --name prometheus \
> --net=host \
> prom/prometheus
WARNING: Published ports are discarded when using host network mode
f138f3ade04bc45d564413460767a92c32640947fe7874044d1e40c3495799de
(python36env) [root@localhost ~]# docker container ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
f138f3ade04b        prom/prometheus     "/bin/prometheus --c…"   12 seconds ago      Up 11 seconds               prometheus
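
As the WARNING in the output notes, the -p 9090:9090 mapping is discarded when --net=host is used, because host networking already exposes the container's ports directly on the host. To rely on the port mapping instead, drop --net=host; a sketch of the equivalent command:

# Same container on the default bridge network, relying on the port mapping
# (remove the previous container first: docker rm -f prometheus)
$ docker run -d -p 9090:9090 \
    -v /usr/local/src/prometheus-2.10.0.linux-amd64/prometheus.yml:/etc/prometheus/prometheus.yml \
    --name prometheus \
    prom/prometheus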

Visit http://localhost:9090/
