Prometheus监控实战系列二十五： Thanos介绍

唐僧骑白马

3016人浏览 · 2023-03-27 08:47:00

唐僧骑白马 · 2023-03-27 08:47:00 发布

在前面的文章中，我们介绍了Prometheus的高可用方案。但就如文中所言，目前官方的方案多少还存在着一些问题，尤其是在大规模监控环境中的应用并不完美。鉴于Prometheus的火爆，目前市面上已有不少第三方的开源解决方案用于完善高可用、高并发和数据持久化等问题。

Thanos(灭霸）即是其中之一，也是目前较为流行的解决方案，本文将对其进行介绍。

1、产品介绍

Thanos为英国游戏技术公司Improbable 开源的一套监控解决方案，它包含多个功能组件，可以使用无侵入的方式与Prometheus配合部署，从而实现全局查询、跨集群存储等能力，能够较好地的提升Prometheus的高可用性与扩展性。

源码地址仓库：https://github.com/thanos-io/thanos

官网地址：https://thanos.io/
在这里插入图片描述
该产品具有以下特点：

1、可实现跨集群的全局查询功能；

2、兼容现有的Prometheus API 接口，从而实现无缝集成；

3、提供数据压缩和降准采样功能，提升查询速度；

4、重复数据删除与合并，可从Pormetheus HA 集群中收集指标；

5、支持多种对象存储系统，包括S3、微软Azure、腾讯COS、Google GCP、Openstack Swift 等，可支撑大规模数据的长期存储；

2、功能组件

Thanos包含以下主要功能组件：

Sidecar（边车组件）

Thanos通过该组件实现与Prometheus的集成，配置Sidecar连接Prometheus后，可读取数据给到Querier进行实时查询。另外，通过Sidecar还可以将Prometheus采集的数据上传到对象存储进行保存。

该组件必须与Prometheus运行在同一台机器或同一个Pod中。

Querier（查询组件）

该组件具有与Prometheus兼容的API并支持Prom语法，与其他组件（Sidecar或Store Gateway）一起协同工作，用于查询Prometheus的数据指标和做为Grafana的监控展示数据源。

Store Gateway（存储网关）

该组件实现与Sidercar一致的API提供给Querier进行查询，当Sidecar将数据存储到对象存储后，Prometheus会清理掉本地数据保证本地空间可用。当Querier需要调取历史数据时，则会通过Store Gateway读取对象存储中保存的数据。

Comactor（压缩组件）

主要用于对采集到的数据进行压缩和降采样，以提升对长期数据的查询效率。

Ruler（规则组件）

用于对多个Alertmanager的告警规则进行统一管理。

Receiver（接收器）

接收Promehtus的 remote-write数据，用于Receiver模式下的数据收集。

Thanos有两种运行模式，分别为Sidecar和Receiver，区别在于Sidercar主动获取Prometheus数据，而Receiver则是被动接收remote-write传送的数据。由于Receiver模式很少使用，本文不做介绍，只讲解Sidercar模式。

Sidercar模式官方架构图：
在这里插入图片描述

3、Prometheus配置

Thanos本身并不从目标实例处采集指标，监控指标的采集依然由Prometheus来完成。Thanos对Prometheus的版本有要求，需要部署2.21版本以上。

在本文示例中，我们部署两个Prometheus节点，分别为prom1和prom2，用于实现高可用的功能。

Prometheus相关的安装部署方法可参见前面的文章，此处不再过多叙述。

prom1：192.168.75.168 promethues + thanos sidecar + thanos store gateway + thanos query
prom2: 192.168.75.169 promethues + thanos sidecar + thanos store gateway + thanos query

在高可用的情况下，由于数据会有重复，需要在external_labels的标签集区分不同的实例。本文通过增加replica标签做区分，在不同的实例填不同的值，如0或1。

注意：标签区分这一步非常重要，因为Thanos的数据去重功能依赖external_labels来区分不同的实例。

prom1

global:
  scrape_interval:     30s
  evaluation_interval: 60s
  external_labels:
    env: dev
    replica: 0

prom2

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
    env: dev
    replica: 1

启动Prometheus

prometheus  \
   --config.file=/usr/local/prometheus/prometheus.yml \
   --storage.tsdb.path=/usr/local/prometheus/data \
   --web.listen-address='0.0.0.0:9090' \
   --storage.tsdb.max-block-duration=2h \
   --storage.tsdb.min-block-duration=2h \
   --storage.tsdb.retention=2h \
   --web.enable-lifecycle &

配置Prometheus.service

vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h --storage.tsdb.retention=2h --storage.tsdb.wal-compression --web.enable-lifecycle --log.level=info --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target

systemctl daemon-reload && systemctl enable prometheus && systemctl start prometheus && systemctl status prometheus

注释：--storage.tsdb.min-block-duration和--storage.tsdb.min-block-duration参数必须添加且配置相同的值，用于关闭Prometheus的本地压缩功能，否则在使用compactor在压缩数据时会出现问题，并且会在Sidecar启动时报错；--storage.tsdb.retention.time配置本地只保留两小时数据，减少空间占用；--web.enable-lifecycle 用于支持Webhook方式重新加载Prometheus配置。Sidercar会关注Prometheus的配置文件，并在出现变化时通过Webhook方式让Prometheus自动更新配置。

打开浏览器，访问http://$IP:9090，可以看到Prometheus已正常启动。

在这里插入图片描述

4、部署Thanos

4.1 集群架构

我们在上面的两个Prometheus的节点服务器中部署Sidercar，用于获取监控数据。同时，配置历史数据写入到对象存储中进行持久化保存。部署一个Store Gateway对接对象存储，而Compactor组件会定时对存储中数据进行压缩索引及降采样操作。

Querier做为面向用户的组件，对接Sidercar和Store Gateway获取数据并进行展示。（另外还有的Receiver和ruler组件由于使用不是很多，本文不做介绍，有需要可自行查阅。）

在这里插入图片描述

4.2 下载安装包

Thanos虽然有多个功能组件，但都是使用同一个二进制文件进行部署，通过不同的启动命令启用不同的功能，非常方便。

此处我们下载当前的v0.23.1版本，并将解压的二进制文件放到bin目录中。

wget https://github.com/thanos-io/thanos/releases/download/v0.23.1/thanos-0.23.1.linux-amd64.tar.gz
tar -xvf thanos-0.23.1.linux-amd64.tar.gz 
cd thanos-0.23.1.linux-amd64
mv thanos  /usr/local/bin/

4.3 Sidecar部署

Sidercar为Thanos的关键组件，通过Sidercar实现了与Prometheus的无缝集成。Sidercar对Prometheus的实例几乎没有任何影响，Prometheus依然做为一个独立的实例运行，你不需要对其配置进行修改。

Siderca提供了一个数据API，用于我们查询Prometheus中的指标数据，同时默认会将超过两个小时的数据上传到对象存储，进行备份保存。

Sidercar组件必须与Prometheus运行在同一台机器或同一个Kubernetes Pod中，启动命令如下所示：

thanos sidecar \
    --tsdb.path              /usr/local/prometheus/data/ \
    --prometheus.url         http://localhost:9090 \
    --objstore.config-file   bucket_config.yaml \
    --http-address              0.0.0.0:19191 \
    --grpc-address              0.0.0.0:19091 &

启动报错

caller=sidecar.go:159 msg="failed to fetch prometheus version.
 Is Prometheus running? Retrying" err="expected 2xx response, 
 got 405. Body: Method Not Allowed\n"

解决办法：升级Prometheus版本，prometheus-2.43.0

配置thanos-sidecar.service

vim /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=network-online.target

[Service]
Restart=on-failure
# --prometheus.url=http://localhost:9090 --> 指定本节点的 prometheus 地址
# --tsdb.path=/usr/local/prometheus/data/ --> 指定本节点的 prometheus 数据目录
ExecStart=/usr/local/thanos/thanos sidecar --prometheus.url=http://localhost:9090 --tsdb.path=/usr/local/prometheus/data/ --objstore.config-file=/root/bucket_config.yaml --grpc-address=0.0.0.0:19090 --http-address=0.0.0.0:19091

[Install]
WantedBy=multi-user.target

systemctl daemon-reload && systemctl enable thanos-sidecar && systemctl start thanos-sidecar && systemctl status thanos-sidecar

注释：–tsdb.path 用于指定Prometheus 数据存储路径；–prometheus.url 指定Prometheus访问地址；–objstore.config-file 设置上传数据的对象存储信息；–http-address 配置http端口，用于提供Sidercar的metrics信息；–grpc-address 配置grpc端口，用于与其他Thanos组件通信；

Thanos支持多种对象存储及本地文件系统，如下列表所示：

在这里插入图片描述

以阿里云的oss为例，如下为bucket.yml配置格式（阿里云OSS需要自行开通并配置access_key_id和access_key_secret）：

type: ALIYUNOSS
config:
  endpoint: "oss-cn-shenzhen.aliyuncs.com"
  bucket: "prome"
  access_key_id: "LTAI5t99S11iBx9iwv"
  access_key_secret: "BKSMK4riSg5HG6AFRY14ZRA3L"

Sidecar默认会每隔两个小时备份数据到对象存储，当Sidercar运行超过两个小时后，我们可以在对象存储中看到备份的数据。

4.4 Store Gateway部署

当Sidercar把监控数据备份到对象存储后，我们只需要在Prometheus中保留短期的数据，如数个小时。这样可以减少Prometheus的压力，也可用于应付大部分的查询。当需要查询历史数据时，我们可以通过Store Gateway来查询对象存储中保存的数据。

Store Gateway 主要与对象存储交互，从对象存储获取已经保存的数据。Store Gateway实现与Sidecar一样的数据api，Querier组件可以从Store Gateway 查询历史数据。Store Gateway支持横向扩展，可配置多个Store Gateway拉取多个对象存储数据。

启动命令：

thanos store \
    --data-dir             /usr/local/thanos/data \
    --objstore.config-file bucket_config.yaml \
    --http-address         0.0.0.0:19194 \
    --grpc-address         0.0.0.0:19094 &

配置thanos-store-gateway.service

vim /etc/systemd/system/thanos-store-gateway.service
[Unit]
Description=Thanos Store Gateway
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/thanos/thanos store --data-dir=/usr/local/thanos/data --objstore.config-file=/root/bucket_config.yaml --http-address=0.0.0.0:10902 --grpc-address=0.0.0.0:10901

[Install]
WantedBy=multi-user.target

systemctl daemon-reload && systemctl enable thanos-store-gateway && systemctl start thanos-store-gateway && systemctl status thanos-store-gateway

注释：–data-dir 配置缓存目录地址，Store Gateway会在本地磁盘上保留有关所有远程块的少量信息，并使其与存储桶保持同步；–objstore.config-file 设置对象存储信息，与Sidercar的配置一致；–http-address 配置http端口，用于提供访问界面和metrics信息；–grpc-address 配置grpc端口，用于与其他Thanos组件通信。

程序启动后，打开浏览器访问http://192.168.75.168:10902/，可以看到对象存储中已存储的块信息（需要等待2小时才会有数据上传）。
在这里插入图片描述

4.5 Querier部署

前面我们已经配置好Sidercar和Store Gateway组件，接下来我们为Thanos配置一个全局查询界面，即Querier组件。Querier连接Sidercar后，会根据给定的PromQL查询自动检测需要联系哪些Prometheus服务器，当要查询历史数据时，则是通过Store Gateway查询对象存储内容。

Querier还提供与Prometheus官方一致的HTTP API，可以接入Grafana进行监控展示，并支持PromQL语法。

启动命令如下所示：

thanos query \
    --http-address 0.0.0.0:19192 \
    --grpc-address 0.0.0.0:19092 \
    --query.replica-label replica \
    --store        192.168.75.168:19091 \ #prom1实例Sidercar地址
    --store        192.168.75.169:19091 \ #prom2实例Sidercar地址
    --store        192.168.75.168:19094 &   #Store Gateway地址

配置thanos-query.service

vim /etc/systemd/system/thanos-query.service
[Unit]
Description=Thanos Query
After=network-online.target

[Service]
Restart=on-failure
# --store=192.168.75.168:19090 --> 请求当前节点 thanos sidecar 数据
# --store=192.168.75.169:19090 --> 请求另外一个节点 thanos sidecar 数据
# --store=192.168.75.168:10901 --> 请求当前节点 thanos store gateway 数据
# --store=192.168.75.169:10901 --> 请求另外一个节点 thanos store gateway 数据
# --query.replica-label "replica" --query.replica-label "region" --> 加上后，thanos query 查询同一节点的数据时，会自动去重
ExecStart=/usr/local/thanos/thanos query --http-address=0.0.0.0:19192 --grpc-address 0.0.0.0:19092 --store=192.168.75.168:19090 --store=192.168.75.169:19090 --store=192.168.75.168:10901 --store=192.168.75.169:10901 --query.replica-label "replica" --query.replica-label "replica"

[Install]
WantedBy=multi-user.target

systemctl daemon-reload && systemctl enable thanos-query && systemctl start thanos-query && systemctl status thanos-query

注释：–query.replica-label 指定重复数据删除的标记label，query在查询数据时，将根据此label评估是否重复数据；–store 用于指定Sidercar和Store Gateway的连接地址，此示例前两个store配置了prom1和prom2的Sidercar地址，第三个则是连接到store gateway；–http-address 指定Querier的UI界面访问地址；

启动完成后，打开浏览器，访问http://$IP:19192 即可查看Querier的UI界面。

“Use Deduplication”项默认已勾选，Querier会根据指定的扩展标签进行数据去重，这保证了在用户层面不会因为高可用模式而出现重复数据的问题。

“Use Partial Response”选项用于允许部分响应的情况，这要求用户在一致性与可用性之间做选择。当某个store出现故障而无法响应时，此时是否还需要返回其他正常store的数据。如果对可用性要求更高的场景，可以勾选此项。
在这里插入图片描述

Querier兼容Prometheus的API接口，因此，Grafana可直接添加Querier组件地址做为Prometheus数据源。

在这里插入图片描述

4.6 Compactor 部署

在默认情况下，Prometheus会定期压缩旧数据以提升查询效率，但在前面的操作中，我们关闭了此功能。因此，我们需要使用Compactor组件对存储中的数据进行类似操作。

Compactor会为以下数据进行降采样操作，降采样有利于对大时间范围的查询提供快速获取结果的能力：

为大于 40 小时 (2d, 2w) 的块创建 5m 降采样；
为大于 10 天 (2w) 的块创建 1 小时降采样。

同时，Compactor也将对索引进行压缩，包括将来自同个Prometheus实例的多个块压缩为一个，这有利于对提升数据的查询效率。Compactor通过扩展标签集来区分不同的Prometheus实例，所以，在Thanos集群中不同实例的Prometheus 扩展标签集必须是唯一的，否则会导致Compactor出现错误。

启动命令如下所示：

thanos compact \
    --data-dir             /var/thanos/compact \
    --objstore.config-file bucket_config.yaml \
    --http-address         0.0.0.0:19191

注释：--data-dir 指定用于数据处理的临时工作空间，为保证Compactor正常工作，建议提供100-300G的容量；--objstore.config-file 指定对象存储的信息，Thanos的其他组件如Sdiercar、Store Gateway只需要提供对象存储的读写权限即可，但Compactor还需要提供删除权限，因为它会对数据进行操作；--http-address compact的http地址，用于提供metrics指标。

目前，Thanos还不支持多个Compactor对单个存储对象存储并发进行操作的功能，必须以单实例的方式运行compact。在任务完成后，Compactor的程序会退出，我们可将其做为定时任务的方式来运行。如果需要程序保持运行，可使用--wait和--wait-interval 参数实现。

由于对象存储本身对于数据是永久保留，如果希望只保留指定时间内的数据，可以通过配置Compactor的--retention.resolution-raw 、--retention.resolution-5m 和 --retention.resolution-1h 三个参数来实现，分别为原始数据的保留时间、5m分钟降采样数据的保留时间和1小时降采样数据保留时间，其中--retention.resolution-raw 不能小于其他两个时间段，否则会影响compact的降采样操作。

5、其他

1、prometheus 的配置参数 external_labels 和 thanos query 启动参数 --query.replica-label “replica” --query.replica-label “region” 的作用

在这里插入图片描述

2、thanos 只会保留其中一个节点上传的持久化监控数据（根据配置，要 2 小时后才会上传到阿里云 OSS）

在这里插入图片描述

3、除了实现 prometheus 高可用场景，thanos 也可以用作多个 prometheus 的汇总查询入口，只需部署 thanos sidecar 和 thanos store gateway 到相应的 prometheus 节点，然后 thanos query 通过参数 --store=ip:port 连接 thanos sidecar 和 thanos store gateway 即可

结语：

当前中文网上对于Thanos产品组件介绍的相关资料较少，并且由于产品还在不断迭代中，原有的文档往往已经不再适用。在研究的过程中笔者也只能依赖于官网的英文资料，并在实际使用中做验证。本文应该算是目前比较系统性介绍产品的一份资料，按照文中的操作方法，可以搭建出满足实际需要的集群架构。

上一篇：Prometheus监控实战系列二十四： Alertmanager集群
下一篇：待续

GitCode 开源社区

旨在为数千万中国开发者提供一个无缝且高效的云端环境，以支持学习、使用和贡献开源项目。

更多推荐

[转载]在Windows环境下安装GNU Radio

转自：在Windows环境下安装GNURadio_恐弱智_新浪博客GNU Radio是用Python开发的，大部分开源的工程能够在Linux环境下运行良好，而Windows下却运行的很勉强，而且安装配置都很复杂。GNU Radio算是个例外了，不光提供了Windows的二进制安装，还有比较详细的说明。我是Python小白，所以折腾了好久才弄好，特意记录下来，免得以后再装还折腾。GNU Radio的

GitCode 开源社区

centOS 8 使用dnf安装Docker

DNF是什么？CentOS 8使用YUM软件包管理器版本v4.0.4。现在，该版本使用DNF(已删除YUM)。DNF是软件包管理器。它会在Linux发行版上安装，执行更新并删除软件包。使用DNF安装Docker跳过具有损坏依赖性的程序包一个有效的解决方案是使您的CentOS 8系统使用以下--nobest命令安装最符合条件的版本：sudo dnf install docker...

GitCode 开源社区

定时同步数据库表(mysql+linux+crontab)

sync.sh里面的参数需要改变，ip/username/password/database/tablesync.sh#!/bin/sh# Please change the IP and password of the data source db.# Then change the table name.filename=/home/nington/db/$(date +%Y-%m