Chapter 11: Cluster Deployment

11.1 High Availability Overview

At large scale, a single Prometheus server runs into storage-capacity and query-performance bottlenecks. Clustering addresses these limits.

11.1.1 High Availability Challenges

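Challenge	Description	Solution
Data persistence	Local disk on a single node is limited	Remote storage
Global view	Multiple data centers	Federation
Query performance	Queries over many metrics are slow	Query parallelization
Availability	Single point of failure	HA deployment
Historical data	Short-term retention is not enough	Long-term storage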

11.1.2 Architecture Evolution

Single node → HA deployment → Federation → Global view
     │              │             │             │
     │              │             │             ▼
     │              │             │        Thanos/Cortex
     │              │             ▼
     │              │         Federation
     │              ▼
     └──────── HA (two replicas)

11.2 HA Deployment

11.2.1 Dual-Replica Deployment

                    ┌─────────────────┐
                    │  Load Balancer  │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              ▼                              ▼
    ┌─────────────────┐           ┌─────────────────┐
    │   Prometheus 1  │           │   Prometheus 2  │
    │   (Replica 1)   │           │   (Replica 2)   │
    └────────┬────────┘           └────────┬────────┘
              │                              │
              └──────────────┬───────────────┘
                             ▼
                    ┌─────────────────┐
                    │  Alertmanager   │
                    │    (cluster)    │
                    └─────────────────┘

11.2.2 Configuration Example

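# prometheus.yml - replica 1
global:
  external_labels:
    replica: replica-1          # the only label that differs between the replicas
    prometheus: prometheus-ha   # identical on both replicas so their series can be deduplicated

# prometheus.yml - replica 2
global:
  external_labels:
    replica: replica-2
    prometheus: prometheus-ha

# Note: drop the replica label from outgoing alerts (alerting.alert_relabel_configs)
# so that Alertmanager sees identical alerts from both replicas and deduplicates them.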
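
To illustrate why the `replica` label matters, here is a minimal Python sketch of label-based deduplication. It is a simplification under stated assumptions: the sample series and the function name `dedup_series` are hypothetical, and a real deduplicating query layer (such as Thanos Query) also merges overlapping samples instead of only collapsing label sets.

```python
def dedup_series(series, replica_label="replica"):
    """Collapse series that are identical once the replica label is removed.

    `series` is a list of label dicts; returns one label dict per logical series.
    """
    seen = {}
    for labels in series:
        # A logical series is identified by every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        seen.setdefault(key, dict(key))
    return list(seen.values())


# Hypothetical series reported by two HA replicas scraping the same target.
scraped = [
    {"__name__": "up", "job": "node", "replica": "replica-1"},
    {"__name__": "up", "job": "node", "replica": "replica-2"},
]
print(len(dedup_series(scraped)))
```

If any other external label also differs between the replicas, the two series no longer collapse into one, which is why replicas of the same pair usually differ in a single label only.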

11.2.3 Alertmanager Cluster

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook:5000/alerts'

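# Cluster configuration
# Alertmanager clustering is set with command-line flags, not in alertmanager.yml.
# Start each instance with flags like these (shown for the first peer):
#   alertmanager --config.file=alertmanager.yml \
#     --cluster.listen-address=0.0.0.0:9094 \
#     --cluster.advertise-address=prometheus1:9094 \
#     --cluster.peer=prometheus1:9094 \
#     --cluster.peer=prometheus2:9094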

11.3 Thanos Architecture

11.3.1 Thanos Components

┌─────────────────────────────────────────────────────────────┐
│                        Thanos Query                         │
│                  (global query interface)                   │
└────────────────────────────┬────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│ Thanos Sidecar│   │ Thanos Sidecar│   │ Thanos Sidecar│
│ (Prometheus 1)│   │ (Prometheus 2)│   │ (Prometheus 3)│
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                    │
        ▼                   ▼                    ▼
   ┌────────┐           ┌────────┐           ┌────────┐
   │Object  │           │Object  │           │Object  │
   │Storage │           │Storage │           │Storage │
   │(S3)    │           │(S3)    │           │(S3)    │
   └────────┘           └────────┘           └────────┘

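Component	Description
Sidecar	Runs alongside Prometheus and uploads its TSDB blocks to object storage
Store	Serves historical data from object storage (Store Gateway)
Query	Unified query entry point; supports queries across Prometheus instances
Ruler	Evaluates alerting and recording rules
Compactor	Compacts and downsamples data in object storage
Receive	Accepts data sent via remote write
Bucket	Tools for inspecting object storage (thanos tools bucket)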

11.3.2 Docker Compose Deployment

# docker-compose.yml
version: '3.8'

services:
  thanos-query:
    image: quay.io/thanos/thanos:v0.32.0
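    command:
      - query
      # Deduplicate series that differ only in the "replica" external label
      - --query.replica-label=replica
      - --store=thanos-sidecar-1:10901
      - --store=thanos-sidecar-2:10901
      # Include the store gateway so historical data in object storage is queryable
      - --store=thanos-store:10901
      - --http-address=0.0.0.0:9090
    ports:
      - "9090:9090"
    depends_on:
      - thanos-sidecar-1
      - thanos-sidecar-2
      - thanos-store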

  thanos-sidecar-1:
    image: quay.io/thanos/thanos:v0.32.0
    command:
      - sidecar
      - --prometheus.url=http://prometheus-1:9090
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --tsdb.path=/prometheus
    volumes:
      - ./bucket.yml:/etc/thanos/bucket.yml
      - prometheus1-data:/prometheus

  thanos-sidecar-2:
    image: quay.io/thanos/thanos:v0.32.0
    command:
      - sidecar
      - --prometheus.url=http://prometheus-2:9090
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --tsdb.path=/prometheus
    volumes:
      - ./bucket.yml:/etc/thanos/bucket.yml
      - prometheus2-data:/prometheus

  thanos-store:
    image: quay.io/thanos/thanos:v0.32.0
    command:
      - store
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --data-dir=/data
    volumes:
      - ./bucket.yml:/etc/thanos/bucket.yml
      - thanos-store-data:/data

  thanos-compactor:
    image: quay.io/thanos/thanos:v0.32.0
    command:
      - compact
      - --objstore.config-file=/etc/thanos/bucket.yml
      - --data-dir=/data
      - --wait
    volumes:
      - ./bucket.yml:/etc/thanos/bucket.yml
      - thanos-compactor-data:/data

  prometheus-1:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus1.yml:/etc/prometheus/prometheus.yml
      - prometheus1-data:/prometheus
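    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Disable local compaction so the sidecar can upload 2h blocks
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'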

  prometheus-2:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus2.yml:/etc/prometheus/prometheus.yml
      - prometheus2-data:/prometheus
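    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Disable local compaction so the sidecar can upload 2h blocks
      - '--storage.tsdb.min-block-duration=2h'
      - '--storage.tsdb.max-block-duration=2h'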

volumes:
  prometheus1-data:
  prometheus2-data:
  thanos-store-data:
  thanos-compactor-data:
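
Once the stack is running, Thanos Query serves a Prometheus-compatible HTTP API on the published port 9090. The following Python sketch builds an instant-query URL without sending it. The base URL `http://localhost:9090` and the helper name `thanos_query_url` are assumptions for illustration; `dedup` and `partial_response` are Thanos-specific query parameters.

```python
from urllib.parse import urlencode


def thanos_query_url(base_url, promql, dedup=True, partial_response=False):
    """Build an instant-query URL for a Thanos Query /api/v1/query endpoint."""
    params = {
        "query": promql,
        # Thanos-specific parameters: merge replica series, and control whether
        # results are returned when some stores are unavailable.
        "dedup": str(dedup).lower(),
        "partial_response": str(partial_response).lower(),
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


url = thanos_query_url("http://localhost:9090", "sum by (job) (up)")
print(url)
```

The resulting URL can be passed to any HTTP client (for example `curl`) to check that both sidecars answer through the single query endpoint.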

11.3.3 Object Storage Configuration

# bucket.yml (S3 example; the keys below are placeholder values, not real credentials)
type: S3
config:
  bucket: thanos-data
  endpoint: s3.amazonaws.com
  region: us-east-1
  access_key: AKIAIOSFODNN7EXAMPLE
  secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  signature_version2: false
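
Hardcoding credentials in `bucket.yml` is risky once the file is committed or shared. One possible approach, sketched below in Python, is to render the file from environment variables at deploy time. The variable names (`THANOS_S3_*`) and the function `render_bucket_config` are hypothetical, chosen only for this sketch.

```python
import json


def render_bucket_config(env):
    """Render a Thanos S3 object-storage config from a mapping of variables."""
    values = {
        "bucket": env.get("THANOS_S3_BUCKET", "thanos-data"),
        "endpoint": env.get("THANOS_S3_ENDPOINT", "s3.amazonaws.com"),
        "region": env.get("THANOS_S3_REGION", "us-east-1"),
        # Raises KeyError when credentials are missing instead of writing a broken file.
        "access_key": env["THANOS_S3_ACCESS_KEY"],
        "secret_key": env["THANOS_S3_SECRET_KEY"],
    }
    lines = ["type: S3", "config:"]
    # json.dumps emits double-quoted strings, which YAML also accepts,
    # so characters such as '#' or ':' in a secret are quoted safely.
    lines += [f"  {key}: {json.dumps(value)}" for key, value in values.items()]
    return "\n".join(lines) + "\n"


# Example with placeholder credentials; in practice pass os.environ.
example_env = {
    "THANOS_S3_ACCESS_KEY": "AKIAIOSFODNN7EXAMPLE",
    "THANOS_S3_SECRET_KEY": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
}
print(render_bucket_config(example_env))
```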

11.4 Federation

11.4.1 Federation Architecture

┌─────────────────────────────────────────────────────────┐
│                    Global Prometheus                    │
│         (aggregates data from all data centers)         │
└────────────────────────────┬────────────────────────────┘
                             │
     ┌───────────────────────┼───────────────────────┐
     │                       │                       │
     ▼                       ▼                       ▼
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│   US-East   │         │   US-West   │         │ EU-Central  │
│ Prometheus  │         │ Prometheus  │         │ Prometheus  │
└─────┬───────┘         └─────┬───────┘         └─────┬───────┘
      │                       │                       │
      ▼                       ▼                       ▼
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│  50+ Apps   │         │  30+ Apps   │         │  40+ Apps   │
└─────────────┘         └─────────────┘         └─────────────┘

11.4.2 Global Prometheus Configuration

# global-prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s

scrape_configs:
  # Scrape the Prometheus server in each data center
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"instance:.*"}'
        - '{__name__=~"service:.*"}'
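    static_configs:
      # One target group per data center so each series carries its origin
      - targets: ['us-east-prometheus:9090']
        labels:
          datacenter: us-east
      - targets: ['us-west-prometheus:9090']
        labels:
          datacenter: us-west
      - targets: ['eu-central-prometheus:9090']
        labels:
          datacenter: eu-central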
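
When debugging federation, it helps to see the request the global server sends to each regional server. This Python sketch reconstructs that URL from the `match[]` selectors above. The helper name `federate_url` is hypothetical, and plain HTTP is assumed.

```python
from urllib.parse import urlencode

# Selectors copied from the 'federate' job's match[] parameter.
MATCHERS = [
    '{__name__=~"job:.*"}',
    '{__name__=~"instance:.*"}',
    '{__name__=~"service:.*"}',
]


def federate_url(target, matchers):
    """Return the /federate URL requested for one regional target."""
    # Each selector is sent as its own match[] query parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return f"http://{target}/federate?{query}"


print(federate_url("us-east-prometheus:9090", MATCHERS))
```

Fetching the printed URL with an HTTP client shows exactly which series a regional server exposes to the global one.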

11.4.3 Regional Prometheus Configuration

# regional-prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Recording rules are loaded from a separate rule file (file name is an example)
rule_files:
  - federation_rules.yml

scrape_configs:
  - job_name: 'applications'
    static_configs:
      - targets: ['app1:9100', 'app2:9100', 'app3:9100']

# federation_rules.yml
# Recording rules - pre-aggregated series for the global Prometheus to federate
groups:
  - name: federation_rules
    rules:
      # Request rate (QPS) per job
      - record: job:http_requests:rate5m
        expr: |
          sum by (job) (rate(http_requests_total[5m]))

      # P99 latency per job
      - record: job:http_latency_p99:seconds
        expr: |
          histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
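
Only series whose names match the global server's `match[]` selectors (section 11.4.2) are federated, so the recording rule names have to follow the same prefix convention. The Python sketch below checks names against those patterns. It assumes the three patterns from the global configuration, and it uses Python's `re` module as a stand-in for Prometheus's RE2 engine, which behaves the same for patterns this simple.

```python
import re

# Name patterns taken from the global server's match[] selectors.
FEDERATED_PATTERNS = ["job:.*", "instance:.*", "service:.*"]


def is_federated(metric_name, patterns=FEDERATED_PATTERNS):
    """Return True if the metric name matches any federation pattern.

    Prometheus anchors regex matchers at both ends, hence fullmatch.
    """
    return any(re.fullmatch(p, metric_name) for p in patterns)


for name in ["job:http_requests:rate5m", "job:http_latency_p99:seconds", "http_requests_total"]:
    print(name, is_federated(name))
```

The two recorded series match, while the raw counter `http_requests_total` does not, so only the pre-aggregated data reaches the global server.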

11.5 Cortex Multi-Tenancy

11.5.1 Cortex Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Nginx / API Gateway                  │
└────────────────────────────┬────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│   Distributor │   │   Distributor │   │   Distributor │
│  (by tenant)  │   │               │   │               │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────────────────────────────────────────────────┐
│                        Ingesters                        │
│ (receive writes and flush them to the storage backend)  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
                 ┌───────────────────────┐
                 │  Object storage (S3)  │
                 │   (blocks storage)    │
                 └───────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Querier/Ruler  │
                    │(queries & rules)│
                    └─────────────────┘

11.5.2 Deploying Cortex with Helm

helm repo add cortex-helm https://cortexproject.github.io/cortex-helm-chart
helm repo update

helm install cortex cortex-helm/cortex \
  --namespace cortex \
  --create-namespace \
  -f values.yaml

# values.yaml
ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 10Gi

distributor:
  replicas: 2

querier:
  replicas: 2
  maxConcurrent: 8

store_gateway:
  replicas: 2

ruler:
  enabled: true
  replicas: 2

config:
  storage:
    # Blocks storage; the older chunks engine is no longer available in current Cortex releases
    engine: blocks

  blocks_storage:
    backend: s3
    s3:
      bucket_name: cortex-tsdb
      endpoint: s3.amazonaws.com
      region: us-east-1
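
Cortex separates tenants by the `X-Scope-OrgID` HTTP header carried on every read and write request. The Python sketch below builds (but does not send) a tenant-scoped instant query. The gateway host `cortex-gateway.example`, the tenant ID `team-a`, and the helper name are hypothetical, and the `/prometheus` path prefix assumes the default API prefix has not been changed.

```python
from urllib.parse import urlencode
from urllib.request import Request


def tenant_query_request(base_url, tenant_id, promql):
    """Build an instant-query request scoped to a single tenant."""
    query = urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query}"
    # The tenant is identified by the X-Scope-OrgID header.
    return Request(url, headers={"X-Scope-OrgID": tenant_id})


req = tenant_query_request("http://cortex-gateway.example", "team-a", "up")
print(req.full_url)
print(req.header_items())
```

In a real deployment the gateway in front of Cortex would normally set this header after authenticating the caller, so that clients cannot pick an arbitrary tenant.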

11.6 Capacity Planning

11.6.1 Cluster Size Estimation

Scale	Prometheus instances	Retention	Query QPS
Small	1-5	15d	< 10
Medium	5-20	30d	< 50
Large	20-100	60d	< 200
Very large	100+	90d+	200+

11.6.2 Resource Estimation Formulas

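# Storage estimate (scrape_interval in seconds; 86400 seconds per day)
daily_samples = targets * metrics_per_target * (86400 / scrape_interval)
daily_gb = daily_samples * sample_size / compression_ratio / 1e9   # sample_size in bytes

# Query estimate (refresh_rate = dashboard refreshes per second)
query_load = dashboards * queries_per_dashboard * refresh_rate

# Memory estimate (rough; label_size in bytes, query_buffer in GB)
memory_gb = (metrics_count * avg_labels * label_size) / 1e9 + query_buffer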
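
As a worked example of these estimates, here is a small Python sketch. It reads `scrape_interval` as seconds between scrapes, so the number of scrapes per day is 86400 divided by the interval. The input numbers are hypothetical, and the default of 2 bytes per stored sample is an assumption based on the 1-2 bytes per sample that Prometheus documents for its compressed TSDB; measure your own workload before relying on it.

```python
SECONDS_PER_DAY = 86_400


def daily_samples(targets, series_per_target, scrape_interval_s):
    """Samples ingested per day: each series yields one sample per scrape."""
    return targets * series_per_target * (SECONDS_PER_DAY / scrape_interval_s)


def daily_storage_gb(samples, bytes_per_sample=2.0):
    """Approximate on-disk growth per day in GB (10^9 bytes), after compression."""
    return samples * bytes_per_sample / 1e9


def query_load_qps(dashboards, queries_per_dashboard, refresh_interval_s):
    """Average queries per second generated by auto-refreshing dashboards."""
    return dashboards * queries_per_dashboard / refresh_interval_s


# Hypothetical workload: 100 targets, 1000 series each, scraped every 15s.
samples = daily_samples(targets=100, series_per_target=1_000, scrape_interval_s=15)
print(f"{samples:,.0f} samples/day")               # 576,000,000 samples/day
print(f"{daily_storage_gb(samples):.2f} GB/day")   # 1.15 GB/day
print(f"{query_load_qps(20, 10, 30):.2f} queries/s")  # 6.67 queries/s
```

Multiplying the daily figure by the retention period from section 11.6.1 gives a first estimate of the disk or object-storage footprint.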

11.7 Chapter Summary

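This chapter covered the options for deploying Prometheus as a cluster:

High availability overview - challenges and architecture evolution
HA deployment - dual replicas and the Alertmanager cluster
Thanos architecture - components and deployment configuration
Federation - aggregation across data centers
Cortex - a multi-tenant solution
Capacity planning - sizing and resource estimation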

📖 Next Steps

Chapter 12: Security Configuration - security hardening
Chapter 13: Best Practices - engineering practices