Kubernetes Observability: Metrics, Logging, and Distributed Tracing

1. The Three Pillars of Observability

┌─────────────────────────────────────────────────────────┐
│                      Observability                      │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │   Metrics   │  │     Logs     │  │    Traces     │   │
│  │             │  │              │  │               │   │
│  │ Prometheus  │  │ EFK/Loki     │  │ Jaeger/Tempo  │   │
│  │ Grafana     │  │ Elasticsearch│  │ OpenTelemetry │   │
│  └─────────────┘  └──────────────┘  └───────────────┘   │
│                                                         │
│  What happened?   Why did it       Where did it         │
│                   happen?          happen?              │
└─────────────────────────────────────────────────────────┘

2. The Prometheus Monitoring Stack

2.1 Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Collection layer                     │
│  Node Exporter       → node metrics                     │
│  kube-state-metrics  → K8s object state metrics         │
│  cAdvisor (built into the kubelet) → container metrics  │
│  Application custom metrics (/metrics endpoint)         │
└──────────────────────────┬──────────────────────────────┘
                           │ Scrape (pull)
┌──────────────────────────▼──────────────────────────────┐
│                   Prometheus Server                     │
│   ├── TSDB (time-series database)                       │
│   ├── PromQL query engine                               │
│   └── Alertmanager integration                          │
└──────────────────────────┬──────────────────────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
           Grafana    Alertmanager  Remote storage
        (dashboards) (alert routing) (Thanos/Cortex)

2.2 Deploying kube-prometheus-stack

# Deploy the complete monitoring stack with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

2.3 ServiceMonitor (Automatic Discovery)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
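
The endpoint a ServiceMonitor scrapes must serve Prometheus's plain-text exposition format. Here is a minimal sketch using only Go's standard library; a real application would normally use the `prometheus/client_golang` library, and the `http_requests_total` counter here is purely illustrative:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// requestCount is a hand-rolled counter for illustration only.
var requestCount atomic.Int64

// metricsText renders the counter in the Prometheus text exposition
// format that a scrape of /metrics expects.
func metricsText() string {
	var b strings.Builder
	b.WriteString("# HELP http_requests_total Total HTTP requests.\n")
	b.WriteString("# TYPE http_requests_total counter\n")
	fmt.Fprintf(&b, "http_requests_total %d\n", requestCount.Load())
	return b.String()
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestCount.Add(1) // count application traffic
		fmt.Fprintln(w, "ok")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, metricsText())
	})
	return mux
}

func main() {
	// In a real pod you would call http.ListenAndServe(":8080", newMux());
	// a test server simulates one scrape cycle here so the sketch terminates.
	srv := httptest.NewServer(newMux())
	defer srv.Close()

	http.Get(srv.URL + "/") // one application request
	resp, _ := http.Get(srv.URL + "/metrics")
	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body))
}
```

The Service fronting this pod would name its 8080 port `metrics` so the `port: metrics` selector above matches it.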

2.4 Key Kubernetes Monitoring Metrics

# ===== Cluster level =====

# Node CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Node disk utilization
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# ===== Pod level =====

# Pod CPU usage (millicores)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) * 1000

# Pod memory usage
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)

# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])

# ===== Application level =====

# HTTP request QPS
rate(http_requests_total[5m])

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# HTTP P99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# ===== etcd =====

# etcd leader changes
increase(etcd_server_leader_changes_seen_total[1h])

# etcd disk latency
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

2.5 Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        # Pod restarting frequently
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour"

        # Node memory pressure
        - alert: NodeMemoryPressure
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory usage is above 90%"

        # PVC almost full
        - alert: PVCAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 10% free space"

        # API server latency
        - alert: APIServerHighLatency
          expr: |
            histogram_quantile(0.99,
              rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "API server P99 latency is above 1s"

2.6 Alertmanager Configuration

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'  # catch-all referenced by route.receiver above

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#k8s-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

3. Grafana Dashboards

3.1 Recommended Dashboards

Kubernetes cluster overview: ID 315
Node Exporter Full: ID 1860
Kubernetes pod monitoring: ID 6417
etcd monitoring: ID 3070

3.2 Custom Dashboard Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)"
      }
    ]
  }
}

4. Logging

4.1 Log Collection Architecture

┌─────────────────────────────────────────────────────────┐
│                      Log sources                        │
│  ├── Container stdout/stderr → /var/log/containers/     │
│  ├── System logs (journald)                             │
│  └── Application log files                              │
└──────────────────────────┬──────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────┐
│               Log collection (DaemonSet)                │
│        Fluentd / Fluent Bit / Filebeat / Vector         │
└──────────────────────────┬──────────────────────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
      Elasticsearch       Loki      ClickHouse
     (full-text search) (low cost) (fast analytics)
                           │
                           ▼
                   Kibana / Grafana

4.2 Fluent Bit Configuration

# Fluent Bit ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name              es
        Match             *
        Host              elasticsearch.logging.svc.cluster.local
        Port              9200
        Index             kubernetes
        Type              _doc
        Logstash_Format   On
        Logstash_Prefix   kubernetes

4.3 Loki (a Lightweight Logging Option)

# Loki query examples (LogQL)
# Error logs from the production namespace
{namespace="production"} |= "ERROR"

# Error log rate
rate({namespace="production"} |= "ERROR" [5m])

# Parse JSON logs
{app="my-app"} | json | level="error" | line_format "{{.message}}"

# Count HTTP status codes
{app="nginx"} | pattern `<_> "<method> <_> <_>" <status> <_>`
  | status >= 500
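
The `| json | level="error"` pipeline can be illustrated outside Loki with a toy Go version of the error-filtering query. The field names `level` and `message` are assumptions about the application's log schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// filterErrors mimics the LogQL pipeline
//   {app="my-app"} | json | level="error" | line_format "{{.message}}"
// over an in-memory slice of raw log lines.
func filterErrors(lines []string) []string {
	var out []string
	for _, raw := range lines {
		var rec map[string]string
		if json.Unmarshal([]byte(raw), &rec) != nil {
			continue // skip non-JSON lines, as `| json` would
		}
		if rec["level"] == "error" {
			out = append(out, rec["message"])
		}
	}
	return out
}

func main() {
	logs := []string{
		`{"level":"info","message":"started"}`,
		`{"level":"error","message":"db timeout"}`,
		"plain text line",
	}
	fmt.Println(strings.Join(filterErrors(logs), "\n")) // → db timeout
}
```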

5. Distributed Tracing

5.1 The OpenTelemetry Standard

// Instrumenting a Go application with OpenTelemetry
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func handleRequest(ctx context.Context, req *Request) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "handleRequest")
	defer span.End()

	span.SetAttributes(
		attribute.String("user.id", req.UserID),
		attribute.String("request.method", req.Method),
	)

	// Call a downstream service (trace context propagates automatically)
	result, err := callDownstream(ctx, req)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
	_ = result
}
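
The automatic propagation mentioned in the comment works by attaching a W3C `traceparent` header (`version-traceid-spanid-flags`) to outgoing requests, which the downstream service's instrumentation reads back. A minimal stdlib-only sketch of parsing that header format:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its trace ID,
// span ID, and sampled flag. Real instrumentation would also validate
// that the IDs are hex and non-zero; this sketch checks lengths only.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	traceID, spanID, sampled, err := parseTraceparent(h)
	if err != nil {
		panic(err)
	}
	fmt.Println(traceID, spanID, sampled)
	// → 4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7 true
}
```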

5.2 Deploying Jaeger

# Simple all-in-one deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger  # must match spec.selector
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.52
          ports:
            - containerPort: 16686  # UI
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"

5.3 OpenTelemetry Collector

# OTel Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  resource:
    attributes:
      - key: k8s.cluster.name
        value: production
        action: insert

exporters:
  # Note: recent Collector releases dropped the dedicated jaeger exporter;
  # with them, send OTLP to Jaeger's port 4317 instead.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

6. Monitoring Kubernetes Events

# View cluster events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Warning events only
kubectl get events --field-selector type=Warning

# Watch events as they happen
kubectl get events -w
# Ship events to Elasticsearch with kube-eventer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-eventer
spec:
  selector:
    matchLabels:
      app: kube-eventer
  template:
    metadata:
      labels:
        app: kube-eventer  # must match spec.selector
    spec:
      containers:
        - name: kube-eventer
          image: registry.aliyuncs.com/acs/kube-eventer-amd64:v1.2.0
          command:
            - /kube-eventer
            - --source=kubernetes:https://kubernetes.default
            - --sink=elasticsearch:http://elasticsearch:9200?index=kube-events

7. SLO/SLA Monitoring

# Defining an SLO with Sloth
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-app-slo
spec:
  service: my-app
  slos:
    - name: requests-availability
      objective: 99.9  # 99.9% availability
      description: "HTTP request success rate"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{status=~"5..",job="my-app"}[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{job="my-app"}[{{.window}}]))
      alerting:
        name: MyAppHighErrorRate
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning

8. Observability Best Practices

8.1 The USE Method (Resource Monitoring)

Utilization: how much of the resource is in use?
Saturation: is the resource overloaded?
Errors: are there errors?

8.2 The RED Method (Service Monitoring)

Rate: requests per second
Errors: error rate
Duration: latency distribution of requests

8.3 The Four Golden Signals

Latency: time taken to serve a request
Traffic: load on the system
Errors: error rate
Saturation: how heavily resources are used

9. Summary

A complete Kubernetes observability stack:

  1. Metrics: Prometheus + Grafana, covering the cluster, node, Pod, and application layers
  2. Logs: Fluent Bit + Loki/Elasticsearch, structured logging plus full-text search
  3. Traces: OpenTelemetry + Jaeger for distributed request tracing
  4. Alerts: multi-channel Alertmanager routing with SLO-based alerting policies
  5. Events: Kubernetes Events monitoring to catch cluster anomalies early

Observability is not an afterthought; it is part of the system's design.


Kubernetes Observability: Metrics, Logging, and Distributed Tracing
https://k8s.chucz.asia/Kubernetes可观测性体系/
Author: K8s Engineer
Published: January 21, 2026