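kube-controller-manager Deep Dive: The Cluster's Self-Healing Brain

1. Controller Pattern

The controller pattern is the central design pattern in Kubernetes. At its core it is a reconcile loop that never terminates: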
    ┌──────────────────────────────────────────────────┐
    │                  Reconcile Loop                  │
    │                                                  │
    │  ┌──► Observe                                    │
    │  │      read the current state from apiserver    │
    │  │        │                                      │
    │  │        ▼                                      │
    │  │    Diff                                       │
    │  │      current state vs. desired state          │
    │  │        │                                      │
    │  │        ▼                                      │
    │  │    Act                                        │
    │  │      create/update/delete resources           │
    │  │      so the state converges                   │
    │  │        │                                      │
    │  └────────┘                                      │
    └──────────────────────────────────────────────────┘
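The observe/diff/act cycle can be sketched in a few lines of Go. This is only an illustration, not Kubernetes source: the `Action` type and the `reconcile` function are hypothetical names, and cluster state is modeled as a plain map from resource name to a spec string.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is one change needed to move the current state toward the desired state.
type Action struct {
	Op   string // "create", "update" or "delete"
	Name string
}

// reconcile compares desired and current state (name -> spec) and returns the
// actions that would make them converge. It only computes the diff; a real
// controller would issue the matching API calls and then observe again.
func reconcile(desired, current map[string]string) []Action {
	var actions []Action
	for name, spec := range desired {
		cur, ok := current[name]
		switch {
		case !ok:
			actions = append(actions, Action{"create", name})
		case cur != spec:
			actions = append(actions, Action{"update", name})
		}
	}
	for name := range current {
		if _, ok := desired[name]; !ok {
			actions = append(actions, Action{"delete", name})
		}
	}
	// Sort by name so the result is deterministic; Go map iteration order is not.
	sort.Slice(actions, func(i, j int) bool { return actions[i].Name < actions[j].Name })
	return actions
}

func main() {
	desired := map[string]string{"web": "v2", "api": "v1"}
	current := map[string]string{"web": "v1", "old": "v1"}
	for _, a := range reconcile(desired, current) {
		fmt.Println(a.Op, a.Name)
	}
}
```

Because the function is driven only by the difference between the two states, calling it again after the actions have been applied returns nothing, which is what makes the loop safe to repeat.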
2. kube-controller-manager Architecture

kube-controller-manager is a single process that runs dozens of controllers internally:

    kube-controller-manager
    ├── DeploymentController
    ├── ReplicaSetController
    ├── StatefulSetController
    ├── DaemonSetController
    ├── JobController
    ├── CronJobController
    ├── NodeController
    ├── ServiceController
    ├── EndpointController
    ├── NamespaceController
    ├── ServiceAccountController
    ├── PersistentVolumeController
    ├── GarbageCollectorController
    ├── HorizontalPodAutoscalerController
    └── ... (30+ controllers in total)
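High Availability: Leader Election

When several controller-manager instances run at the same time, leader election ensures that only one of them actively reconciles while the others stand by:

    # Inspect the Lease that records the current leader
    kubectl get lease -n kube-system kube-controller-manager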
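The idea behind lease-based leader election can be sketched as follows. This is a simplified model under stated assumptions, not the client-go implementation: the `Lease` struct and `tryAcquireOrRenew` are hypothetical, time is passed in as plain seconds, and the optimistic-concurrency check that a real update against the apiserver relies on is left out.

```go
package main

import "fmt"

// Lease is a simplified stand-in for a coordination.k8s.io Lease object.
type Lease struct {
	Holder          string
	RenewTimeSec    int64 // time of the last renewal, in seconds
	DurationSeconds int64 // how long a renewal stays valid
}

// tryAcquireOrRenew reports whether candidate holds the lease after the call.
// A candidate wins if the lease is free, already its own, or has expired.
func tryAcquireOrRenew(l *Lease, candidate string, nowSec int64) bool {
	expired := nowSec > l.RenewTimeSec+l.DurationSeconds
	if l.Holder == "" || l.Holder == candidate || expired {
		l.Holder = candidate
		l.RenewTimeSec = nowSec
		return true
	}
	return false
}

func main() {
	lease := &Lease{DurationSeconds: 15}
	fmt.Println(tryAcquireOrRenew(lease, "cm-a", 0))  // lease is free, cm-a becomes leader
	fmt.Println(tryAcquireOrRenew(lease, "cm-b", 5))  // lease is still valid, cm-b stays standby
	fmt.Println(tryAcquireOrRenew(lease, "cm-b", 20)) // cm-a stopped renewing, cm-b takes over
}
```

The leader has to keep renewing before the lease duration runs out; a standby instance takes over only after the lease has expired.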
3. Deployment Controller

Deployment is the most commonly used workload. Behind it, two controllers cooperate:

    Deployment Controller
         │
         │ manages
         ▼
    ReplicaSet Controller
         │
         │ manages
         ▼
    Pods
3.1 How Rolling Updates Work

    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1        # at most 1 Pod above the desired count
          maxUnavailable: 0  # never drop below the desired count
The rolling update proceeds as follows:

    Initial state: RS-v1 (3 pods)

    Step 1: create RS-v2 and scale it up to 1 Pod
      RS-v1: 3 pods   RS-v2: 1 pod    (total 4, 1 over)
    Step 2: scale RS-v1 down to 2 Pods
      RS-v1: 2 pods   RS-v2: 1 pod    (total 3, normal)
    Step 3: scale RS-v2 up to 2 Pods
      RS-v1: 2 pods   RS-v2: 2 pods   (total 4, 1 over)
    Step 4: scale RS-v1 down to 1 Pod
      RS-v1: 1 pod    RS-v2: 2 pods   (total 3, normal)
    ...
    until RS-v1: 0, RS-v2: 3
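The step sequence can be reproduced with a small simulation. This is a sketch under simplifying assumptions, not the Deployment controller's actual algorithm: `rollingUpdate` and `Step` are hypothetical names, and every new Pod is assumed to become Ready immediately, so readiness probes and failed rollouts are ignored.

```go
package main

import "fmt"

// Step is the Pod count of the old and new ReplicaSet after one scaling action.
type Step struct{ Old, New int }

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// rollingUpdate alternates scale-up and scale-down so that the total never
// exceeds replicas+maxSurge and never falls below replicas-maxUnavailable.
func rollingUpdate(replicas, maxSurge, maxUnavailable int) []Step {
	// Negative values are invalid, and 0/0 would leave no room to make progress.
	if maxSurge < 0 || maxUnavailable < 0 || (maxSurge == 0 && maxUnavailable == 0) {
		return nil
	}
	oldPods, newPods := replicas, 0
	var steps []Step
	for oldPods > 0 || newPods < replicas {
		// Scale the new ReplicaSet up as far as the surge budget allows.
		if up := minInt(replicas-newPods, replicas+maxSurge-(oldPods+newPods)); up > 0 {
			newPods += up
			steps = append(steps, Step{oldPods, newPods})
		}
		// Scale the old ReplicaSet down as far as availability allows.
		if down := minInt(oldPods, oldPods+newPods-(replicas-maxUnavailable)); down > 0 {
			oldPods -= down
			steps = append(steps, Step{oldPods, newPods})
		}
	}
	return steps
}

func main() {
	for i, s := range rollingUpdate(3, 1, 0) {
		fmt.Printf("step %d: RS-v1=%d RS-v2=%d total=%d\n", i+1, s.Old, s.New, s.Old+s.New)
	}
}
```

With 3 replicas, `maxSurge: 1` and `maxUnavailable: 0`, the simulation takes six steps and the total stays between 3 and 4 Pods throughout.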
3.2 Rollback

    # Show the revision history
    kubectl rollout history deployment/nginx

    # Roll back to the previous revision
    kubectl rollout undo deployment/nginx

    # Roll back to a specific revision
    kubectl rollout undo deployment/nginx --to-revision=2

    # Pause and resume a rollout
    kubectl rollout pause deployment/nginx
    kubectl rollout resume deployment/nginx

3.3 Deployment Status

    # Watch the rollout until it completes
    kubectl rollout status deployment/nginx
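4. ReplicaSet Controller

A ReplicaSet ensures that the specified number of Pod replicas is always running:

    // Simplified sketch of the sync logic, not the real controller source.
    func (c *ReplicaSetController) syncReplicaSet(rs *ReplicaSet) {
    	// Pods currently owned by this ReplicaSet.
    	currentPods := c.getPodsForRS(rs)

    	// Positive diff: too few Pods. Negative diff: too many.
    	diff := rs.Spec.Replicas - len(currentPods)

    	if diff > 0 {
    		// Create the missing Pods.
    		c.createPods(diff, rs)
    	} else if diff < 0 {
    		// Delete the surplus Pods.
    		c.deletePods(-diff, currentPods)
    	}
    }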
Pod Selector (Label Selector)

    spec:
      selector:
        matchLabels:
          app: nginx
          version: v1
      template:
        metadata:
          labels:
            app: nginx
            version: v1
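Note: a ReplicaSet finds its Pods through the label selector and records the Pods it has claimed in their ownerReferences. If you manually change a Pod's labels so that they no longer match the selector, the ReplicaSet releases the Pod and creates a replacement.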
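The `matchLabels` semantics are easy to state in code: every key/value pair in the selector must be present on the Pod, and extra Pod labels are ignored. The `matches` helper below is a hypothetical illustration of that rule, not the apimachinery implementation, and the `pod-template-hash` value is made up.

```go
package main

import "fmt"

// matches reports whether labels satisfy every key/value pair in selector.
// Labels that the selector does not mention are ignored.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "nginx", "version": "v1"}
	pod := map[string]string{"app": "nginx", "version": "v1", "pod-template-hash": "abc123"}

	fmt.Println(matches(selector, pod)) // the Pod is selected

	// Relabeling the Pod by hand takes it out of the selector's reach.
	pod["version"] = "debug"
	fmt.Println(matches(selector, pod))
}
```

Relabeling is a common way to pull a misbehaving Pod out of rotation for debugging while the ReplicaSet starts a fresh replica.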
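5. StatefulSet Controller

A StatefulSet gives stateful applications:

- Stable network identity: pod-0, pod-1, pod-2
- Stable storage: each Pod has its own PVC
- Ordered deployment and scaling: operations follow the ordinal order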
5.1 Pod Naming

    {StatefulSet name}-{ordinal}
    mysql-0, mysql-1, mysql-2
5.2 Headless Service

A StatefulSet is used together with a Headless Service:

    apiVersion: v1
    kind: Service
    metadata:
      name: mysql
    spec:
      clusterIP: None
      selector:
        app: mysql
      ports:
      - port: 3306

DNS resolution:

    mysql-0.mysql.default.svc.cluster.local → Pod-0 IP
    mysql-1.mysql.default.svc.cluster.local → Pod-1 IP
    mysql-2.mysql.default.svc.cluster.local → Pod-2 IP
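The per-Pod DNS names follow a fixed pattern, which the sketch below generates. `podDNSNames` is a hypothetical helper, and `cluster.local` is assumed as the cluster domain (a common default that a cluster may override).

```go
package main

import "fmt"

// podDNSNames builds the stable DNS name of every Pod in a StatefulSet:
// {statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local
func podDNSNames(statefulSet, service, namespace string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names,
			fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local", statefulSet, i, service, namespace))
	}
	return names
}

func main() {
	for _, n := range podDNSNames("mysql", "mysql", "default", 3) {
		fmt.Println(n)
	}
}
```

Because the name depends only on the ordinal, a rescheduled `mysql-1` keeps the same DNS name even though its IP changes.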
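5.3 Ordered Operations

    spec:
      # OrderedReady (the default): Pods are created and removed one at a time, in ordinal order
      podManagementPolicy: OrderedReady
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          # Only Pods with an ordinal >= 2 are updated
          partition: 2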
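The effect of `partition` can be sketched as a function that lists which ordinals a rolling update touches. `ordinalsToUpdate` is a hypothetical helper; it assumes the documented behavior that a StatefulSet rolling update works from the highest ordinal down and skips every Pod whose ordinal is below the partition.

```go
package main

import "fmt"

// ordinalsToUpdate returns the Pod ordinals a partitioned rolling update
// touches, in processing order (highest ordinal first).
func ordinalsToUpdate(replicas, partition int) []int {
	var out []int
	for i := replicas - 1; i >= partition && i >= 0; i-- {
		out = append(out, i)
	}
	return out
}

func main() {
	fmt.Println(ordinalsToUpdate(5, 2)) // ordinals 4, 3, 2 are updated; 0 and 1 keep the old revision
	fmt.Println(ordinalsToUpdate(3, 3)) // partition >= replicas: nothing is updated
}
```

Lowering the partition step by step is a simple way to run a canary: update the highest ordinal first, verify it, then let the update reach the remaining Pods.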
5.4 PVC Template

    spec:
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 10Gi

By default, deleting a StatefulSet does not delete its PVCs; they have to be cleaned up manually.
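6. DaemonSet Controller

A DaemonSet ensures that one Pod replica runs on every node (or on every selected node).

Typical Use Cases

    Log collection:     Fluentd, Filebeat
    Monitoring agents:  Prometheus Node Exporter, Datadog Agent
    Network plugins:    Calico, Flannel, Cilium
    Storage plugins:    Ceph, GlusterFS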
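Node Selection

    spec:
      selector:
        matchLabels:
          app: fluentd
      template:
        spec:
          nodeSelector:
            kubernetes.io/os: linux
          tolerations:
          # Also run on control-plane nodes. Older clusters use the "master" taint;
          # newer ones use "control-plane", so both are tolerated here.
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule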
Update Strategy

    spec:
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
7. Job and CronJob Controllers

7.1 Job

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: data-migration
    spec:
      completions: 5              # successful Pods required
      parallelism: 2              # Pods running at the same time
      backoffLimit: 3             # retries before the Job is marked failed
      activeDeadlineSeconds: 600  # overall time limit for the Job
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: batch-worker:v1
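How `completions` and `parallelism` interact can be sketched with a small calculation. `podsToStart` is a hypothetical helper for a Job with a fixed completion count; it ignores failures, `backoffLimit` and deadlines, so treat it as an approximation of the controller's bookkeeping rather than its real code.

```go
package main

import "fmt"

// podsToStart returns how many new Pods should be started so that the number
// of active Pods equals min(parallelism, remaining completions).
func podsToStart(completions, parallelism, succeeded, active int) int {
	wantActive := completions - succeeded
	if wantActive > parallelism {
		wantActive = parallelism
	}
	if n := wantActive - active; n > 0 {
		return n
	}
	return 0
}

func main() {
	fmt.Println(podsToStart(5, 2, 0, 0)) // fresh Job: start up to the parallelism limit
	fmt.Println(podsToStart(5, 2, 4, 0)) // one completion left: a single Pod is enough
	fmt.Println(podsToStart(5, 2, 2, 2)) // already running at the limit: start nothing
}
```

With `completions: 5` and `parallelism: 2`, the Job therefore runs in waves of at most two Pods, and the last wave shrinks to one.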
7.2 CronJob

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: backup
    spec:
      schedule: "0 2 * * *"          # every day at 02:00
      concurrencyPolicy: Forbid      # skip a run while the previous one is still active
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 1
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
              - name: backup
                image: backup-tool:v1
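8. Node Controller

The Node controller manages the node lifecycle:

8.1 Node Status Monitoring

    # Default timings; all of them are tunable through flags
    kubelet updates the Node status every 10s
    The Node controller checks node status every 5s
    No update for 40s → node is marked Unknown
    No update for 5 min → Pod eviction begins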
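8.2 Eviction Rate Limiting

    # Nodes evicted per second while the zone is healthy (0.1 = one node every 10s)
    --node-eviction-rate=0.1
    # Reduced rate that applies once a zone is unhealthy
    --secondary-node-eviction-rate=0.01
    # Fraction of unhealthy nodes at which a zone counts as unhealthy
    --unhealthy-zone-threshold=0.55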
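How the three flags combine can be sketched as a rate-selection function. This follows the behavior described in the Kubernetes documentation as best understood here, not the controller source. The `Config` struct and `evictionRate` are hypothetical names, and `LargeClusterSizeThreshold` is an assumption that does not appear in the flag list above: it models `--large-cluster-size-threshold` (default 50), at or below which evictions stop instead of slowing down.

```go
package main

import "fmt"

// Config holds the eviction-related settings used by the sketch.
type Config struct {
	NodeEvictionRate          float64 // --node-eviction-rate
	SecondaryNodeEvictionRate float64 // --secondary-node-eviction-rate
	UnhealthyZoneThreshold    float64 // --unhealthy-zone-threshold
	LargeClusterSizeThreshold int     // assumed: --large-cluster-size-threshold
}

// evictionRate picks the number of nodes per second that may be evicted,
// given how many of the total nodes are unhealthy.
func evictionRate(unhealthy, total int, cfg Config) float64 {
	if total == 0 {
		return cfg.NodeEvictionRate
	}
	fraction := float64(unhealthy) / float64(total)
	if fraction < cfg.UnhealthyZoneThreshold {
		return cfg.NodeEvictionRate // healthy: normal rate
	}
	if total <= cfg.LargeClusterSizeThreshold {
		return 0 // small and unhealthy: stop evicting
	}
	return cfg.SecondaryNodeEvictionRate // large and unhealthy: slow down
}

func main() {
	cfg := Config{0.1, 0.01, 0.55, 50}
	fmt.Println(evictionRate(1, 10, cfg))   // mostly healthy
	fmt.Println(evictionRate(60, 100, cfg)) // large, 60% unhealthy
	fmt.Println(evictionRate(6, 10, cfg))   // small, 60% unhealthy
}
```

The design intent is to avoid making a bad situation worse: when most nodes look unhealthy at once, the cause is more likely a network or control-plane problem than dozens of simultaneous node failures, so evictions are throttled.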
9. HPA (HorizontalPodAutoscaler) Controller

The HPA adjusts the number of Pod replicas automatically, based on metrics:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: AverageValue
            averageValue: 500Mi
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300
HPA Formula

    desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))

    Example: 3 replicas, CPU utilization 90%, target 70%
    desiredReplicas = ceil(3 × (90 / 70)) = ceil(3.86) = 4
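The formula translates directly into Go. The sketch below adds two things the formula leaves out, both stated as assumptions: a tolerance band of 0.1 around a ratio of 1.0 (the documented default, inside which no scaling happens) and clamping to `minReplicas`/`maxReplicas`. `desiredReplicas` is a hypothetical helper; the real controller also handles missing metrics, unready Pods and the `behavior` policies.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the assumed default: ratios within 10% of 1.0 cause no scaling.
const tolerance = 0.1

// desiredReplicas applies ceil(current × currentMetric/targetMetric) and
// clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	ratio := currentMetric / targetMetric
	desired := current
	if math.Abs(ratio-1.0) > tolerance {
		desired = int(math.Ceil(float64(current) * ratio))
	}
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(desiredReplicas(3, 90, 70, 2, 20))   // the worked example: ceil(3.86)
	fmt.Println(desiredReplicas(3, 72, 70, 2, 20))   // within tolerance: unchanged
	fmt.Println(desiredReplicas(10, 700, 70, 2, 20)) // capped at maxReplicas
}
```

When several metrics are configured, as in the manifest with CPU and memory, the HPA evaluates each one separately and uses the largest result.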
10. GarbageCollector Controller

The garbage collector controller cleans up orphaned resources (objects whose ownerReference points to an owner that has already been deleted).

ownerReference

    metadata:
      ownerReferences:
      - apiVersion: apps/v1
        kind: ReplicaSet
        name: nginx-rs-abc123
        uid: xxx-yyy-zzz
        controller: true
        blockOwnerDeletion: true
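Deletion Policies

    # Default (background): delete the Deployment; the garbage collector then removes its ReplicaSets and Pods
    kubectl delete deployment nginx

    # Orphan: delete only the Deployment and leave its ReplicaSets and Pods in place
    kubectl delete deployment nginx --cascade=orphan

    # Foreground: delete the dependents first, then the Deployment itself
    kubectl delete deployment nginx --cascade=foreground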
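Cascading deletion amounts to walking the ownerReference graph. The sketch below is a simplified model, not the garbage collector's implementation: `Object` and `dependentsOf` are hypothetical, each object has at most one owner, and UIDs are plain strings.

```go
package main

import "fmt"

// Object is a simplified API object: a UID plus the UID of its owner ("" if none).
type Object struct {
	UID      string
	OwnerUID string
}

// dependentsOf returns every object that a cascading delete of root would
// remove, following ownerReferences transitively (Deployment -> ReplicaSet -> Pod).
func dependentsOf(root string, objects []Object) []string {
	seen := map[string]bool{root: true} // guards against ownership cycles
	queue := []string{root}
	var out []string
	for len(queue) > 0 {
		owner := queue[0]
		queue = queue[1:]
		for _, o := range objects {
			if o.OwnerUID == owner && !seen[o.UID] {
				seen[o.UID] = true
				out = append(out, o.UID)
				queue = append(queue, o.UID)
			}
		}
	}
	return out
}

func main() {
	objects := []Object{
		{"deploy", ""},
		{"rs", "deploy"},
		{"pod-1", "rs"},
		{"pod-2", "rs"},
		{"unrelated", ""},
	}
	fmt.Println(dependentsOf("deploy", objects)) // the ReplicaSet and both Pods
}
```

With `--cascade=orphan` this walk is skipped: the dependents stay in place and only their owner is removed.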
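11. Custom Controllers (Operator Pattern)

Custom controllers can be built on the controller-runtime framework:

    import (
    	"context"

    	appsv1 "k8s.io/api/apps/v1"
    	apierrors "k8s.io/apimachinery/pkg/api/errors"
    	"k8s.io/apimachinery/pkg/runtime"
    	ctrl "sigs.k8s.io/controller-runtime"
    	"sigs.k8s.io/controller-runtime/pkg/client"

    	myv1 "example.com/myapp/api/v1" // hypothetical module path for the MyApp CRD types
    )

    // MyAppReconciler reconciles MyApp custom resources.
    // createDeployment and updateDeployment are helper methods not shown here.
    type MyAppReconciler struct {
    	client.Client
    	Scheme *runtime.Scheme
    }

    func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    	// 1. Fetch the custom resource; it may already have been deleted.
    	myApp := &myv1.MyApp{}
    	if err := r.Get(ctx, req.NamespacedName, myApp); err != nil {
    		return ctrl.Result{}, client.IgnoreNotFound(err)
    	}

    	// 2. Look up the Deployment that should back this MyApp.
    	deployment := &appsv1.Deployment{}
    	err := r.Get(ctx, req.NamespacedName, deployment)
    	if apierrors.IsNotFound(err) {
    		// 3a. It does not exist yet: create it.
    		return ctrl.Result{}, r.createDeployment(ctx, myApp)
    	}
    	if err != nil {
    		// Any other error: return it so the request is retried.
    		return ctrl.Result{}, err
    	}

    	// 3b. It exists: bring it in line with the MyApp spec.
    	return ctrl.Result{}, r.updateDeployment(ctx, myApp, deployment)
    }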
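12. Summary

kube-controller-manager is the core of Kubernetes' self-healing ability:

- Controller pattern: declarative API + reconcile loop = automated operations
- Deployment: the standard way to manage stateless applications, with rolling updates and rollbacks
- StatefulSet: the solution for stateful applications, providing stable identity and storage
- DaemonSet: node-level agent deployment
- HPA: metric-driven automatic scaling
- GC: automatic cleanup of orphaned resources, preventing resource leaks

Understanding the controller pattern is the foundation for building Kubernetes Operators.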