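Pod Deep Dive: The Smallest Schedulable Unit in Kubernetes
1. The Design Philosophy of Pods
The Pod, not the individual container, is the smallest schedulable unit in Kubernetes. The design draws on Google's experience with Borg: tightly cooperating containers are grouped together and share networking and storage, as if they were running on the same "logical host".
Why not schedule containers directly?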
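    Problem:  two containers need to talk over localhost and share files
    Solution: put them in the same Pod

    Shared by the containers in a Pod:
    ├── Network namespace (one IP; containers reach each other via localhost)
    ├── IPC namespace (shared memory is possible)
    └── Volumes (the same Volume can be mounted in several containers)

    Isolated between the containers in a Pod:
    ├── PID namespace (isolated by default; sharing can be enabled)
    ├── Filesystem (each container has its own; share data through Volumes)
    └── User identity (each container can run as its own UID/GID)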
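The sharing described above can be sketched as a small manifest. This is a minimal illustration, not a production setup: the Pod name, the `nginx:1.25` and `busybox:1.36` images, and the mount paths are all assumptions chosen for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-namespaces-demo     # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25            # assumed image; serves files on port 80
      volumeMounts:
        - name: shared
          mountPath: /usr/share/nginx/html
    - name: writer
      image: busybox:1.36          # assumed image
      # Writes a page into the shared volume, then fetches it from the
      # web container over localhost (same network namespace).
      command:
        - sh
        - -c
        - |
          echo "hello from writer" > /work/index.html
          while true; do wget -qO- http://localhost:80/; sleep 10; done
      volumeMounts:
        - name: shared
          mountPath: /work
  volumes:
    - name: shared
      emptyDir: {}
```

The two containers never reference each other by name or IP: the file travels through the shared `emptyDir`, and the HTTP request travels over the Pod's single loopback interface.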
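2. The Pause Container (Infra Container)
Every Pod has a hidden pause container that serves as the Pod's "skeleton":

    Pod
    ├── pause container (infra)
    │   ├── holds the network namespace (eth0, lo)
    │   ├── holds the IPC namespace
    │   └── lives exactly as long as the Pod
    │
    ├── application container A
    │   └── joins the pause container's network/IPC namespaces
    │
    └── application container B
        └── joins the pause container's network/IPC namespaces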
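What the pause container does:
- Acts as the "parent container" of every container in the Pod
- Holds the network namespace, so restarting an application container does not disturb the Pod's network
- Prevents zombie processes: when the PID namespace is shared, pause runs as PID 1 and reaps orphaned processes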
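The zombie-reaping role only comes into play when the containers share a PID namespace, which is off by default. A minimal sketch of turning it on follows; the Pod name and both images are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-demo            # hypothetical name
spec:
  shareProcessNamespace: true      # all containers join one PID namespace
  containers:
    - name: app
      image: nginx:1.25            # assumed image
    - name: debug
      image: busybox:1.36          # assumed image
      command: ["sh", "-c", "sleep 3600"]
```

With this setting, running `ps` inside the `debug` container should list the pause process as PID 1 alongside the nginx processes, since every container now sees the same process tree.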
3. The Full Pod Spec

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-pod
      namespace: default
      labels:
        app: my-app
        version: v1
      annotations:
        description: "Example Pod"
    spec:
      # --- Scheduling ---
      nodeName: node-1                 # pins the Pod to a node, bypassing the scheduler
      nodeSelector:
        disk: ssd
      schedulerName: default-scheduler
      priorityClassName: high-priority

      # --- Pod-level security context ---
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
        runAsNonRoot: true

      # --- Service account ---
      serviceAccountName: my-sa
      automountServiceAccountToken: false

      # --- DNS ---
      dnsPolicy: ClusterFirst
      dnsConfig:
        options:
          - name: ndots
            value: "2"

      # --- Host namespaces ---
      hostNetwork: false
      hostPID: false
      hostIPC: false

      # --- Graceful shutdown window ---
      terminationGracePeriodSeconds: 30

      # --- Init containers ---
      initContainers:
        - name: init-db
          image: busybox
          command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 1; done']

      # --- Application containers ---
      containers:
        - name: app
          image: my-app:v1
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP

          # Environment variables
          env:
            - name: DB_HOST
              value: "db-service"
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          envFrom:
            - configMapRef:
                name: app-config
            - secretRef:
                name: app-secret

          # Resources
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"

          # Volume mounts
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
            - name: data
              mountPath: /data

          # Probes
          startupProbe:                # allows up to 30 x 10 s = 300 s for startup
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5

          # Lifecycle hooks
          lifecycle:
            postStart:
              exec:
                # Needs a writable /tmp; with readOnlyRootFilesystem: true,
                # mount a volume (for example an emptyDir) at that path.
                command: ["/bin/sh", "-c", "echo started > /tmp/started"]
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5 && nginx -s quit"]

          # Container-level security context
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
              add:
                - NET_BIND_SERVICE

      # --- Volumes ---
      volumes:
        - name: config
          configMap:
            name: app-config
        - name: data
          emptyDir: {}

      imagePullSecrets:
        - name: registry-secret
      restartPolicy: Always
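4. Pod Lifecycle State Machine

                   ┌─────────┐
                   │ Pending │  waiting for scheduling / pulling images / running init containers
                   └────┬────┘
                        │ all containers started
                        ▼
                   ┌─────────┐
                   │ Running │  at least one container is running
                   └────┬────┘
            ┌───────────┼───────────┐
            ▼           ▼           ▼
       ┌─────────┐ ┌─────────┐ ┌─────────┐
       │Succeeded│ │ Failed  │ │ Unknown │
       └─────────┘ └─────────┘ └─────────┘

    Succeeded: all containers exited successfully and will not be restarted
    Failed:    all containers have terminated and at least one exited with a failure
    Unknown:   the Pod's state cannot be obtained, typically because the node is unreachable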
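The terminal phases are easiest to observe with a Pod that is not restarted. The manifest below is a small sketch for experimenting with the state machine; the Pod name and the `busybox:1.36` image are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: phase-demo                 # hypothetical name
spec:
  restartPolicy: Never             # the default (Always) would restart the container instead
  containers:
    - name: task
      image: busybox:1.36          # assumed image
      # Exit code 0 ends in Succeeded; change to "exit 1" to end in Failed.
      command: ["sh", "-c", "echo working; sleep 5; exit 0"]
```

Watching the Pod with `kubectl get pod phase-demo -w` should show it move through Pending and Running before settling in its terminal phase.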
Pod Phase vs. Container State

    # Pod-level summary (STATUS column)
    kubectl get pod my-pod

    # Per-container state (Waiting / Running / Terminated)
    kubectl describe pod my-pod | grep -A 3 "State:"
5. Init Containers
Init containers run one after another before the application containers start; the application containers start only after every init container has succeeded:

    initContainers:
      # 1. Wait until the database accepts connections
      - name: wait-for-db
        image: busybox
        command: ['sh', '-c', 'until nc -z db-service 5432; do echo waiting; sleep 2; done']

      # 2. Run database migrations
      - name: db-migrate
        image: my-app:v1
        command: ['./migrate', '--up']
        env:
          - name: DB_URL
            valueFrom:
              secretKeyRef:
                name: db-secret
                key: url

      # 3. Download configuration into a shared volume
      - name: download-config
        image: alpine/curl
        command: ['sh', '-c', 'curl -o /config/app.yaml https://config-server/app.yaml']
        volumeMounts:
          - name: config
            mountPath: /config
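Init Containers vs. Sidecar Containers

| Property | Init container | Sidecar container |
| --- | --- | --- |
| When it runs | Before the application containers start | Alongside the application containers |
| Execution order | Sequential | Parallel |
| Completion condition | Must exit successfully | Keeps running |
| Typical uses | Initialization, waiting for dependencies | Logging, proxying, monitoring |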
6. The Sidecar Pattern
A sidecar is a helper container that works alongside the main container:

    containers:
      # Main application: writes logs to the shared volume
      - name: app
        image: my-app:v1
        volumeMounts:
          - name: logs
            mountPath: /var/log/app

      # Sidecar: collects logs from the shared volume
      - name: log-collector
        image: fluentd:v1
        volumeMounts:
          - name: logs
            mountPath: /var/log/app
            readOnly: true

      # Sidecar: network proxy
      - name: envoy
        image: envoyproxy/envoy:v1.28
        ports:
          - containerPort: 15001

    volumes:
      - name: logs
        emptyDir: {}
Native Sidecars (Kubernetes 1.29+)

    initContainers:
      - name: envoy-proxy
        image: envoyproxy/envoy:v1.28
        restartPolicy: Always        # marks this init container as a sidecar
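To show where the fragment above sits in a complete manifest, here is a minimal sketch that pairs the native sidecar with a main container. The Pod name is hypothetical; the images are the ones already used in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo        # hypothetical name
spec:
  restartPolicy: Never             # Pod-level policy, applies to the main container
  initContainers:
    - name: envoy-proxy
      image: envoyproxy/envoy:v1.28
      restartPolicy: Always        # container-level field; makes this a sidecar
  containers:
    - name: app
      image: my-app:v1
```

Compared with a sidecar declared under `containers`, this form starts the proxy before `app`, keeps it running for the whole life of the Pod, and stops it only after the main container has exited, so a finished `app` is not held open by a still-running proxy.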
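7. Resource Management in Depth
7.1 CPU

    1 CPU   = 1000m (millicores)
    0.5 CPU = 500m

    CPU requests: basis for scheduling; guarantee a minimum share of CPU time
    CPU limits:   enforced as throttling through the cgroup setting cpu.cfs_quota_us

    # Inspect a container's CPU quota on the node (cgroup v1 layout;
    # the exact path depends on the cgroup driver and the Pod's QoS class)
    cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/<container-id>/cpu.cfs_quota_us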
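As a worked example of how CPU values translate into cgroup settings, the fragment below annotates a request and a limit with the numbers they produce. It assumes cgroup v1 and the default CFS period of 100000 µs; the specific values 250m and 500m are chosen only for illustration.

```yaml
resources:
  requests:
    cpu: "250m"    # scheduling input; cpu.shares = 250 * 1024 / 1000 = 256
  limits:
    cpu: "500m"    # cpu.cfs_quota_us = 500 * 100000 / 1000 = 50000
                   # i.e. at most 50 ms of CPU time per 100 ms period
```

The request is a relative weight that only matters under contention, while the limit is a hard ceiling: once the container has used its 50 ms in a period it is throttled until the next period begins, even if the node is otherwise idle.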
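7.2 Memory

    Memory requests: basis for scheduling
    Memory limits:   enforced through the cgroup setting memory.limit_in_bytes
    Exceeding the limit → OOM kill (the container is killed, then restarted according to restartPolicy)

    # Look for OOM kills in the node's kernel log and in the Pod's status
    dmesg | grep -i "oom"
    kubectl describe pod <pod> | grep -i "oom\|killed"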
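One way to see the OOM behaviour first-hand is a Pod that deliberately allocates more memory than its limit. The sketch below is modelled on the pattern used in the Kubernetes task documentation; the Pod and container names are hypothetical, and the `polinux/stress` image and its flags are assumptions about a third-party tool.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo                   # hypothetical name
spec:
  containers:
    - name: hog
      image: polinux/stress        # assumed image providing the `stress` tool
      resources:
        requests:
          memory: "50Mi"
        limits:
          memory: "100Mi"
      # Tries to allocate 250M, well above the 100Mi limit.
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
```

After a few seconds `kubectl describe pod oom-demo` should report the last state of the container as terminated with reason `OOMKilled`, and the restart count should climb because the default restart policy is `Always`.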
7.3 Extended Resources (GPUs and More)

    resources:
      limits:
        nvidia.com/gpu: 1
        hugepages-2Mi: 100Mi
8. Volume Types

    volumes:
      # emptyDir: scratch space, here backed by memory
      - name: tmp
        emptyDir:
          medium: Memory
          sizeLimit: 100Mi

      # hostPath: a directory on the node
      - name: host-data
        hostPath:
          path: /data
          type: DirectoryOrCreate

      # configMap: configuration files
      - name: config
        configMap:
          name: app-config
          defaultMode: 0644

      # secret: sensitive files
      - name: certs
        secret:
          secretName: tls-certs
          defaultMode: 0400

      # persistentVolumeClaim: persistent storage
      - name: data
        persistentVolumeClaim:
          claimName: my-pvc

      # projected: several sources combined into one directory
      - name: combined
        projected:
          sources:
            - configMap:
                name: app-config
            - secret:
                name: app-secret
            - serviceAccountToken:
                path: token
                expirationSeconds: 3600
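Declaring a volume does nothing until a container mounts it. The fragment below sketches how a container might consume three of the volumes defined above; the mount paths are hypothetical and would be chosen to match what the application expects.

```yaml
containers:
  - name: app
    image: my-app:v1
    volumeMounts:
      - name: tmp                  # memory-backed scratch space
        mountPath: /tmp
      - name: certs                # Secret keys appear as files with mode 0400
        mountPath: /etc/tls        # hypothetical path
        readOnly: true
      - name: combined             # ConfigMap keys, Secret keys, and `token` in one directory
        mountPath: /var/run/app    # hypothetical path
        readOnly: true
```

Each `volumeMounts[].name` must match a `volumes[].name` in the same Pod spec, otherwise the API server rejects the Pod.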
9. Pod Security
9.1 Pod Security Standards

    apiVersion: v1
    kind: Namespace
    metadata:
      name: production
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/warn: restricted
        pod-security.kubernetes.io/audit: restricted

The three security levels:
- privileged: unrestricted
- baseline: prevents known privilege escalations
- restricted: the strictest level, following Pod hardening best practices
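The three modes do not have to use the same level, which allows a gradual rollout. The sketch below enforces `baseline` while only warning and auditing against `restricted`; the namespace name and the pinned version are assumptions for illustration.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                                          # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline         # reject Pods that violate baseline
    pod-security.kubernetes.io/enforce-version: v1.29    # assumed version pin
    pod-security.kubernetes.io/warn: restricted          # warn the client about restricted violations
    pod-security.kubernetes.io/audit: restricted         # record restricted violations in the audit log
```

Once the warnings and audit entries show that workloads already comply, `enforce` can be raised to `restricted` without surprising anyone.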
9.2 Security Context Best Practices

    spec:
      securityContext:                 # Pod level
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          securityContext:             # container level
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
10. Downward API
The Downward API injects information about the Pod itself into its containers:

    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
      - name: CPU_REQUEST
        valueFrom:
          resourceFieldRef:
            containerName: app
            resource: requests.cpu
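Environment variables are not the only delivery mechanism: the same information can be exposed as files through a `downwardAPI` volume, which is also the way to obtain the full set of labels or annotations. The manifest below is a minimal sketch; the Pod name, image, and mount path are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: downward-volume-demo       # hypothetical name
  labels:
    app: my-app
  annotations:
    description: "Example Pod"
spec:
  containers:
    - name: app
      image: busybox:1.36          # assumed image
      command: ["sh", "-c", "cat /etc/podinfo/labels; sleep 3600"]
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo  # hypothetical path
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: labels           # file containing all labels, one key="value" per line
            fieldRef:
              fieldPath: metadata.labels
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
```

Unlike environment variables, which are fixed when the container starts, these files are refreshed when the labels or annotations change.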
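11. Troubleshooting Common Problems

    # Show recent events for the Pod (scheduling, image pulls, probe failures)
    kubectl describe pod <pod> | grep -A 10 Events

    # Read the logs of the previous container instance, then the full Pod details
    kubectl logs <pod> --previous
    kubectl describe pod <pod>

    # Force-delete a Pod (skips graceful shutdown; use with care)
    kubectl delete pod <pod> --force --grace-period=0

    # Open a shell in a container, or attach an ephemeral debug container
    kubectl exec -it <pod> -- bash
    kubectl debug -it <pod> --image=busybox --target=app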
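12. Summary
The Pod is the core abstraction in Kubernetes:
- Shared namespaces: the pause container holds the network and IPC namespaces
- Lifecycle state machine: Pending → Running → Succeeded/Failed
- Init containers: sequential initialization that ensures dependencies are ready
- Sidecar pattern: the standard pattern for logging, proxying, and monitoring
- Resource management: requests drive scheduling, limits enforce isolation
- Security context: least privilege, with privileged access denied by default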