K3s Upgrade Retrospective: Achieving HA and Near-Zero Application Downtime by Leveraging AI for Planning
Author: Geonhyuk Im (@GeonHyuk)
K3s Upgrade with GitOps + SUC: Zero Downtime (except Postgres database switchover)
TL;DR
- K3s cluster version upgrade: `v1.33.5+k3s1` -> `v1.34.4+k3s1`
- By leveraging Codex, the whole upgrade and testing process took about three hours, including pre-upgrade prep, execution, and post-upgrade validation in a five-node cluster.
- No downtime signal was detected for `dev`/`prod` applications during the K3s node upgrade waves.
- A brief, expected interruption occurred only during CNPG primary switchover (~11s when failing over by deleting the primary pod, <4s with a manual `kubectl cnpg promote`).
- This DB interruption comes from PostgreSQL primary role transition behavior, not from the K3s upgrade mechanics themselves.
- I leveraged Codex for upgrade-phase planning and current-cluster diagnostics, then personally reviewed the generated plan, made the HA decisions, and executed/validated the upgrade.
Table of Contents
- Why This Process Was Designed This Way
- Upgrade Plan Details
- Preflight Work to Ensure HA During Upgrade
- Execution Timeline (UTC, 2026-03-02)
- Final Result
- What I’d Keep for Future Upgrades and What Could Be Improved
- Main Prompts I Used & Raw Evidence Files
Why This Process Was Designed This Way
This looked like a routine Kubernetes version upgrade, but it became a good test of whether the in-place upgrade process was actually reliable under stateful workloads. I leveraged AI for phase planning and current-state diagnostics, while final HA decisions and execution were done by me using my own operational judgment.
My homelab cluster runs:
- 1 control-plane node + 4 worker nodes
- ArgoCD-managed applications (`dev`, `prod`) and an observability stack including Prometheus, Grafana, and Loki
- CloudNativePG (CNPG) clusters
- `local-path` storage, which makes node-level behavior important during drains and failovers
The goal was clear: upgrade every node while avoiding user-visible production downtime.
This was my first in-place upgrade for a Kubernetes cluster.
Most of my production upgrade experience was blue-green on EKS, where cutover and rollback mechanics are different from in-place node waves. But since this was my on-premises homelab cluster, blue-green was not an option.
I could have done a more manual process by SSHing into each node and upgrading K3s directly, but that would have been more error-prone and less auditable. Because of that, I designed this upgrade around Rancher System Upgrade Controller (SUC), managed through GitOps, with strict guardrails.
Key decisions:
- Pinned version in Git: `v1.34.4+k3s1`, to avoid automatic upgrades when newer versions are released
- `concurrency: 1` and `cordon: true` for both `server-plan` and `agent-plan`, so that only one node is upgraded at a time and it is cordoned during the process
- Label gate for control-plane: `upgrade-server=active`
- Label gate for workers: `upgrade-wave=active`
This gave me controlled rollout waves, explicit checkpoints, and a clear audit trail.
Upgrade Plan Details
igh9410-infra/infrastructure/system-upgrade-controller/plans.yaml:
```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  tolerations:
    # Required for clusters where control-plane nodes are tainted NoSchedule.
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    # Backward compatibility for clusters still using the master taint.
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
      # Manually trigger the control-plane upgrade by labeling the node:
      #   kubectl label node <control-plane-node> upgrade-server=active --overwrite
      - key: upgrade-server
        operator: In
        values:
          - active
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.34.4+k3s1
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      # Upgrade exactly one worker at a time by labeling a single node:
      #   kubectl label node <node> upgrade-wave=active --overwrite
      - key: upgrade-wave
        operator: In
        values:
          - active
  prepare:
    image: rancher/k3s-upgrade
    args:
      - prepare
      - server-plan
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.34.4+k3s1
```
Brief Summary:
- Scope: SUC-based upgrade on `controlplane` + `node01..node04`, pinned to `v1.34.4+k3s1`.
- Plans are intentionally label-gated (not fully automatic):
  - `server-plan`: `concurrency: 1`, `cordon: true`, control-plane taint tolerations, trigger label `upgrade-server=active`
  - `agent-plan`: `concurrency: 1`, `cordon: true`, `prepare: server-plan`, trigger label `upgrade-wave=active`
- Rollout flow:
  - Sync ArgoCD `system-upgrade-controller` and verify plans/jobs.
  - Label the control-plane node (`upgrade-server=active`), monitor completion, remove the label.
  - Upgrade workers one by one in order: `node02` -> `node04` -> `node01` -> `node03`.
  - For each worker: add the label, monitor, then remove the label.
- Health checks after each wave: app pods, CNPG cluster readiness/primary state, and prod sync replication state.
- Downtime verification approach: run continuous app probes during all waves and confirm no failed checks.
- Completion criteria: all nodes on `v1.34.4+k3s1`, `Ready`, workloads healthy, CNPG clusters fully ready.
- Detailed measured impact: k3s-failover-impact-report.md
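The continuous-probe approach can be sketched as a simple loop; the endpoint URL and one-second cadence here are illustrative, not the exact probe I ran:

```shell
# Hedged sketch: poll an app endpoint once per second and log the HTTP status,
# so any failed check during an upgrade wave shows up with a UTC timestamp.
URL="https://example.com/healthz"   # illustrative endpoint
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$URL" || echo "ERR")
  echo "$(date -u +%FT%TZ) $code"
  sleep 1
done
```

Running one such loop per app during every wave gives a timestamped record that can be correlated against SUC job start/end times afterwards.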
Quick Command References:
```shell
kubectl -n system-upgrade get plans -o wide
kubectl label node controlplane upgrade-server=active --overwrite
kubectl label node controlplane upgrade-server-
kubectl label node <worker-node> upgrade-wave=active --overwrite
kubectl label node <worker-node> upgrade-wave-
```
Preflight Work to Ensure HA During Upgrade
Before upgrading worker nodes, I focused on availability posture:
- Added PDBs (`minAvailable: 1`) for my applications in the `dev` and `prod` namespaces
- Increased `dev` app replicas to 2
- Added anti-affinity/topology spread by hostname
- Fixed the CNPG topology key to `kubernetes.io/hostname`
- Aligned the WAL storage class and production sync replication settings
- Ensured all database pods were running on different nodes for better failover behavior
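For illustration, the PDB plus hostname spread for one app could look like the following; the app name, labels, and namespace are hypothetical, not the exact manifests in the repo:

```yaml
# Hypothetical PDB: keep at least one pod serving during node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gramnuri-api-pdb   # illustrative name
  namespace: prod
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gramnuri-api
---
# Hypothetical pod-template fragment: spread replicas across nodes by hostname,
# so a single cordon/drain cannot take out both replicas at once.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: gramnuri-api
```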
Initially, one of my CNPG replicas was placed on the same node, so I deleted that replica and let CNPG recreate it on a different node to improve HA placement. This matters more with local-path storage: the underlying PV is node-local and tied to the original node via volume node affinity.
With local-path, CNPG (even though it uses CRDs instead of StatefulSets) still depends on node-local PVC/PV data, so a pod recreated on another node cannot mount the old volume.
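This node pinning is visible on the PV itself. A simplified sketch of what the local-path provisioner emits (names and paths illustrative):

```yaml
# Simplified local-path PV: nodeAffinity ties the volume to the node that
# created it, so a pod scheduled on another node cannot mount this PV.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example   # illustrative
spec:
  accessModes: [ReadWriteOnce]
  capacity:
    storage: 10Gi
  hostPath:
    path: /opt/local-path-provisioner/pvc-example   # node-local directory
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node02   # the node where the data physically lives
```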
Execution Timeline (UTC, 2026-03-02)
1) Control-plane wave
I triggered the control-plane plan first via node label.
What went wrong:
- `apply-server-plan` stayed `Pending` because the plan did not have tolerations for the control-plane taints.

How I fixed it:
- Added/verified toleration: `node-role.kubernetes.io/control-plane`
- Added/verified toleration: `node-role.kubernetes.io/master`
- Re-synced and resumed the wave

Result:
- Control-plane upgraded and came back `Ready`.
Reference:
- igh9410-infra/docs/retrospective/commands-01-master-wave-troubleshooting.md
2) Worker waves: node02 -> node04 -> node01
For each worker:
- Add `upgrade-wave=active`
- Watch SUC jobs/pods and node readiness/version
- Remove the label
- Re-check app and DB health before the next wave
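That per-wave routine can be sketched as commands (using `node02` as the example; the exact health checks varied, and `clusters` assumes the CNPG CRD short name is available):

```shell
# One worker wave. Label, watch, unlabel, verify.
kubectl label node node02 upgrade-wave=active --overwrite
kubectl -n system-upgrade get jobs,pods -w   # wait for the apply job to complete
kubectl get node node02 -o wide              # confirm Ready and v1.34.4+k3s1
kubectl label node node02 upgrade-wave-      # close the gate before the next wave
kubectl get pods -n prod                     # app health
kubectl get clusters -n prod                 # CNPG readiness/primary state
```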
All three workers upgraded successfully to v1.34.4+k3s1.
Reference:
- igh9410-infra/docs/retrospective/commands-02-worker-node02-wave.md
- igh9410-infra/docs/retrospective/commands-04-worker-node04-wave.md
- igh9410-infra/docs/retrospective/commands-05-worker-node01-wave.md
3) CNPG primary handover drills before node03
Before the final worker wave, I intentionally triggered DB primary handovers to measure risk in advance.
dev-gramnuri-db-cluster
- Full failover phase: ~`3m27s`
- Primary promotion window: ~`11.4s`
- App impact: `dev/gramnuri-api` showed 2x `500` (about a `1.4s` burst)

prod-artskorner-db-cluster
- Full failover phase: ~`16.2s`
- Promotion window: ~`3.7s`
- App impact: no observed `500` in the measured window

prod-gramnuri-db-cluster
- Full failover phase: ~`3m13s` (longer due to old primary termination delay)
- Promotion window: ~`3.6s`
- App impact: no observed `500` in the measured window
- POST traffic continued (`200` and one `429`), no DB-error signature
- Additional planned switchover to `node04` (2026-03-02T12:13:45Z)
- Primary election window: ~`3.8s` (currentPrimaryTimestamp=2026-03-02T12:13:48.766Z)
- Full switchover to healthy: ~`29s` (`Cluster in healthy state` at 2026-03-02T12:14:14Z)
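The lower-impact planned-switchover path measured above uses the CNPG kubectl plugin; the cluster and instance names below are illustrative:

```shell
# Inspect the cluster to identify the current primary and a healthy replica.
kubectl cnpg status prod-gramnuri-db-cluster -n prod
# Promote a specific replica; CNPG demotes the old primary gracefully instead
# of waiting for failure detection, which is why the window is shorter.
kubectl cnpg promote prod-gramnuri-db-cluster prod-gramnuri-db-cluster-2 -n prod
# Confirm the role transition and that the cluster returns to healthy.
kubectl cnpg status prod-gramnuri-db-cluster -n prod
```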
Reference:
- igh9410-infra/docs/retrospective/commands-06-dev-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-07-prod-artskorner-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-08-prod-gramnuri-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-10-prod-gramnuri-node04-switchover.md
- k3s-failover-impact-report.md
4) Final worker wave: node03
After handover validation:
- SUC job start: `2026-03-02T10:37:25Z`
- SUC job end: `2026-03-02T10:38:29Z`
- Duration: ~`64s`

Observed during this wave (prod-gramnuri focus):
- No `500` signal from `prod/gramnuri-api` or `prod/gramnuri-web`
- `prod-gramnuri-db-cluster` primary remained on `node01`
- No new CNPG failover/recovery event in that window
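Checks along these lines can confirm such a quiet window; the names are illustrative, and the status fields assume the CNPG Cluster CRD schema:

```shell
# Recent events for the DB cluster around the wave window.
kubectl -n prod get events --field-selector involvedObject.name=prod-gramnuri-db-cluster \
  --sort-by=.lastTimestamp | tail -n 20
# Current primary and overall phase straight from the Cluster status.
kubectl -n prod get cluster prod-gramnuri-db-cluster \
  -o jsonpath='{.status.currentPrimary}{"  "}{.status.phase}{"\n"}'
```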
Reference:
- igh9410-infra/docs/retrospective/commands-09-worker-node03-wave.md
Final Result
All nodes ended on `v1.34.4+k3s1`:
- `controlplane`
- `node01`
- `node02`
- `node03`
- `node04`

Observed downtime outcome:
- No app-visible downtime signal was observed in `dev`/`prod` during the K3s node upgrade waves.
- A brief interruption was observed only in `dev/gramnuri-api` during a DB primary switchover drill.
- This switchover window is expected PostgreSQL/CNPG role-transition behavior, not a K3s upgrade-side outage.
One caveat:
- This conclusion is based on observed logs/events and probe traffic. Strong evidence, but still traffic-dependent.
What I’d Keep for Future Upgrades and What Could Be Improved
Things I’d Keep
- Keep upgrades label-gated and wave-based
- Keep versions pinned in Git
- Keep DB handover drills before high-risk node waves
- Keep between-wave health checkpoints mandatory
- Keep command-level evidence logs for every maintenance event
Things I Could Improve
- Postgres primary switchover could be handled with a manual `kubectl cnpg promote` command instead of deleting the primary pod
- Application-level database connection settings, such as retry and connection-pool settings, could be tuned further
- To simulate Postgres failover impact more accurately, I would create a test API endpoint that executes a write query against the database, and run load tests on that endpoint during the DB failover window
Main Prompts I Used & Raw Evidence Files
Main Prompts I Used:
- I'm planning to perform on-premise k3s cluster upgrade from v1.33 to v.134 now. There's some Go backend apis and Postgres dbs running using the CloudNativePG. Since this is my homelab, the interruption is acceptable but I want to do it without downtime to simulate the real-world production environment. Help me plan the K8S cluster upgrade. This cluster has single master node, four worker nodes. You can use the kubectl or whatever tools necessary. The context is default. And I want to document the upgrade steps step by step like in Markdown file
- The Applications are deployed using ArgoCD.Inspect the apps directory, include PDB or increase replicas and other necessary stuff using GitOps principle. I will do the sync when it's pushed to main branch
- I don't want to do the manual upgrade but Rancher's System Upgrade Controller. I will install it via ArgoCD. kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml Do not apply that command but get that crd.yaml and save it into argocd/infra-apps Application and save that crd.yaml and system-upgrade-controller.yaml in infrastructure/system-upgrade-controller, make sure that ArgoCD Application for those resources are created under argocd/infra-apps directory
Raw Evidence Files:
- igh9410-infra/docs/retrospective/commands-00-baseline-and-downtime-verification.md
- igh9410-infra/docs/retrospective/commands-01-master-wave-troubleshooting.md
- igh9410-infra/docs/retrospective/commands-02-worker-node02-wave.md
- igh9410-infra/docs/retrospective/commands-03-gitops-repo-and-docs.md
- igh9410-infra/docs/retrospective/commands-04-worker-node04-wave.md
- igh9410-infra/docs/retrospective/commands-05-worker-node01-wave.md
- igh9410-infra/docs/retrospective/commands-06-dev-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-07-prod-artskorner-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-08-prod-gramnuri-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-10-prod-gramnuri-node04-switchover.md
- igh9410-infra/docs/retrospective/commands-09-worker-node03-wave.md
- igh9410-infra/docs/k3s-failover-impact-report.md