K3s Upgrade Retrospective: Achieving HA and Near-Zero Application Downtime by Leveraging AI for Planning
Author: Geonhyuk Im (@GeonHyuk)
K3s Upgrade with GitOps + SUC: Zero Downtime (except Postgres database switchover)
TL;DR
- K3s cluster version upgrade: `v1.33.5+k3s1` -> `v1.34.4+k3s1`
- By leveraging Codex, the whole upgrade and testing process took about three hours, including pre-upgrade prep, execution, and post-upgrade validation in a five-node cluster.
- No downtime signal was detected for `dev`/`prod` applications during the K3s node upgrade waves.
- A brief, expected interruption occurred only during CNPG primary switchover (~11s when failing over by deleting the primary pod, <4s with a manual `kubectl cnpg promote`).
- This DB interruption comes from PostgreSQL primary role transition behavior, not from the K3s upgrade mechanics themselves.
- I leveraged Codex for upgrade-phase planning and current-cluster diagnostics, then personally reviewed the generated plan, made the HA decisions, and executed/validated the upgrade.
Table of Contents
- Why This Process Was Designed This Way
- Upgrade Plan Details
- Preflight Work to Ensure HA During Upgrade
- Execution Timeline (UTC, 2026-03-02)
- Final Result
- What I’d Keep for Future Upgrades and What Could Be Improved
- Main Prompts I Used & Raw Evidence Files
Why This Process Was Designed This Way
This looked like a routine Kubernetes version upgrade, but it became a good test of whether the in-place upgrade process was actually reliable under stateful workloads. I leveraged AI for phase planning and current-state diagnostics, while final HA decisions and execution were done by me using my own operational judgment.
My homelab cluster runs:
- 1 control-plane node + 4 worker nodes
- ArgoCD-managed applications (`dev`, `prod`) and an observability stack including Prometheus, Grafana, and Loki
- CloudNativePG (CNPG) clusters
- `local-path` storage, which makes node-level behavior important during drains and failovers
The goal was clear: upgrade every node while avoiding user-visible production downtime.
This was my first in-place upgrade for a Kubernetes cluster.
Most of my production upgrade experience was blue-green on EKS, where cutover and rollback mechanics are different from in-place node waves. But since this was my on-premises homelab cluster, blue-green was not an option.
I could have done a more manual process by SSHing into each node and upgrading K3s directly, but that would have been more error-prone and less auditable. Because of that, I designed this upgrade around Rancher System Upgrade Controller (SUC), managed through GitOps, with strict guardrails.
Key decisions:
- Pinned version in Git: `v1.34.4+k3s1`, to avoid automatic upgrades when newer versions are released
- `concurrency: 1` and `cordon: true` for both `server-plan` and `agent-plan`, so that only one node is upgraded at a time and it is cordoned during the process
- Label gate for control-plane: `upgrade-server=active`
- Label gate for workers: `upgrade-wave=active`
This gave me controlled rollout waves, explicit checkpoints, and a clear audit trail.
Upgrade Plan Details
igh9410-infra/infrastructure/system-upgrade-controller/plans.yaml:
```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  tolerations:
    # Required for clusters where control-plane nodes are tainted NoSchedule.
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    # Backward compatibility for clusters still using the master taint.
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
      # Manually trigger the control-plane upgrade by labeling the node:
      #   kubectl label node <control-plane-node> upgrade-server=active --overwrite
      - key: upgrade-server
        operator: In
        values:
          - active
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.34.4+k3s1
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
      # Upgrade exactly one worker at a time by labeling a single node:
      #   kubectl label node <node> upgrade-wave=active --overwrite
      - key: upgrade-wave
        operator: In
        values:
          - active
  prepare:
    image: rancher/k3s-upgrade
    args:
      - prepare
      - server-plan
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.34.4+k3s1
```
Brief Summary:
- Scope: SUC-based upgrade on `controlplane` + `node01..node04`, pinned to `v1.34.4+k3s1`.
- Plans are intentionally label-gated (not fully automatic):
  - `server-plan`: `concurrency: 1`, `cordon: true`, control-plane taint tolerations, trigger label `upgrade-server=active`
  - `agent-plan`: `concurrency: 1`, `cordon: true`, `prepare: server-plan`, trigger label `upgrade-wave=active`
- Rollout flow:
  - Sync ArgoCD `system-upgrade-controller` and verify plans/jobs.
  - Label the control-plane node (`upgrade-server=active`), monitor completion, remove the label.
  - Upgrade workers one by one in order: `node02` -> `node04` -> `node01` -> `node03`.
  - For each worker: add the label, monitor, then remove the label.
- Health checks after each wave: app pods, CNPG cluster readiness/primary state, and prod sync replication state.
- Downtime verification approach: run continuous app probes during all waves and confirm no failed checks.
- Completion criteria: all nodes on `v1.34.4+k3s1`, `Ready`, workloads healthy, CNPG clusters fully ready.
- Detailed measured impact: k3s-failover-impact-report.md
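The continuous-probe approach can be sketched as a simple loop; the endpoint URL and one-second cadence here are illustrative, not the exact probe I ran:

```shell
# Hedged sketch: poll an app endpoint once per second and log the HTTP status,
# so any failed check during an upgrade wave shows up with a UTC timestamp.
URL="https://example.com/healthz"   # illustrative endpoint
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$URL" || echo "ERR")
  echo "$(date -u +%FT%TZ) $code"
  sleep 1
done
```

Running one such loop per app during every wave gives a timestamped record that can be correlated against SUC job start/end times afterwards.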
Quick Command References:
```shell
kubectl -n system-upgrade get plans -o wide
kubectl label node controlplane upgrade-server=active --overwrite
kubectl label node controlplane upgrade-server-
kubectl label node <worker-node> upgrade-wave=active --overwrite
kubectl label node <worker-node> upgrade-wave-
```
Preflight Work to Ensure HA During Upgrade
Before upgrading worker nodes, I focused on availability posture:
- Added PDBs (`minAvailable: 1`) for my applications in the `dev` and `prod` namespaces
- Increased `dev` app replicas to 2
- Added anti-affinity/topology spread by hostname
- Fixed the CNPG topology key to `kubernetes.io/hostname`
- Aligned the WAL storage class and production sync replication settings
- Ensured all database pods were running on different nodes for better failover behavior
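For illustration, the PDB plus hostname spread for one app could look like the following; the app name, labels, and namespace are hypothetical, not the exact manifests in the repo:

```yaml
# Hypothetical PDB: keep at least one pod serving during node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gramnuri-api-pdb   # illustrative name
  namespace: prod
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gramnuri-api
---
# Hypothetical pod-template fragment: spread replicas across nodes by hostname,
# so a single cordon/drain cannot take out both replicas at once.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: gramnuri-api
```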
Initially, one of my CNPG replicas was placed on the same node, so I deleted that replica and let CNPG recreate it on a different node to improve HA placement. This matters more with local-path storage: the underlying PV is node-local and tied to the original node via volume node affinity.
With local-path, CNPG (even though it uses CRDs instead of StatefulSets) still depends on node-local PVC/PV data, so a pod recreated on another node cannot mount the old volume.
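This node pinning is visible on the PV itself. A simplified sketch of what the local-path provisioner emits (names and paths illustrative):

```yaml
# Simplified local-path PV: nodeAffinity ties the volume to the node that
# created it, so a pod scheduled on another node cannot mount this PV.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example   # illustrative
spec:
  accessModes: [ReadWriteOnce]
  capacity:
    storage: 10Gi
  hostPath:
    path: /opt/local-path-provisioner/pvc-example   # node-local directory
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node02   # the node where the data physically lives
```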
Execution Timeline (UTC, 2026-03-02)
1) Control-plane wave
I triggered the control-plane plan first via node label.
What went wrong:
- `apply-server-plan` stayed `Pending` because the plan did not have tolerations for the control-plane taints.

How I fixed it:
- Added/verified toleration: `node-role.kubernetes.io/control-plane`
- Added/verified toleration: `node-role.kubernetes.io/master`
- Re-synced and resumed the wave

Result:
- Control-plane upgraded and came back `Ready`.
Reference:
- igh9410-infra/docs/retrospective/commands-01-master-wave-troubleshooting.md
2) Worker waves: node02 -> node04 -> node01
For each worker:
- Add `upgrade-wave=active`
- Watch SUC jobs/pods and node readiness/version
- Remove the label
- Re-check app and DB health before the next wave
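That per-wave routine can be sketched as commands (using `node02` as the example; the exact health checks varied, and `clusters` assumes the CNPG CRD short name is available):

```shell
# One worker wave. Label, watch, unlabel, verify.
kubectl label node node02 upgrade-wave=active --overwrite
kubectl -n system-upgrade get jobs,pods -w   # wait for the apply job to complete
kubectl get node node02 -o wide              # confirm Ready and v1.34.4+k3s1
kubectl label node node02 upgrade-wave-      # close the gate before the next wave
kubectl get pods -n prod                     # app health
kubectl get clusters -n prod                 # CNPG readiness/primary state
```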
All three workers upgraded successfully to v1.34.4+k3s1.
Reference:
- igh9410-infra/docs/retrospective/commands-02-worker-node02-wave.md
- igh9410-infra/docs/retrospective/commands-04-worker-node04-wave.md
- igh9410-infra/docs/retrospective/commands-05-worker-node01-wave.md
3) CNPG primary handover drills before node03
Before the final worker wave, I intentionally triggered DB primary handovers to measure risk in advance.
dev-gramnuri-db-cluster
- Full failover phase: ~`3m27s`
- Primary promotion window: ~`11.4s`
- App impact: `dev/gramnuri-api` showed 2x `500` (about a `1.4s` burst)

prod-artskorner-db-cluster
- Full failover phase: ~`16.2s`
- Promotion window: ~`3.7s`
- App impact: no observed `500` in the measured window

prod-gramnuri-db-cluster
- Full failover phase: ~`3m13s` (longer due to old primary termination delay)
- Promotion window: ~`3.6s`
- App impact: no observed `500` in the measured window
- POST traffic continued (`200` and one `429`), no DB-error signature
- Additional planned switchover to `node04` (2026-03-02T12:13:45Z)
- Primary election window: ~`3.8s` (currentPrimaryTimestamp=2026-03-02T12:13:48.766Z)
- Full switchover to healthy: ~`29s` (`Cluster in healthy state` at 2026-03-02T12:14:14Z)
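The lower-impact planned-switchover path measured above uses the CNPG kubectl plugin; the cluster and instance names below are illustrative:

```shell
# Inspect the cluster to identify the current primary and a healthy replica.
kubectl cnpg status prod-gramnuri-db-cluster -n prod
# Promote a specific replica; CNPG demotes the old primary gracefully instead
# of waiting for failure detection, which is why the window is shorter.
kubectl cnpg promote prod-gramnuri-db-cluster prod-gramnuri-db-cluster-2 -n prod
# Confirm the role transition and that the cluster returns to healthy.
kubectl cnpg status prod-gramnuri-db-cluster -n prod
```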
Reference:
- igh9410-infra/docs/retrospective/commands-06-dev-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-07-prod-artskorner-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-08-prod-gramnuri-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-10-prod-gramnuri-node04-switchover.md
- k3s-failover-impact-report.md
4) Final worker wave: node03
After handover validation:
- SUC job start: `2026-03-02T10:37:25Z`
- SUC job end: `2026-03-02T10:38:29Z`
- Duration: ~`64s`

Observed during this wave (prod-gramnuri focus):
- No `500` signal from `prod/gramnuri-api` or `prod/gramnuri-web`
- `prod-gramnuri-db-cluster` primary remained on `node01`
- No new CNPG failover/recovery event in that window
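Checks along these lines can confirm such a quiet window; the names are illustrative, and the status fields assume the CNPG Cluster CRD schema:

```shell
# Recent events for the DB cluster around the wave window.
kubectl -n prod get events --field-selector involvedObject.name=prod-gramnuri-db-cluster \
  --sort-by=.lastTimestamp | tail -n 20
# Current primary and overall phase straight from the Cluster status.
kubectl -n prod get cluster prod-gramnuri-db-cluster \
  -o jsonpath='{.status.currentPrimary}{"  "}{.status.phase}{"\n"}'
```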
Reference:
- igh9410-infra/docs/retrospective/commands-09-worker-node03-wave.md
Final Result
All nodes ended on `v1.34.4+k3s1`:
- `controlplane`
- `node01`
- `node02`
- `node03`
- `node04`

Observed downtime outcome:
- No app-visible downtime signal was observed in `dev`/`prod` during the K3s node upgrade waves.
- A brief interruption was observed only in `dev/gramnuri-api` during a DB primary switchover drill.
- This switchover window is expected PostgreSQL/CNPG role-transition behavior, not a K3s upgrade-side outage.
One caveat:
- This conclusion is based on observed logs/events and probe traffic. Strong evidence, but still traffic-dependent.
What I’d Keep for Future Upgrades and What Could Be Improved
Things I’d Keep
- Keep upgrades label-gated and wave-based
- Keep versions pinned in Git
- Keep DB handover drills before high-risk node waves
- Keep between-wave health checkpoints mandatory
- Keep command-level evidence logs for every maintenance event
Things I Could Improve
- Postgres primary switchover could be handled with a manual `kubectl cnpg promote` command instead of deleting the primary pod
- Application-level database connection settings, such as retry and connection-pool settings, could be tuned further
- To simulate Postgres failover impact more accurately, I would create a test API endpoint that executes a write query against the database, and run load tests on that endpoint during the DB failover window
Main Prompts I Used & Raw Evidence Files
Main Prompts I Used:
- I'm planning to perform on-premise k3s cluster upgrade from v1.33 to v.134 now. There's some Go backend apis and Postgres dbs running using the CloudNativePG. Since this is my homelab, the interruption is acceptable but I want to do it without downtime to simulate the real-world production environment. Help me plan the K8S cluster upgrade. This cluster has single master node, four worker nodes. You can use the kubectl or whatever tools necessary. The context is default. And I want to document the upgrade steps step by step like in Markdown file
- The Applications are deployed using ArgoCD.Inspect the apps directory, include PDB or increase replicas and other necessary stuff using GitOps principle. I will do the sync when it's pushed to main branch
- I don't want to do the manual upgrade but Rancher's System Upgrade Controller. I will install it via ArgoCD. kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml Do not apply that command but get that crd.yaml and save it into argocd/infra-apps Application and save that crd.yaml and system-upgrade-controller.yaml in infrastructure/system-upgrade-controller, make sure that ArgoCD Application for those resources are created under argocd/infra-apps directory
Raw Evidence Files:
- igh9410-infra/docs/retrospective/commands-00-baseline-and-downtime-verification.md
- igh9410-infra/docs/retrospective/commands-01-master-wave-troubleshooting.md
- igh9410-infra/docs/retrospective/commands-02-worker-node02-wave.md
- igh9410-infra/docs/retrospective/commands-03-gitops-repo-and-docs.md
- igh9410-infra/docs/retrospective/commands-04-worker-node04-wave.md
- igh9410-infra/docs/retrospective/commands-05-worker-node01-wave.md
- igh9410-infra/docs/retrospective/commands-06-dev-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-07-prod-artskorner-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-08-prod-gramnuri-primary-handover-before-node03.md
- igh9410-infra/docs/retrospective/commands-10-prod-gramnuri-node04-switchover.md
- igh9410-infra/docs/retrospective/commands-09-worker-node03-wave.md
- igh9410-infra/docs/k3s-failover-impact-report.md