Safe Worker Node Upgrade Procedure

Overview

This procedure upgrades the worker nodes one at a time to apply the new disk partitioning for rook-ceph without disrupting the running cluster. The control plane nodes remain untouched.

Prerequisites

  • Kubernetes cluster is healthy
  • Control plane nodes (control0, control1) are running normally
  • You have physical or remote access to each worker node
  • Workloads can tolerate one node being down at a time (see the quick check below)
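
A quick way to sanity-check the last point is to list existing PodDisruptionBudgets and flag single-replica Deployments. This is only a rough sketch, not an exhaustive audit:

# PodDisruptionBudgets constrain how drain can evict pods
kubectl get pdb -A

# Flag Deployments running a single replica (these may see brief downtime during a drain)
kubectl get deploy -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas' \
  | awk 'NR==1 || $3 == 1'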

Important Notes

  • ⚠️ This is a destructive operation for each worker node - all data on the worker will be wiped
  • ⚠️ Only affects worker nodes - control plane remains running
  • ⚠️ Process one worker at a time - ensures cluster availability

Step-by-Step Procedure

Pre-Flight Checks

# Verify cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed

# Verify you have 3 healthy workers
talosctl --nodes 10.0.50.10,10.0.50.11,10.0.50.12 version

# Check what's running on each worker
kubectl get pods -A -o wide | grep worker0
kubectl get pods -A -o wide | grep worker1
kubectl get pods -A -o wide | grep worker2
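
If you prefer field selectors over the grep pipeline, this is roughly equivalent to the "not Running, not Completed" check above (it lists pods whose phase is neither Running nor Succeeded):

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded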

For Each Worker Node (Repeat 3 times)

Worker 0 (10.0.50.10)

Step 1: Cordon the node

kubectl cordon worker0

Step 2: Drain the node (gracefully evict all pods)

kubectl drain worker0 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300 \
  --timeout=600s

Step 3: Verify pods migrated

# Should show no pods (except DaemonSets)
kubectl get pods -A -o wide | grep worker0
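
grep matches any line containing "worker0"; if that produces false positives (for example, pod names that happen to include the string), a field selector on the node name is more precise:

# List only pods actually scheduled on worker0
kubectl get pods -A -o wide --field-selector spec.nodeName=worker0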

Step 4: Reset ONLY this worker node

talosctl --nodes 10.0.50.10 reset \
  --graceful=false \
  --reboot \
  --system-labels-to-wipe STATE \
  --system-labels-to-wipe EPHEMERAL

Step 5: Wait for node to reboot into maintenance mode

# This may take 2-5 minutes
# Check when it's responding to ping
ping 10.0.50.10
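
If you would rather not re-run ping by hand, a small wait loop does the same job. This is just a convenience sketch and assumes Linux iputils ping, where -W is the per-reply timeout in seconds:

# Poll until the node answers ping again
until ping -c 1 -W 2 10.0.50.10 >/dev/null 2>&1; do
  sleep 5
done
echo "10.0.50.10 is responding again"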

Step 6: Apply the new configuration with disk partitioning

cd talos
talhelper gencommand apply \
  --node 10.0.50.10 \
  --extra-flags="--insecure" \
  | bash
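
If you want to inspect the generated talosctl command before running it, invoke talhelper without the pipe first and review its output:

# Print the generated apply command without executing it
talhelper gencommand apply --node 10.0.50.10 --extra-flags="--insecure"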

Step 7: Wait for node to join cluster

# Watch node status (wait for Ready)
watch kubectl get nodes

# Alternative: use this to wait
kubectl wait --for=condition=Ready node/worker0 --timeout=600s

Step 8: Verify the new disk layout

talosctl --nodes 10.0.50.10 get discoveredvolumes | grep nvme0n1
# You should see nvme0n1p5 now (the new 100GB partition)

# Verify mount
talosctl --nodes 10.0.50.10 df | grep rook

Step 9: Uncordon the node

kubectl uncordon worker0

Step 10: Verify cluster health

kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed

# Wait for pods to redistribute
sleep 60
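
Instead of a fixed sleep, you can poll until no pods are left Pending. This is a rough heuristic rather than a guarantee that rescheduling has finished:

# Wait until no pods report phase Pending
until [ -z "$(kubectl get pods -A --field-selector=status.phase=Pending -o name)" ]; do
  sleep 10
done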

Worker 1 (10.0.50.11)

Repeat the same steps for worker1:

# Step 1: Cordon
kubectl cordon worker1

# Step 2: Drain
kubectl drain worker1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300 \
  --timeout=600s

# Step 3: Verify
kubectl get pods -A -o wide | grep worker1

# Step 4: Reset
talosctl --nodes 10.0.50.11 reset \
  --graceful=false \
  --reboot \
  --system-labels-to-wipe STATE \
  --system-labels-to-wipe EPHEMERAL

# Step 5: Wait for reboot
ping 10.0.50.11

# Step 6: Apply config
cd talos
talhelper gencommand apply \
  --node 10.0.50.11 \
  --extra-flags="--insecure" \
  | bash

# Step 7: Wait for Ready
kubectl wait --for=condition=Ready node/worker1 --timeout=600s

# Step 8: Verify disk
talosctl --nodes 10.0.50.11 get discoveredvolumes | grep nvme0n1

# Step 9: Uncordon
kubectl uncordon worker1

# Step 10: Health check
kubectl get nodes
sleep 60

Worker 2 (10.0.50.12)

Repeat the same steps for worker2:

# Step 1: Cordon
kubectl cordon worker2

# Step 2: Drain
kubectl drain worker2 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300 \
  --timeout=600s

# Step 3: Verify
kubectl get pods -A -o wide | grep worker2

# Step 4: Reset
talosctl --nodes 10.0.50.12 reset \
  --graceful=false \
  --reboot \
  --system-labels-to-wipe STATE \
  --system-labels-to-wipe EPHEMERAL

# Step 5: Wait for reboot
ping 10.0.50.12

# Step 6: Apply config
cd talos
talhelper gencommand apply \
  --node 10.0.50.12 \
  --extra-flags="--insecure" \
  | bash

# Step 7: Wait for Ready
kubectl wait --for=condition=Ready node/worker2 --timeout=600s

# Step 8: Verify disk
talosctl --nodes 10.0.50.12 get discoveredvolumes | grep nvme0n1

# Step 9: Uncordon
kubectl uncordon worker2

# Step 10: Health check
kubectl get nodes

Post-Upgrade Verification

Verify all workers have the new disk layout:

for node in 10.0.50.10 10.0.50.11 10.0.50.12; do
  echo "=== Node $node ==="
  talosctl --nodes $node get discoveredvolumes | grep nvme0n1p5
done

Verify all nodes are Ready:

kubectl get nodes
# All 5 nodes (2 control + 3 worker) should be Ready

Commit the configuration:

git add -A
git commit -m "feat: add rook-ceph with dedicated storage partitions on workers"
git push

Deploy Rook-Ceph

After all workers are upgraded, Flux will automatically deploy rook-ceph:

# Force Flux reconciliation
task reconcile

# Watch rook-ceph deployment
watch kubectl get pods -n rook-ceph

# Monitor Flux HelmReleases
flux get hr -A

# Check Ceph cluster formation (may take 5-10 minutes)
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get cephblockpool
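
If the rook-ceph toolbox is deployed in your setup (an assumption; the toolbox is an optional Rook component), you can also query Ceph health directly:

# Requires the optional rook-ceph-tools toolbox Deployment
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status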

Troubleshooting

Node stuck in "NotReady" after config apply

# Check node logs
talosctl --nodes <NODE_IP> dmesg | tail -50

# Check kubelet status
talosctl --nodes <NODE_IP> service kubelet status

# If needed, reboot manually
talosctl --nodes <NODE_IP> reboot

Drain hangs or times out

# Identify stuck pods
kubectl get pods -A -o wide | grep <NODE_NAME>

# Force delete stuck pods (last resort)
kubectl delete pod <POD_NAME> -n <NAMESPACE> --grace-period=0 --force

# Then re-run drain
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --force

Node can't rejoin cluster

# Check if node can reach control plane
talosctl --nodes <NODE_IP> get members

# Verify network connectivity
talosctl --nodes <NODE_IP> get links

# Check certificates
talosctl --nodes <NODE_IP> get certificates

# Re-apply config if needed
cd talos
talhelper gencommand apply --node <NODE_IP> --extra-flags="--insecure" | bash

Partition not created correctly

# Check actual disk layout
talosctl --nodes <NODE_IP> disks

# Check discovered volumes
talosctl --nodes <NODE_IP> get discoveredvolumes

# If partition is missing, the node needs to be reset again:
talosctl --nodes <NODE_IP> reset --graceful=false --reboot \
  --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL

Rollback Procedure

If you need to roll back a worker before completing all three:

  1. Remove the worker patch from talconfig.yaml:

    # Edit talos/talconfig.yaml and remove:
    # worker:
    #   patches:
    #     - "@./patches/worker/machine-disks.yaml"
    

  2. Regenerate configs:

    task talos:generate-config
    

  3. Reset and re-apply the node:

    talosctl --nodes <NODE_IP> reset --graceful=false --reboot
    # Wait for maintenance mode
    talhelper gencommand apply --node <NODE_IP> --extra-flags="--insecure" | bash
    

Timeline Estimate

  • Per worker node: ~10-15 minutes
      • Drain: 2-5 minutes
      • Reset & reboot: 2-3 minutes
      • Config apply: 1-2 minutes
      • Node Ready: 2-3 minutes
      • Pod rescheduling: 2-3 minutes

  • Total for all 3 workers: ~30-45 minutes

Safety Considerations

Safe:

  • Control plane remains fully operational
  • 2 out of 3 workers available during each upgrade
  • Workloads automatically reschedule to healthy nodes
  • Can pause between workers to monitor stability

⚠️ Risks:

  • Workloads without multi-replica deployments may experience downtime
  • Pods using local storage (emptyDir, hostPath) will be terminated
  • StatefulSets may be disrupted if they don't have pod anti-affinity

🔧 Recommendations:

  • Perform during a maintenance window if possible
  • Ensure critical workloads have multiple replicas
  • Monitor the cluster closely during each worker upgrade
  • Wait and verify stability before proceeding to the next worker