Safe Worker Node Upgrade Procedure¶
Overview¶
This procedure allows you to upgrade worker nodes one at a time to apply the new disk partitioning for rook-ceph without disrupting your running cluster. The control plane nodes remain untouched.
Prerequisites¶
- Kubernetes cluster is healthy
- Control plane nodes (control0, control1) are running normally
- You have physical or remote access to each worker node
- Workloads can tolerate one node being down at a time
Important Notes¶
⚠️ This is a destructive operation for each worker node - all data on the worker will be wiped
⚠️ Only affects worker nodes - control plane remains running
⚠️ Process one worker at a time - ensures cluster availability
Step-by-Step Procedure¶
Pre-Flight Checks¶
# Verify cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
# Verify you have 3 healthy workers
talosctl --nodes 10.0.50.10,10.0.50.11,10.0.50.12 version
# Check what's running on each worker
kubectl get pods -A -o wide | grep worker0
kubectl get pods -A -o wide | grep worker1
kubectl get pods -A -o wide | grep worker2
For Each Worker Node (Repeat 3 times)¶
Worker 0 (10.0.50.10)¶
Step 1: Cordon the node
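kubectl cordon worker0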
Step 2: Drain the node (gracefully evict all pods)
kubectl drain worker0 \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300 \
--timeout=600s
Step 3: Verify pods migrated
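kubectl get pods -A -o wide | grep worker0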
Step 4: Reset ONLY this worker node
talosctl --nodes 10.0.50.10 reset \
--graceful=false \
--reboot \
--system-labels-to-wipe STATE \
--system-labels-to-wipe EPHEMERAL
Step 5: Wait for node to reboot into maintenance mode
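ping 10.0.50.10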
Step 6: Apply the new configuration with disk partitioning
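cd talos
talhelper gencommand apply \
--node 10.0.50.10 \
--extra-flags="--insecure" \
| bash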
Step 7: Wait for node to join cluster
# Watch node status (wait for Ready)
watch kubectl get nodes
# Alternative: block until the node reports Ready
kubectl wait --for=condition=Ready node/worker0 --timeout=600s
Step 8: Verify the new disk layout
talosctl --nodes 10.0.50.10 get discoveredvolumes | grep nvme0n1
# You should see nvme0n1p5 now (the new 100GB partition)
# Verify mount
talosctl --nodes 10.0.50.10 df | grep rook
Step 9: Uncordon the node
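kubectl uncordon worker0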
Step 10: Verify cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
# Wait for pods to redistribute
sleep 60
Worker 1 (10.0.50.11)¶
Repeat the same steps for worker1:
# Step 1: Cordon
kubectl cordon worker1
# Step 2: Drain
kubectl drain worker1 \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300 \
--timeout=600s
# Step 3: Verify
kubectl get pods -A -o wide | grep worker1
# Step 4: Reset
talosctl --nodes 10.0.50.11 reset \
--graceful=false \
--reboot \
--system-labels-to-wipe STATE \
--system-labels-to-wipe EPHEMERAL
# Step 5: Wait for reboot
ping 10.0.50.11
# Step 6: Apply config
cd talos
talhelper gencommand apply \
--node 10.0.50.11 \
--extra-flags="--insecure" \
| bash
# Step 7: Wait for Ready
kubectl wait --for=condition=Ready node/worker1 --timeout=600s
# Step 8: Verify disk
talosctl --nodes 10.0.50.11 get discoveredvolumes | grep nvme0n1
# Step 9: Uncordon
kubectl uncordon worker1
# Step 10: Health check
kubectl get nodes
sleep 60
Worker 2 (10.0.50.12)¶
Repeat the same steps for worker2:
# Step 1: Cordon
kubectl cordon worker2
# Step 2: Drain
kubectl drain worker2 \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=300 \
--timeout=600s
# Step 3: Verify
kubectl get pods -A -o wide | grep worker2
# Step 4: Reset
talosctl --nodes 10.0.50.12 reset \
--graceful=false \
--reboot \
--system-labels-to-wipe STATE \
--system-labels-to-wipe EPHEMERAL
# Step 5: Wait for reboot
ping 10.0.50.12
# Step 6: Apply config
cd talos
talhelper gencommand apply \
--node 10.0.50.12 \
--extra-flags="--insecure" \
| bash
# Step 7: Wait for Ready
kubectl wait --for=condition=Ready node/worker2 --timeout=600s
# Step 8: Verify disk
talosctl --nodes 10.0.50.12 get discoveredvolumes | grep nvme0n1
# Step 9: Uncordon
kubectl uncordon worker2
# Step 10: Health check
kubectl get nodes
Post-Upgrade Verification¶
Verify all workers have the new disk layout:
for node in 10.0.50.10 10.0.50.11 10.0.50.12; do
echo "=== Node $node ==="
talosctl --nodes $node get discoveredvolumes | grep nvme0n1p5
done
Verify all nodes are Ready:
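kubectl get nodes
# Optionally block until each worker reports Ready
for node in worker0 worker1 worker2; do
  kubectl wait --for=condition=Ready node/$node --timeout=120s
done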
Commit the configuration:
git add -A
git commit -m "feat: add rook-ceph with dedicated storage partitions on workers"
git push
Deploy Rook-Ceph¶
After all workers are upgraded, Flux will automatically deploy rook-ceph:
# Force Flux reconciliation
task reconcile
# Watch rook-ceph deployment
watch kubectl get pods -n rook-ceph
# Monitor Flux HelmReleases
flux get hr -A
# Check Ceph cluster formation (may take 5-10 minutes)
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get cephblockpool
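If the rook-ceph-tools toolbox is deployed (an optional Rook manifest; an assumption here), you can also inspect Ceph health directly:
# Ceph status from the toolbox pod (requires the rook-ceph-tools deployment)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree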
Troubleshooting¶
Node stuck in "NotReady" after config apply¶
# Check node logs
talosctl --nodes <NODE_IP> dmesg | tail -50
# Check kubelet status
talosctl --nodes <NODE_IP> service kubelet status
# If needed, reboot manually
talosctl --nodes <NODE_IP> reboot
Drain hangs or times out¶
# Identify stuck pods
kubectl get pods -A -o wide | grep <NODE_NAME>
# Force delete stuck pods (last resort)
kubectl delete pod <POD_NAME> -n <NAMESPACE> --grace-period=0 --force
# Then re-run drain
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data --force
Node can't rejoin cluster¶
# Check if node can reach control plane
talosctl --nodes <NODE_IP> get members
# Verify network connectivity
talosctl --nodes <NODE_IP> get links
# Check certificates
talosctl --nodes <NODE_IP> get certificates
# Re-apply config if needed
cd talos
talhelper gencommand apply --node <NODE_IP> --extra-flags="--insecure" | bash
Partition not created correctly¶
# Check actual disk layout
talosctl --nodes <NODE_IP> disks
# Check discovered volumes
talosctl --nodes <NODE_IP> get discoveredvolumes
# If partition is missing, the node needs to be reset again:
talosctl --nodes <NODE_IP> reset --graceful=false --reboot \
--system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL
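# After the reset, re-apply the config (same as Step 6) so the node rejoins with the new layout
cd talos
talhelper gencommand apply --node <NODE_IP> --extra-flags="--insecure" | bash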
Rollback Procedure¶
If you need to roll back a worker before completing all three (see the sketch after this list):

- Remove the worker patch from talconfig.yaml
- Regenerate the configs
- Reset and re-apply the node
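A minimal sketch of the rollback, assuming the patch lives in talos/talconfig.yaml and configs are regenerated with talhelper genconfig (adjust to your layout):
# 1. Edit talconfig.yaml, drop the worker patch, then regenerate configs
cd talos
talhelper genconfig
# 2. Reset the node (same flags as in the upgrade steps)
talosctl --nodes <NODE_IP> reset --graceful=false --reboot \
--system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL
# 3. Re-apply the regenerated config once the node is back in maintenance mode
talhelper gencommand apply --node <NODE_IP> --extra-flags="--insecure" | bash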
Timeline Estimate¶
- Per worker node: ~10-15 minutes
    - Drain: 2-5 minutes
    - Reset & reboot: 2-3 minutes
    - Config apply: 1-2 minutes
    - Node Ready: 2-3 minutes
    - Pod rescheduling: 2-3 minutes
- Total for all 3 workers: ~30-45 minutes
Safety Considerations¶
✅ Safe:

- Control plane remains fully operational
- 2 out of 3 workers available during each upgrade
- Workloads automatically reschedule to healthy nodes
- Can pause between workers to monitor stability

⚠️ Risks:

- Workloads without multi-replica deployments may experience downtime
- Pods using local storage (emptyDir, hostPath) will be terminated
- StatefulSets may be disrupted if they don't have pod anti-affinity

🔧 Recommendations:

- Perform during a maintenance window if possible
- Ensure critical workloads have multiple replicas
- Monitor the cluster closely during each worker upgrade
- Wait and verify stability before proceeding to the next worker