Post-upgrade Infrastructure Checks

After upgrading your Kubernetes cluster, it’s important to verify that all infrastructure components are functioning properly. This ensures your workloads are running reliably and that observability and automation systems remain intact.


1. Node and Pod Health

Validate node and pod status

After the upgrade, confirm all cluster nodes are healthy and workloads are running as expected.

Steps:

1 Check node status:

kubectl get nodes

All nodes should report Ready, and none should be marked SchedulingDisabled (which would indicate a node was left cordoned after the upgrade).

2 Check for crashlooping or pending pods:

kubectl get pods -A --field-selector=status.phase!=Running

Pods stuck in CrashLoopBackOff can still report a Running phase, so also scan the RESTARTS column of kubectl get pods -A for unusually high counts.

3 Run connectivity checks using a utility pod like netshoot to confirm service reachability across nodes.
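A minimal sketch using the nicolaka/netshoot image; the service name, namespace, and port are placeholders for a Service you expect to be reachable:

kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- nslookup kubernetes.default.svc.cluster.local

kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- curl -sv http://<service-name>.<namespace>.svc.cluster.local:<port>

The first command confirms cluster DNS resolves, the second that the Service answers on its port; both pods are deleted automatically when the command exits.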


2. Version Confirmation

Confirm the UI displays the new Kubernetes version

Verify that your cluster management interface (e.g., Rancher) shows the upgraded version and reports a healthy cluster state.

Steps:

1 In Rancher, go to the cluster dashboard and confirm the displayed Kubernetes version matches your target version.

2 Optionally confirm with the CLI:

kubectl version

The Server Version line reflects the control plane; the Client Version only reflects your local kubectl binary.
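To confirm that every node's kubelet also reports the target version, a quick sketch using kubectl's custom-columns output:

kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

Any node still showing the old version may not have finished upgrading or may have been skipped.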

3. Storage Validation

Check volume mounts

Verify that pods can still access their Persistent Volume Claims (PVCs) and that no mount errors occurred post-upgrade.

Steps:

1 List pods with volumes:

kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'

2 Inspect logs of workloads that rely on persistent storage for any mount or I/O errors:

kubectl logs <pod-name> -n <namespace>

Mount and attach failures usually surface as pod events (for example FailedMount or FailedAttachVolume) rather than in container logs, so also check the Events section of kubectl describe pod <pod-name> -n <namespace>.

3 Validate access to mounted volumes within pods (replace /data with the workload's actual mount path):

kubectl exec -it <pod> -- ls /data
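Two further sketches: confirm all PVCs are still Bound, and run a simple write/read test inside a pod. The /data path is a placeholder, and the write test should only be used on volumes where a scratch file is acceptable:

kubectl get pvc -A

kubectl exec -it <pod> -- sh -c 'echo ok > /data/.upgrade-check && cat /data/.upgrade-check && rm /data/.upgrade-check'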

4. Observability Systems

Monitor metrics and alerts

Ensure your monitoring tools are still collecting data and alerts are firing as expected.

Steps:

1 Open Grafana dashboards and confirm metrics are flowing.

2 Check the Prometheus Targets page for scrape errors (or use the CLI sketch after this list).

3 Review Alertmanager for any silenced or missing alerts.
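If you prefer the CLI, a minimal sketch for checking scrape health is to port-forward to the Prometheus service and query its targets API. The service name and namespace below assume Rancher's Monitoring chart and may differ in your cluster:

kubectl port-forward -n cattle-monitoring-system svc/rancher-monitoring-prometheus 9090:9090

Then, in a second terminal:

curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"down"' | wc -l

A non-zero count means some targets are failing to scrape and deserve a closer look in the UI.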

Validate logging, monitoring, and tracing tools

1 Check that logging agents such as Fluent Bit or Fluentd are still shipping logs to your central log system (a quick verification sketch follows this list).

2 Validate OpenTelemetry collectors or Jaeger if using tracing.

3 Restart observability workloads if needed, for example:

kubectl rollout restart deployment <deployment-name> -n cattle-monitoring-system

In operator-managed stacks such as Rancher Monitoring, Prometheus itself typically runs as a StatefulSet, so target the StatefulSet instead: kubectl rollout restart statefulset <statefulset-name> -n cattle-monitoring-system.
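A quick sketch for verifying the logging agents themselves; the cattle-logging-system namespace and the daemonset name are assumptions based on Rancher Logging, so adjust for your setup:

kubectl get daemonsets -n cattle-logging-system

kubectl logs daemonset/<logging-daemonset> -n cattle-logging-system --tail=50

All daemonset pods should be scheduled and ready, and the recent log lines should be free of output or connection errors to your log backend.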

5. Autoscaler Check

Cluster Autoscaler or Karpenter

Ensure your autoscaling logic is still working with the new Kubernetes version. Autoscalers automatically add or remove nodes based on workload demand — if they break after an upgrade, your cluster may not handle scaling events properly, leading to outages or resource exhaustion.

Steps:

1 Verify that Cluster Autoscaler or Karpenter pods are running:

kubectl get pods -n kube-system

(Look for pods like cluster-autoscaler; Karpenter may run in its own karpenter namespace depending on how it was installed.)

2 Review logs for warnings or errors:

kubectl logs <autoscaler-pod> -n kube-system

Check for signs of misconfiguration or instability that might be exposed or worsened by the upgrade.

3 Optionally trigger a scale-up or scale-down event (e.g., by deploying a workload that exceeds current capacity) and confirm new nodes are provisioned or deprovisioned as expected.
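A minimal sketch of such a test; the deployment name, image, and replica count are arbitrary, and the CPU request should be sized so the total exceeds your current spare capacity:

kubectl create deployment scale-test --image=registry.k8s.io/pause:3.9 --replicas=1

kubectl set resources deployment scale-test --requests=cpu=1

kubectl scale deployment scale-test --replicas=20

kubectl get nodes -w

Once the pending pods are scheduled onto newly provisioned nodes, delete the test deployment and confirm the autoscaler removes the extra nodes after its scale-down delay:

kubectl delete deployment scale-test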