Pre-upgrade Infrastructure Checks
Before upgrading your Kubernetes cluster, it's critical to validate the infrastructure to ensure a smooth and reliable upgrade process. This checklist helps prevent downtime, compatibility issues, and data loss.
1. Backup
Create a manual backup using Velero
Before starting the upgrade, create a full cluster backup with Velero. This should include both Kubernetes resources (e.g., Deployments, Services, ConfigMaps) and persistent volumes.
Steps:
1 Ensure Velero is properly installed and configured with a backup storage location (e.g., an S3 bucket).
2 Trigger a manual backup using the CLI:
velero backup create pre-upgrade-backup --include-namespaces '*' --wait
3 Verify the backup status and contents:
velero backup describe pre-upgrade-backup --details
It is recommended to validate this backup, for example by test-restoring it into a staging environment, before proceeding with the upgrade.
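A minimal restore test with Velero might look like the following (the restore name is just an example; run this against a staging or test cluster configured with the same backup storage location, not against production):
# restore the backup and wait for it to finish
velero restore create pre-upgrade-restore-test --from-backup pre-upgrade-backup --wait
# review the restore result, including any per-resource warnings or errors
velero restore describe pre-upgrade-restore-test --details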
2. Cluster Health Validation
Check cluster health
Before upgrading, ensure the cluster is in a healthy state and all core components are running as expected.
Steps:
1 Check control plane components:
kubectl get --raw='/readyz?verbose'
This command returns the API server's detailed readiness checks, including its connection to etcd and its post-start hooks. Ensure every check is prefixed with + (ok) before proceeding; on self-managed clusters, also confirm that the scheduler and controller-manager pods in kube-system are healthy.
2 Verify node readiness:
kubectl get nodes
All nodes should be in Ready state, with no taints or conditions indicating memory, disk, or PID pressure (a quick way to check both is shown after this list).
3 Optionally, run a cluster-wide diagnostic tool such as kube-bench or kube-score, or use Rancher's built-in cluster health checks, to get a broader health overview.
Tip: If any node is NotReady or any control plane check reports unhealthy, resolve those issues before continuing with the upgrade.
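To double-check step 2, the commands below list node taints and surface pressure conditions (the grep pattern is just one way to filter the describe output):
# show any taints currently applied to each node (an empty TAINTS column means none)
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
# surface memory, disk, and PID pressure conditions per node
kubectl describe nodes | grep -E 'Name:|MemoryPressure|DiskPressure|PIDPressure'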
3. Compatibility Checks
Validate Rancher, Helm, CRDs, plugins, and EKS add-ons
Before upgrading Kubernetes, ensure all key tools and integrations are compatible with the target version.
Steps:
1 Check Rancher compatibility:
Review the Rancher support matrix to verify your current Rancher version supports the desired Kubernetes version.
2 Review Helm releases and CRDs:
helm list -A
kubectl get crds
Helm: List all deployed Helm releases across namespaces and review them for compatibility with the target Kubernetes version. Upgrade any charts that rely on deprecated APIs or include breaking changes.
CRDs: Review all CustomResourceDefinitions (CRDs) in your cluster. Check whether their apiVersion is compatible with the Kubernetes version you're upgrading to. Deprecated or removed API versions can cause workloads to break post-upgrade.
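One quick way to see which API versions each CRD currently serves (the custom-columns expression below is only an illustration; adjust as needed):
# list every CRD together with the API versions it serves
kubectl get crds -o custom-columns=NAME:.metadata.name,VERSIONS:.spec.versions[*].name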
3 Core integrations compatibility:
Before upgrading, it’s crucial to ensure that all core integrations—like ingress controllers, container networking (CNI), and storage drivers (CSI)—are compatible with the target Kubernetes version. Incompatible plugins can lead to networking failures, pod scheduling issues, or even a broken cluster after upgrade.
Ingress controllers: Check the version of your ingress controller (e.g. NGINX) and consult its documentation to confirm support for the new Kubernetes version.
CNI/CSI plugins: Review the container network interface (CNI) and container storage interface (CSI) drivers in use. These components often have version-specific dependencies tied to the Kubernetes API and node OS.
Other integrations: Evaluate any custom controllers, admission webhooks, or third-party operators to ensure they're tested against the version you're upgrading to.
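For example, to find the image (and therefore the version) of an NGINX ingress controller, something like the following works; the namespace and deployment name are the common upstream defaults, so adjust them to your installation:
# print the controller image, which encodes the ingress-nginx version
kubectl get deploy ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.template.spec.containers[0].image}'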
4 For EKS: Managed Add-ons
If you're running EKS, it's essential to check that all managed add-ons are compatible with the Kubernetes version you're upgrading to. These add-ons—such as CoreDNS, kube-proxy, and Amazon VPC CNI—are tightly coupled to the control plane and must be updated before upgrading the cluster version.
To check:
- Open the AWS Console and navigate to EKS > Clusters > [Your Cluster] > Add-ons.
- Review current add-on versions:
- CoreDNS
- kube-proxy
- Amazon VPC CNI
- Any other installed add-ons
- Confirm compatibility with the target Kubernetes version. The AWS Console will indicate whether your current versions are compatible or require an update. Look for upgrade prompts or warnings.
- Update add-ons if needed. Use the AWS Console to upgrade any out-of-date or incompatible add-ons before upgrading your Kubernetes version. This ensures the control plane components can interact properly with networking and DNS services post-upgrade.
⚠️ Skipping this step may result in networking issues, DNS resolution failures, or broken cluster behavior after upgrade.
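As an alternative to clicking through the console, the AWS CLI exposes the same information (the cluster name and Kubernetes version below are placeholders):
# list the add-ons installed on the cluster
aws eks list-addons --cluster-name <your-cluster>
# show which CoreDNS add-on versions are available for the target Kubernetes version
aws eks describe-addon-versions --kubernetes-version 1.29 --addon-name coredns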
4. API Deprecations
Review deprecated APIs
Before upgrading, identify and address deprecated APIs to avoid post-upgrade failures.
Steps:
1 Review the Kubernetes GitHub release notes for deprecations.
2 Check official provider-specific documentation for EKS, AKS, or GKE.
3 Use tools like pluto or kubent to scan your cluster for deprecated API usage:
pluto detect-helm -o wide
kubent
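Both tools can also be pointed at the specific version you are targeting; flag names can change between releases, so verify with --help first (1.29 is a placeholder):
pluto detect-helm --target-versions k8s=v1.29.0 -o wide
kubent --target-version 1.29.0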
5. Node Drain Test
Simulate a node upgrade
Validating how workloads behave during node evacuation helps catch disruptions early.
Steps:
1 Choose a worker node and drain it:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
2 Monitor the pods being rescheduled:
kubectl get pods -A -o wide
3 Uncordon the node afterward:
kubectl uncordon <node-name>
Ensure all critical pods recover properly without manual intervention.
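Drains frequently stall on PodDisruptionBudgets, so it is worth reviewing them before the test:
# list PodDisruptionBudgets and how much disruption each one currently allows
kubectl get pdb -A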
6. Networking
Validate NetworkPolicies and prepare for CNI changes
Before upgrading Kubernetes — especially if you're also updating the CNI plugin — review your NetworkPolicies to ensure they won't unintentionally block traffic. CNI behavior may change across versions, and overly strict or outdated policies can silently break service-to-service communication after the upgrade.
Steps:
1 List active policies:
kubectl get networkpolicies --all-namespaces
2 Review for issues
Policies might break app-to-app communication if they rely on deprecated labels/selectors.
3 Plan connectivity tests post-upgrade
Make note of critical service-to-service flows (e.g. frontend → backend, app → database) and prepare to test them immediately after the upgrade using tools like netshoot or kubectl exec.
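A minimal in-cluster probe along those lines, assuming a hypothetical Service named backend in the default namespace (swap in your own critical flows):
# launch a throwaway netshoot pod and curl the service, printing only the HTTP status code
kubectl run conn-test --rm -it --restart=Never --image=nicolaka/netshoot -- curl -s -o /dev/null -w '%{http_code}\n' http://backend.default.svc.cluster.local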
7. Observability Stack
Rancher Monitoring with Prometheus and Grafana
Ensure that Rancher’s built-in monitoring stack is fully functional before starting the upgrade so you can rely on it during and after the process.
Steps:
1 Open the Rancher UI and navigate to the Monitoring section of your cluster. Verify that Prometheus and Grafana are deployed and show as Active.
2 Open Grafana from the Rancher UI and check the Kubernetes cluster dashboard:
- Confirm data is updating in real time.
- Look for any scrape errors or stale metrics, especially from system components like kubelet, coredns, and etcd.
3 In Prometheus, run test queries to confirm metrics are collected:
up
node_memory_MemAvailable_bytes
container_cpu_usage_seconds_total
4 If metrics are missing or dashboards show gaps:
- Restart the Prometheus workload. With Rancher Monitoring, Prometheus typically runs as an operator-managed StatefulSet rather than a Deployment; confirm the name with kubectl get statefulsets -n cattle-monitoring-system, then restart it, for example:
kubectl rollout restart statefulset prometheus-rancher-monitoring-prometheus -n cattle-monitoring-system
- Alternatively, use the Rancher UI to redeploy the monitoring stack.
5 Confirm alerting rules are in place and test one manually (e.g. scale down a deployment to trigger an alert).
Tip: Validating your monitoring before the upgrade ensures you have real-time visibility into cluster behavior during and after the upgrade.
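If you need direct access to the Prometheus UI for the queries in step 3, a port-forward works; the service name below assumes the default rancher-monitoring release, so confirm it first with kubectl get svc -n cattle-monitoring-system:
# forward the Prometheus UI to localhost:9090
kubectl port-forward svc/rancher-monitoring-prometheus 9090:9090 -n cattle-monitoring-system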