RCA-HEALTH-EKS-DEMO: Kubernetes Cluster Upgrade (1.21 to 1.28) - Outage - April 17th, 2024

 

This RCA covers the issues encountered during the recent incremental Kubernetes version upgrade of our AWS EKS cluster, health-eks-demo, from 1.21 to 1.28.

Incident Overview: During the upgrade, we faced two main issues:

  • Load Balancer Provisioning: Provisioning of the Classic Load Balancer remained stuck in a pending state.

  • Worker Node Connectivity: EC2 instances in the health-eks-demo-spot-eks_asg Auto Scaling group failed to join the cluster.

Upgrade Steps Carried Out:

  1. The EKS control plane was manually upgraded from version 1.21 to 1.28 through the AWS console, one minor version at a time; a CLI sketch of a single hop follows this list.

  2. EKS control plane add-ons, including kube-proxy, CoreDNS, and the EBS CSI driver, were updated to their latest compatible versions (per-add-on commands are sketched after this list).

  3. Worker node group AMIs were updated to the version 1.28 AMI (ami-0252c64ffa4a894bb).

  4. To address an issue where volumes were not attaching to pods, a snapshot controller was provisioned. An OIDC provider for the EKS cluster was also created, with the necessary permissions provided (both steps are sketched after this list).

  5. After these adjustments, volumes began attaching successfully, allowing for the upgrade of Elasticsearch and Kibana.

  6. Post-upgrade, the Nginx ingress controller failed because its manifests relied on APIs removed in newer Kubernetes versions (e.g., the v1beta1 Ingress APIs removed in 1.22). It was necessary to upgrade to the latest version tested in our LTS release (see the Helm sketch after this list).

  7. Following the ingress controller upgrade, provisioning of the Classic Load Balancer on AWS remained pending, and the controller Service's EXTERNAL-IP also stayed in the 'pending' state (illustrated after this list). This provisioning is handled by the legacy in-tree cloud provider inside the EKS managed control plane, which exposed no diagnostic logs to us.

  8. We raised a support ticket with the AWS EKS team.
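
As context for step 1, the same control plane upgrade can be driven from the AWS CLI. This sketch assumes the cluster name health-eks-demo and shows only the first 1.21-to-1.22 hop, repeated for each intermediate minor version up to 1.28:

    # Upgrade the control plane one minor version (EKS does not allow skipping versions)
    aws eks update-cluster-version --name health-eks-demo --kubernetes-version 1.22
    # Wait for the control plane to become active again before the next hop
    aws eks wait cluster-active --name health-eks-demo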
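
For step 2, assuming kube-proxy, CoreDNS, and the EBS CSI driver are installed as EKS managed add-ons (rather than raw manifests), compatible versions can be discovered and applied roughly as follows; the version placeholders are intentional:

    # List add-on versions compatible with the target Kubernetes version
    aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.28 \
        --query "addons[].addonVersions[].addonVersion"
    # Update each add-on to a compatible version
    aws eks update-addon --cluster-name health-eks-demo --addon-name kube-proxy --addon-version <compatible-version>
    aws eks update-addon --cluster-name health-eks-demo --addon-name coredns --addon-version <compatible-version>
    aws eks update-addon --cluster-name health-eks-demo --addon-name aws-ebs-csi-driver --addon-version <compatible-version>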
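
The OIDC provider and snapshot controller from step 4 can be provisioned as sketched below, using eksctl and the upstream kubernetes-csi/external-snapshotter manifests; in practice, pin a release with ?ref=<tag> instead of tracking the default branch:

    # Associate an IAM OIDC provider with the cluster (required for IAM roles for service accounts)
    eksctl utils associate-iam-oidc-provider --cluster health-eks-demo --approve
    # Install the volume snapshot CRDs and the snapshot controller
    kubectl apply -k "github.com/kubernetes-csi/external-snapshotter/client/config/crd"
    kubectl apply -k "github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller"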
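
For steps 6 and 7, assuming the controller is deployed from the ingress-nginx Helm chart under the release name ingress-nginx (the release name, namespace, and output values below are illustrative), the upgrade and the stuck Service looked roughly like this:

    # Upgrade ingress-nginx to the version validated in our LTS release
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --version <LTS-tested-version>
    # The controller Service's EXTERNAL-IP never left 'pending':
    kubectl get svc -n ingress-nginx ingress-nginx-controller
    # NAME                       TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)
    # ingress-nginx-controller   LoadBalancer   10.100.12.34   <pending>     80:31530/TCP,443:31374/TCP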



Troubleshooting Steps:

  • AWS Support Collaboration: Communicated our upgrade steps to the AWS support team for guidance.

  • Logging: Enabled API server, scheduler, and controller manager logging (the command is shown after this list), which unfortunately did not reveal significant insights.

  • Log Collection: Executed the EKS log collector script to gather detailed diagnostics:

    curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
    sudo bash eks-log-collector.sh

  • Runbook Analysis: Followed AWS’s recommended Systems Manager Automation Runbook to analyze the worker nodes and cluster.
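
The control plane logging referenced above was enabled through the EKS cluster-config API; a minimal sketch for this cluster:

    # Enable API server, scheduler, and controller manager logs (delivered to CloudWatch Logs)
    aws eks update-cluster-config --name health-eks-demo \
        --logging '{"clusterLogging":[{"types":["api","scheduler","controllerManager"],"enabled":true}]}'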

Findings and Resolutions:

  • Trust Relationship: Updated the instance role's trust policy to address insufficient permissions for assuming the necessary roles (a sketch follows this list).

  • Kubelet State: Found the kubelet service in a 'dead' state and manually restarted it (commands follow this list).

  • Certificate Issues: x509 certificate errors pointed to incorrect CA certificate details in the node userdata; correcting the CA bundle passed to the bootstrap script enabled the nodes to join (sketched after this list).

  • VPC-CNI Plugin: Added the VPC-CNI plugin via the AWS Console, which moved all nodes to the 'Ready' state and resolved the load balancer issue (a CLI equivalent follows this list).
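
The trust relationship fix amounted to ensuring the EC2 service can assume the node instance role; a sketch, with the role name as a placeholder:

    # Allow EC2 instances to assume the node instance role
    aws iam update-assume-role-policy --role-name <node-instance-role> \
        --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'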
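
The dead kubelet was diagnosed and restarted directly on the affected nodes:

    sudo systemctl status kubelet                        # reported the unit as inactive (dead)
    sudo journalctl -u kubelet --no-pager | tail -n 50   # check recent kubelet logs for the failure cause
    sudo systemctl enable --now kubelet                  # re-enable and start the service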
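
The certificate fix came down to passing the correct cluster CA bundle to the EKS bootstrap script in the node userdata. A minimal sketch for a self-managed node on an EKS-optimized AMI, assuming the instance role is allowed to call eks:DescribeCluster (otherwise bake in the literal values):

    #!/bin/bash
    # Look up the current endpoint and CA bundle instead of hard-coding stale values
    API_SERVER_ENDPOINT=$(aws eks describe-cluster --name health-eks-demo --query "cluster.endpoint" --output text)
    B64_CLUSTER_CA=$(aws eks describe-cluster --name health-eks-demo --query "cluster.certificateAuthority.data" --output text)
    # A stale --b64-cluster-ca here was the source of the x509 errors
    /etc/eks/bootstrap.sh health-eks-demo \
        --apiserver-endpoint "${API_SERVER_ENDPOINT}" \
        --b64-cluster-ca "${B64_CLUSTER_CA}"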
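
A CLI equivalent of the console action that installed the VPC-CNI add-on, followed by the check that nodes reached the 'Ready' state:

    # Install the VPC CNI plugin as an EKS managed add-on
    aws eks create-addon --cluster-name health-eks-demo --addon-name vpc-cni
    # Confirm all nodes transition to Ready
    kubectl get nodes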

Lessons Learned:

  • Bastion Host Implementation: Establish a bastion host so worker nodes remain reachable for debugging even when the cluster itself is unhealthy.

  • Node Group Management: Transition from self-managed to managed node groups so AWS handles AMI rollouts and node bootstrap, reducing management overhead and preventing similar join failures (a starting-point command follows this list).

  • SSH Key Management: Regularly rotate and track SSH keypairs so node access is never blocked during an outage.

  • Infrastructure as Code (IaC): Use Terraform for upgrades to maintain consistent, reliable, and reproducible configurations across our infrastructure.
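
As a starting point for the node-group migration recommended above, a managed spot node group can be created roughly as follows (the node group name, subnets, role ARN, and instance types are placeholders):

    aws eks create-nodegroup \
        --cluster-name health-eks-demo \
        --nodegroup-name health-eks-demo-spot-managed \
        --capacity-type SPOT \
        --instance-types t3.large \
        --subnets <subnet-a> <subnet-b> \
        --node-role arn:aws:iam::<account-id>:role/<node-instance-role>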