AP Production microservices down
Problem Statement:-
Title | AP Production microservices down |
---|---|
References | Complaint Number: [#EUW-588-62195] |
Description | |
Occurrence Date | [14/08/2019] |
Impact | |
RCA Owner | |
Observations/Analysis of the problem:-
AP Prod Microservices stack:
Uptime as of Fri Aug 16 15:08:00 IST 2019:
Kube Controller: 10.0.0.50 - 15:06:13 up 160 days, 22:34, 1 user, load average: 0.02, 0.08, 0.12
etcd: 10.0.0.51 - 15:07:04 up 161 days, 1:31, 1 user, load average: 0.01, 0.04, 0.05
minion1: 10.0.0.52 - 15:07:12 up 3:17, 1 user, load average: 0.11, 0.10, 0.07
minion2: 10.0.0.53 - 15:07:22 up 52 days, 9:15, 1 user, load average: 1.70, 1.88, 1.98
AP pods list: kubectl get pods --all-namespaces
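The node uptimes above were collected host by host; a minimal sketch of the same check, assuming passwordless SSH from the controller to the listed IPs:

```sh
# Collect uptime from every node in the stack (IPs as listed above).
for NODE in 10.0.0.50 10.0.0.51 10.0.0.52 10.0.0.53; do
  echo "--- $NODE ---"
  ssh "$NODE" uptime
done

# List all pods across namespaces; CrashLoopBackOff shows in the STATUS column.
kubectl get pods --all-namespaces -o wide
```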
Because the minion-1 server was restarted, the kube-minions were impacted: the volume mounts were detached on kube-minions 1 and 2. As a result, all the services returned a 502 Bad Gateway error.
As part of the reboot/restart of the 10.0.0.53 minion server:
Automatic volume (NFS) mounting did not happen.
Restarting kubelet, kube-proxy, docker, and flanneld failed: the server threw a "connection timed out" error. The server was in a hung state and no command could be executed on it; systemd was not responding because too many resources were still running in the background, so it could not get a process to run. (A verification sketch follows this list.)
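Whether the minion actually has its services and NFS mount back can be verified directly on the node; a minimal sketch, assuming the mount point /NFS-Server1 used in the solution below:

```sh
# Run on the affected minion once it responds again.
systemctl status kubelet kube-proxy flanneld docker   # confirm which units failed to start
mount -t nfs,nfs4                                     # list the NFS mounts currently attached
df -h /NFS-Server1                                    # errors (or hangs) if the mount is stale or missing
```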
Solution Approach:-
Case 1: On Wed Aug 14 17:00 IST 2019, all the pods in the backbone namespace went into the CrashLoopBackOff state. In AP production, the server hosting one of the minions was restarted for maintenance, which was confirmed by the ctrls team. After this, Kafka, Zookeeper, and nginx were not able to sync up with each other.
Kafka is in CrashLoopBackOff state:
Describing the Kafka pod shows: Error syncing pod, skipping: failed to "StartContainer" for "kafka" with CrashLoopBackOff.
The Kafka logs say: "Unable to connect to zookeeper server within timeout: 6000".
The Zookeeper logs say: "Connection refused".
nginx is in Running status: Successfully assigned nginx-1242251100-8k5v9 to prod-kube-minions.
The nginx logs say: "502 bad gateway error".
Due to the above, nginx is not able to route any requests to the services (the checks below reproduce these symptoms).
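The symptoms listed above map onto standard kubectl checks; a minimal sketch, where the Kafka pod name is a placeholder (the nginx pod name is the one from the event above) and the namespace is assumed to be backbone:

```sh
# Inspect the failing backbone pods (substitute the real pod names
# from the first command's output; "kafka-0" is a placeholder).
kubectl get pods --namespace backbone                     # look for CrashLoopBackOff
kubectl describe pod kafka-0 --namespace backbone         # shows the failed "StartContainer" event
kubectl logs kafka-0 --namespace backbone                 # "Unable to connect to zookeeper server..."
kubectl logs nginx-1242251100-8k5v9 --namespace backbone  # upstream errors behind the 502s
```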
Solution:
Reboot 10.0.0.53; this kills all the processes hung in the background.
On the 10.0.0.53 (minion) server, restart kubelet, kube-proxy, docker, and flanneld: "systemctl restart kubelet kube-proxy flanneld docker"
On the 10.0.0.53 (minion) server, attach the volume mount (NFS) to the server: "mount -t nfs 10.0.0.41:/NFS1-Share /NFS-Server1"
On the 10.0.0.53 (minion) server, check the status: "systemctl status kubelet kube-proxy flanneld docker"
If the services still fail, restart the API server and related services on the controller (10.0.0.50): "for SERVICES in flanneld docker kube-apiserver kube-controller-manager kube-scheduler; do systemctl restart $SERVICES; systemctl enable $SERVICES; systemctl status $SERVICES; done" (a consolidated sketch of the minion-side steps follows below).
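The minion-side steps above can be run as a single recovery pass; a minimal sketch, assuming the NFS export and mount point named in the steps:

```sh
#!/bin/sh
# Recovery sketch for the minion (10.0.0.53), combining the steps above.

# 1. Restart the Kubernetes node services after the reboot.
systemctl restart kubelet kube-proxy flanneld docker || exit 1

# 2. Re-attach the NFS volume (export and mount point as in the steps above).
mount -t nfs 10.0.0.41:/NFS1-Share /NFS-Server1 || exit 1

# 3. Confirm the services and the mount came back.
mount -t nfs,nfs4
systemctl status kubelet kube-proxy flanneld docker
```

To avoid the "automatic volume (NFS) mounting did not happen" failure mode on the next reboot, a persistent /etc/fstab entry such as "10.0.0.41:/NFS1-Share /NFS-Server1 nfs defaults,_netdev 0 0" would remount the share automatically (suggested here as an assumption, not something already in place on the node).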
Timeline:-
Time | Involved Parties | Description |
---|---|---|
 | | |
 | | |
Conclusion:-