Punjab Kibana dashboard losing visualisations (RCA)

Background of the problem:

  1. The Punjab cluster was hitting high memory utilisation (~95%) on a daily basis, since the nodes (m4.large EC2 instances) had just enough memory to serve the daily traffic. During any peak load the system slowed down due to insufficient memory.
  2. During the March 2019 peak traffic, the memory shortfall became severe enough that the system crossed 100% memory utilisation, which resulted in latency at various services and degraded overall system performance.

How did we react:

  1. Analysed CPU and memory utilisation at a granular level (cluster, nodes and pods; a sketch of this kind of pod-level analysis follows this list) and submitted a proposal to the customer requesting approval to resize the cluster nodes from m4.large to m4.xlarge, which would double the available memory. Considering the facts, the customer agreed to the resizing activity and approved immediate implementation to mitigate the latency.
  2. In parallel, we identified the long-running DB queries using Jaeger (distributed tracing) across services (especially the report services) and fixed them by adding indexes and optimising the queries.
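
For reference, the kind of pod-level memory breakdown behind point 1 can be reproduced from the Kubernetes metrics API. The sketch below is illustrative only (not necessarily the exact tooling used at the time) and assumes metrics-server is installed in the cluster and kubeconfig access is available:

```python
# Minimal sketch: list the heaviest memory consumers across the cluster.
# Assumes metrics-server is deployed and a local kubeconfig can reach the cluster.
from kubernetes import client, config

def memory_mib(usage: str) -> float:
    """Convert a metrics-server memory string (e.g. '123456Ki') to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in units.items():
        if usage.endswith(suffix):
            return float(usage[:-len(suffix)]) * factor
    return float(usage) / (1024 * 1024)  # plain bytes

def main() -> None:
    config.load_kube_config()
    metrics = client.CustomObjectsApi().list_cluster_custom_object(
        group="metrics.k8s.io", version="v1beta1", plural="pods"
    )
    usage = []
    for item in metrics["items"]:
        name = f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
        mem = sum(memory_mib(c["usage"]["memory"]) for c in item["containers"])
        usage.append((mem, name))
    # Print the top 20 pods by memory, highest first.
    for mem, name in sorted(usage, reverse=True)[:20]:
        print(f"{mem:8.1f} MiB  {name}")

if __name__ == "__main__":
    main()
```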

What was done as part of the exercise:

  1. Took approval from the customer for a downtime of an hour and a half to resize the cluster from m4.large to m4.xlarge instances. This activity required a rolling update of all services, which is the default behaviour: all pods are drained from the existing nodes and rescheduled on the newly provisioned nodes.
  2. Tested the DB query changes on UAT with the prod dump and, upon seeing the improvement, pushed the changes to production, which resulted in significant performance gains (a sketch of this verification step follows this list).
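
The verification mentioned in point 2 can be illustrated with a small script that times a slow query before and after adding a candidate index on the UAT database restored from the prod dump. The table, column, index and connection details below are hypothetical placeholders, not the actual report queries:

```python
# Hypothetical sketch: compare EXPLAIN ANALYZE timings before/after an index.
# Run only against UAT (restored from the prod dump), never directly on prod.
import psycopg2

UAT_DSN = "host=uat-db dbname=egov user=report_user"       # placeholder DSN
SLOW_QUERY = """
    SELECT service_code, COUNT(*)
    FROM report_transactions                               -- hypothetical table
    WHERE tenant_id = %s AND created_time > %s
    GROUP BY service_code
"""

def show_timing(cur, query, params):
    cur.execute("EXPLAIN ANALYZE " + query, params)
    plan = [row[0] for row in cur.fetchall()]
    print(plan[-1])  # the last plan line reports the measured execution time

with psycopg2.connect(UAT_DSN) as conn, conn.cursor() as cur:
    params = ("tenant-1", 1551398400000)
    show_timing(cur, SLOW_QUERY, params)   # baseline timing on the prod dump
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_report_tenant_created "
        "ON report_transactions (tenant_id, created_time)"
    )
    show_timing(cur, SLOW_QUERY, params)   # timing with the candidate index
```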

Impact of the exercise:

      What went well:

  • DB query execution time improved from ~30 sec to ~2 sec, which resulted in a significant performance improvement at every service as well as for the system as a whole.
  • The end result of solving the memory availability was successful: all the pods of our services had sufficient memory to handle the peak traffic.
  • During the cluster update, all eGov services and infra services like Kafka, ZooKeeper and Redis came back up and ran successfully.

      What went wrong:

  • During the cluster resize activity, the es-cluster pods in the 3-node HA setup failed to connect to each other and form a quorum; the rolling update pushed the cluster into a split-brain scenario and the pods did not start for a while (a pre-flight check of this kind is sketched after this list). In the course of recovering the es-cluster without losing critical data, the Kibana indexes were corrupted, which resulted in losing the Kibana visualisations.
  • We were able to recover the entire cluster without impact to any of the data indices, but the Kibana index hosting the dashboards and visualisations was beyond recovery.
  • These dashboards and visualisations unfortunately differed from UAT, so a simple promotion was not possible, nor were they stored in source control for an easy restore.
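
In hindsight, a simple pre-flight check run before (and between) pod restarts would have surfaced the quorum problem earlier. The sketch below is only illustrative: it assumes direct HTTP access to each Elasticsearch pod (for example via port-forwarding or a headless service), and the addresses are placeholders for the actual 3-node setup:

```python
# Illustrative quorum check: every node should report the same elected master,
# and the cluster should not be red, before the next pod is rolled.
import requests

ES_NODES = [
    "http://es-cluster-0:9200",   # placeholder pod addresses
    "http://es-cluster-1:9200",
    "http://es-cluster-2:9200",
]

def elected_master(node_url: str) -> str:
    rows = requests.get(f"{node_url}/_cat/master?format=json", timeout=5).json()
    return rows[0]["node"] if rows else "<none>"

def check_quorum() -> None:
    masters = {node: elected_master(node) for node in ES_NODES}
    health = requests.get(f"{ES_NODES[0]}/_cluster/health", timeout=5).json()
    print("cluster status:", health["status"], "| nodes:", health["number_of_nodes"])
    if len(set(masters.values())) != 1:
        # Nodes disagree on who the master is: a split-brain symptom.
        raise SystemExit(f"Master mismatch, do not proceed: {masters}")
    print("all nodes agree on master:", masters)

if __name__ == "__main__":
    check_quorum()
```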

 Retrospective

  • Always keep UAT an exact replica of prod in terms of all infrastructure configuration. Perform any production roll-out in UAT first and, upon success, perform it on prod.
  • Redesign our prod Elasticsearch cluster to be HA (details) and fault tolerant.
  • Start source-controlling all the Kibana visualisation indexes periodically, and also take an export before any cluster change (a sketch follows this list).
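
A sketch of how the periodic export could look is below. It assumes a Kibana version that exposes the saved objects export API (7.x and later); the Kibana URL and output directory are placeholders, and the resulting NDJSON file would be committed to source control:

```python
# Hedged sketch: export dashboards/visualisations to a git-tracked NDJSON file.
import datetime
import pathlib
import requests

KIBANA_URL = "http://kibana:5601"             # placeholder
EXPORT_DIR = pathlib.Path("kibana-objects")   # directory tracked in git

def export_saved_objects() -> pathlib.Path:
    resp = requests.post(
        f"{KIBANA_URL}/api/saved_objects/_export",
        headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
        json={"type": ["dashboard", "visualization", "index-pattern", "search"],
              "includeReferencesDeep": True},
        timeout=60,
    )
    resp.raise_for_status()
    EXPORT_DIR.mkdir(exist_ok=True)
    out_file = EXPORT_DIR / f"saved-objects-{datetime.date.today()}.ndjson"
    out_file.write_text(resp.text)   # NDJSON, one saved object per line
    return out_file

if __name__ == "__main__":
    print("exported to", export_saved_objects())
```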

      Start doing:

  • Publish the strategy plan before the actual implementation.
  • Any production change should be tried in UAT first, preferably with a similar setup.
  • Store dashboards and visualisations in source control and have them on UAT.
  • Redesign our Elasticsearch cluster to be HA (details) and fault tolerant.
  • Implement chaos testing and resiliency testing on ES and Kafka.
  • Establish a DB script review cadence to avoid introducing latency in query execution.
  • UAT should be replicated exactly as prod.
  • Track the RCA action items to closure in JIRA.

      Stop doing:

  • DB query check-in without review.
  • Hurried changes to production.
  • Pushing changes to production without having them on UAT.

      Keep doing:

  • Distributed tracing, which has helped us resolve issues quickly.
  • Saying no to on-the-fly requests.
  • No to friendly requests; enforce Ops tickets.
  • Keep the 3-master quorum format for ES and Kafka in production (a configuration sketch follows this list).
  • Test infra services for resiliency.
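
As a reference for the quorum item above, the sketch below shows how the 2-of-3 master quorum would be enforced on an Elasticsearch 6.x cluster, where discovery.zen.minimum_master_nodes is a dynamic setting (it was removed in 7.x, which manages the quorum automatically). The URL is a placeholder:

```python
# Hedged sketch for ES 6.x only: set minimum_master_nodes to a majority of the
# master-eligible nodes (2 for a 3-node setup) to avoid split-brain elections.
import requests

ES_URL = "http://elasticsearch-master:9200"   # placeholder

def enforce_quorum(master_eligible_nodes: int = 3) -> None:
    minimum_masters = master_eligible_nodes // 2 + 1   # majority rule
    resp = requests.put(
        f"{ES_URL}/_cluster/settings",
        json={"persistent": {"discovery.zen.minimum_master_nodes": minimum_masters}},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())

if __name__ == "__main__":
    enforce_quorum()
```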

Action items

  • Gajendran C, Abhishek Jain: figure out how the service discovery strategy is implemented at EkStep (Mathews). Need to explore the headless service discovery topic.
  • Nithin DV: partitioning strategy for tables (time and tenant based) and for Kafka topics.
  • Ghanshyam Rawat: strategy plan to index data based on table columns; query optimisation.
  • Ghanshyam Rawat: DB script review process plan.
  • Abhishek Jain, Nithin DV: review what kind of data is being captured and logged by Jaeger tracing; logs should be masked. (Important)
  • Gajendran C: add production infra alerts / dashboards (ES, Kafka, Zuul, DB) to the roadmap.
  • Nithin DV, Tarun Lalwani: come up with the proof point on the need for synchronous calls to solve the specific use case.
  • Gajendran C: set up the Platform PRDs prioritisation meeting with Abhishek Jain to keep it aligned with the operating plan strategy.
  • Onboard the implementation team to follow the Ops ticket and approval process.