2019-03-28 Punjab Cluster resize activity (RCA)

Date	01 Apr 2019
Team	DevOps
Participants	Gajendran C, Dhananjay Singh, Ghanshyam Rawat, Abhishek Jain, Nithin DV

Background on the Problem:

The Punjab cluster was hitting high memory utilisation (95%) on a daily basis, because its m4.large EC2 nodes had just enough memory to serve the normal daily traffic. During any peak load, the system slowed down for lack of memory.
During the March peak traffic, the memory pressure was severe enough that utilisation crossed 100%, which in turn increased the response latency of various services.

We analysed CPU and memory utilisation at a granular level (cluster, nodes and pods), submitted the report to the customer, and requested approval to resize the cluster nodes from m4.large to m4.xlarge, which doubles the memory per node. Given these findings, the customer agreed to the resize and approved immediate implementation to mitigate the latency.
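As a rough check on the headroom the resize was expected to buy, the arithmetic can be sketched as below. The 8 GiB and 16 GiB figures are the published memory sizes of m4.large and m4.xlarge; the function name is illustrative, and the sketch assumes the same workload on the same number of nodes, which is our simplification rather than something measured.

```python
# Published memory per instance type, in GiB.
INSTANCE_MEMORY_GIB = {"m4.large": 8, "m4.xlarge": 16}


def utilisation_after_resize(current_pct, old_type, new_type):
    """Projected memory utilisation (%) if the same workload runs on the
    same number of nodes of a different instance type (assumption)."""
    used_gib = INSTANCE_MEMORY_GIB[old_type] * current_pct / 100
    return used_gib / INSTANCE_MEMORY_GIB[new_type] * 100


if __name__ == "__main__":
    projected = utilisation_after_resize(95, "m4.large", "m4.xlarge")
    print(f"{projected:.1f}%")  # 47.5%
```

The memory actually allocatable to pods is lower than the instance size, so the real figure would sit somewhat above this estimate.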
In parallel, we analysed long-running DB queries across services using Jaeger (distributed tracing) and fixed them by adding indexes and optimising the queries.
With every existing node in the Punjab cluster at peak memory utilisation, and the implementation team asking for capacity to match the load, we had to resize the cluster by adding larger nodes, in line with the customer's approval and demand.
1. The Punjab cluster resizing activity caused the customer to lose its dashboards and visualisations.
2. During the peak load, the DB was a bottleneck: long-running queries cascaded into the APIs and slowed the system down. We had to fix the queries.

CLUSTER:

We had carried out the cluster upgrade activity on UAT without running into any issues. However, our ES setup differs between the environments: UAT has a single master node while PROD has 3 master nodes.
The cluster resize activity on PROD was largely successful: all egov services and infra services such as Kafka, ZooKeeper and Redis came back up and running. However, our es-cluster was not configured for true HA and had possibly entered a split-brain scenario, eventually corrupting the ES Kibana index.
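The split-brain risk comes down to quorum: with N master-eligible nodes, a master should only be elected when a strict majority of them can see each other. In Elasticsearch versions before 7.0 this majority is what `discovery.zen.minimum_master_nodes` is meant to be set to. The version and the value configured on PROD are not recorded here, so the following is only an illustrative sketch with hypothetical function names.

```python
def master_quorum(master_eligible_nodes):
    """Smallest strict majority of master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_split_brain(master_eligible_nodes, minimum_master_nodes):
    """True if two disjoint groups of nodes could each reach the configured
    minimum and elect their own master."""
    return 2 * minimum_master_nodes <= master_eligible_nodes


if __name__ == "__main__":
    print(master_quorum(3))       # 2
    print(can_split_brain(3, 1))  # True
    print(can_split_brain(3, 2))  # False
```

With 3 master nodes, a minimum of 2 rules out two independent masters, whereas a minimum of 1 allows them. A single-master setup like UAT cannot show this failure mode at all, which is consistent with the UAT run being clean.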
We were able to recover the entire cluster without impact to any of the data indices, but the Kibana index hosting the dashboards and visualisations was beyond recovery.
Unfortunately, the dashboards and visualisations on UAT differed from those on PROD, so a simple promotion was not possible, nor were they stored in source control for an easy restore.
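One way to make dashboards restorable is to write each Kibana saved object to a file in a git repository. The sketch below covers only that step; how the objects are fetched depends on the Kibana version and is left out. The object shape (`type`, `id`, `attributes`) and the sample ids are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_saved_objects(saved_objects, repo_dir):
    """Write each saved object to <repo_dir>/<type>/<id>.json.

    Keys are sorted and indentation is fixed so that re-exporting an
    unchanged object produces an identical file and a clean diff.
    """
    written = []
    for obj in saved_objects:
        target = Path(repo_dir) / obj["type"] / f"{obj['id']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(obj, indent=2, sort_keys=True) + "\n")
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical saved objects; real ones would come from a Kibana export.
    sample = [
        {"type": "dashboard", "id": "collections-overview",
         "attributes": {"title": "Collections overview"}},
        {"type": "visualization", "id": "receipts-per-day",
         "attributes": {"title": "Receipts per day"}},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_saved_objects(sample, tmp):
            print(path.relative_to(tmp))
```

Committing the resulting directory after every dashboard change would give both a restore point and a way to promote the same objects to UAT.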

DB:

APIs were slow during the peak load on the Punjab system, with various modules taking ~20-40s to respond.
Modules were waiting synchronously on the database for search queries to return, which eventually delayed even the persistence of data because of the overhead.
Operations that go through Kafka, such as persistence, indexing and notifications, are not horizontally scalable with the current Kafka partitioning.
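The scaling limit follows from how Kafka consumer groups work: each partition is read by at most one consumer in a group, so consumers beyond the partition count sit idle. The sketch below is a simplified round-robin model, not Kafka's actual assignor; the partition counts and pod names are hypothetical, since the current topic layout is not recorded here.

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin partitions across the consumers of one group.

    Each partition goes to exactly one consumer, so any consumer beyond
    the partition count ends up with nothing to read.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    pods = ["persister-0", "persister-1", "persister-2"]  # hypothetical pod names
    print(assign_partitions(1, pods))
    # {'persister-0': [0], 'persister-1': [], 'persister-2': []}
    print(assign_partitions(6, pods))
    # {'persister-0': [0, 3], 'persister-1': [1, 4], 'persister-2': [2, 5]}
```

Adding pods only helps once the topic has at least as many partitions as there are consumers in the group.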
Reports were consuming much of the DB capacity because of inefficient joins, non-indexed data, nested-loop queries against the DB, etc.
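The effect of a missing index can be seen directly in a query plan. The sketch below uses SQLite from the Python standard library purely as a stand-in for the production database; the table, column and index names and the tenant value are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def build_demo_db():
    """In-memory database with a single hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE receipts (id INTEGER PRIMARY KEY, tenant_id TEXT, "
        "created_at TEXT, amount REAL)"
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    sql = "SELECT * FROM receipts WHERE tenant_id = ?"
    print(query_plan(conn, sql, ("pb.amritsar",)))  # full table scan
    conn.execute("CREATE INDEX idx_receipts_tenant ON receipts (tenant_id)")
    print(query_plan(conn, sql, ("pb.amritsar",)))  # search via the new index
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of the table and the second a search using `idx_receipts_tenant`.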
We fixed DB response times across all modules by identifying long-running queries via Jaeger, a distributed tracing system.
The system performed much better after these optimisations, which brought API response times down from 20-40s to ~1-2s.
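Picking the long-running queries out of trace data can be sketched as below. It assumes spans shaped like Jaeger's JSON output, with `operationName`, a `duration` in microseconds and a list of `tags`, and assumes that DB spans carry the OpenTracing `db.statement` tag; the sample spans and the 1000 ms threshold are made up for illustration.

```python
def slowest_db_spans(spans, threshold_ms=1000):
    """Return (operation, duration_ms, statement) for DB spans at or above
    threshold_ms, slowest first. Durations are taken to be in microseconds."""
    slow = []
    for span in spans:
        tags = {tag["key"]: tag["value"] for tag in span.get("tags", [])}
        if "db.statement" not in tags:
            continue  # not a database span
        duration_ms = span["duration"] / 1000
        if duration_ms >= threshold_ms:
            slow.append((span["operationName"], duration_ms, tags["db.statement"]))
    return sorted(slow, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_spans = [  # hypothetical spans
        {"operationName": "report-search", "duration": 32_000_000,
         "tags": [{"key": "db.statement", "value": "SELECT ... FROM receipts"}]},
        {"operationName": "http-get", "duration": 40_000_000, "tags": []},
        {"operationName": "fetch-by-id", "duration": 12_000,
         "tags": [{"key": "db.statement", "value": "SELECT ... WHERE id = ?"}]},
    ]
    for operation, duration_ms, statement in slowest_db_spans(sample_spans):
        print(f"{operation}: {duration_ms:.0f} ms: {statement}")
```

Ranking by duration in this way gives a shortlist of statements to examine for missing indexes or inefficient joins.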

Start doing:
- Publish a strategy plan before the actual implementation.
- Any production change should be tried in UAT first, preferably with a similar setup.
- Store dashboards and visualisations in source control and have them on UAT.
- Redesign our Elasticsearch cluster to be HA and fault tolerant.
- Perform chaos testing and resiliency testing on ES and Kafka.
- DB script review cadence.
- UAT should replicate PROD exactly.

Stop doing:
- DB query check-in without review.
- Hurried changes to production.
- Pushing changes to production without having them on UAT.

Keep doing:
- Distributed tracing, which has helped us resolve issues quickly.
- Saying no to on-the-fly requests.
- Saying no to friendly requests; enforce Ops tickets.
- Keep the 3-master quorum format for ES and Kafka in production.
- Test infra services for resiliency.

Owner	Action	Due date
Gajendran C, Abhishek Jain	Find out which service discovery strategy is being followed at EkStep (Mathews); need to explore the headless service discovery topic	08 Apr 2019, 10 Apr 2019
Nithin DV	Partitioning strategy for tables (time- and tenant-based) and for Kafka topics	05 Apr 2019
Ghanshyam Rawat	Strategy plan to index data based on table columns; query optimisation	05 Apr 2019
Ghanshyam Rawat	DB script review process plan	05 Apr 2019
Abhishek Jain, Nithin DV	Review what kind of data is being captured and logged by Jaeger tracing; logs should be masked (important)	04 Apr 2019
Gajendran C	Add production infra alerts / dashboards (ES, Kafka, Zuul, DB) to the roadmap	30 Apr 2019
Nithin DV, Tarun Lalwani	Come up with the proof point on the need for synchronous calls to solve the specific use case	
Gajendran C	Set up the Platform PRDs prioritisation meeting with Abhishek Jain to keep it aligned with the operating plan strategy	04 Apr 2019
	Onboard implementation team	