
Background

  1. Due to peak memory utilisation across all existing nodes in the Punjab cluster, and the implementation team's request to match the capacity to the load, we had to resize the cluster by adding larger nodes, with the customer's approval and at their demand.
    1. The Punjab cluster resizing activity caused the customer's dashboards and visualisations to be lost.
  2. During peak load, the DB became a bottleneck due to long-running queries; this cascaded into the APIs and the system became slow.
    1. We had to fix the queries.

Retrospective

CLUSTER:

  • Though the cluster resize task itself was successful, the way the stateful pods (es-cluster) are configured, and the way Kubernetes manages them via headless service discovery, resulted in the ES master pods failing to connect to each other. By the time the DevOps team had troubleshot and recovered the ES masters, Kibana had lost its visualisation and dashboard indexes.
  • There was no impact to the ES data, but Kibana lost its dashboard and visualisation indexes (see the health-check sketch below).
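After a resize like this, the first thing to verify is that the es-cluster masters have re-formed a quorum behind the headless service. Below is a minimal health-check sketch in Python; the URL is a placeholder, not the actual Punjab cluster endpoint, and a 503 from `_cat/master` corresponds to the "masters not connecting to each other" state described above.

```python
"""Post-resize sanity check for the es-cluster StatefulSet.

A minimal sketch, assuming the ES HTTP endpoint has been port-forwarded
to localhost:9200 (placeholder, not the real cluster address).
"""
import requests

ES_URL = "http://localhost:9200"  # assumption: port-forwarded es-cluster HTTP endpoint


def check_cluster() -> None:
    # Confirm a master has actually been elected. A 503
    # (master_not_discovered_exception) here is the symptom we hit when
    # the ES master pods could not see each other after the resize.
    master = requests.get(f"{ES_URL}/_cat/master?format=json", timeout=30)
    if not master.ok:
        print(f"no elected master (HTTP {master.status_code}) - discovery still broken")
        return
    print(f"elected master: {master.json()[0]['node']}")

    # Cluster-wide view: status should be green and the node count should
    # match the expected size after adding the new nodes.
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=30).json()
    print(f"status={health['status']} nodes={health['number_of_nodes']}")


if __name__ == "__main__":
    check_cluster()
```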

DB:

  • APIs were slow during the peak load from the Punjab system.
  • Reports were consuming much of the DB capacity due to inefficient joins, non-indexed data, nested-loop queries, etc. (a review sketch follows this list).
    • We fixed the DB queries and brought down the …
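As a concrete illustration of the query-review step, the sketch below assumes a PostgreSQL reporting database reached via psycopg2; the DSN, table and column names are hypothetical placeholders, not the real report schema. It runs EXPLAIN ANALYZE on a report-style join to expose nested-loop plans over unindexed columns, then adds the kind of index used to bring the cost down.

```python
"""Query-review sketch: assumes a PostgreSQL reporting DB and psycopg2;
the DSN, tables and columns below are illustrative, not the real schema."""
import psycopg2

DSN = "dbname=reports host=localhost"  # placeholder connection string

# Hypothetical report join of the shape that showed up as a nested loop
# over a sequential scan during peak load.
REPORT_QUERY = """
    SELECT s.district_id, COUNT(*)
    FROM submissions s
    JOIN forms f ON f.id = s.form_id
    WHERE s.created_at >= %s
    GROUP BY s.district_id
"""


def explain(cur, query, params):
    # EXPLAIN ANALYZE prints the actual plan, so nested-loop joins and
    # sequential scans on unindexed columns are visible immediately.
    cur.execute("EXPLAIN ANALYZE " + query, params)
    for (line,) in cur.fetchall():
        print(line)


with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    explain(cur, REPORT_QUERY, ("2021-06-01",))
    # The kind of fix applied during the review: index the join and
    # filter columns so the planner can switch to an index scan.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_submissions_form_created "
        "ON submissions (form_id, created_at)"
    )
```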
Start doing:

  • Publish the strategy plan before the actual implementation.
  • Try any production change in UAT first.
  • Take periodic backups of the Kibana indexes into GitHub (a backup sketch follows this section).
  • Perform chaos testing and resiliency testing on ES and Kafka.
  • Establish a DB script review cadence.

Stop doing:

  • DB query check-ins without review.

Keep doing:

  • Saying no to on-the-fly requests.
  • No friendly requests; enforce Ops tickets.
  • Keep the 3-master quorum format for ES and Kafka in production.
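For the "take Kibana index backups into GitHub" item, one lightweight approach is a small script run from cron or CI that exports the saved objects and commits the resulting file. The sketch below assumes a Kibana version that exposes the saved objects export API (/api/saved_objects/_export, 7.x and later) on a placeholder URL; pushing the exported file to GitHub is left to the surrounding job.

```python
"""Periodic Kibana dashboard/visualisation backup sketch.

Assumes Kibana 7.x+ (saved objects export API) reachable on a placeholder
URL; the Git commit/push step is handled outside this script.
"""
import datetime

import requests

KIBANA_URL = "http://localhost:5601"  # assumption: port-forwarded Kibana


def export_saved_objects(path: str) -> None:
    # The export API streams an NDJSON bundle of the requested object
    # types - exactly the objects that were lost with the Kibana indexes.
    resp = requests.post(
        f"{KIBANA_URL}/api/saved_objects/_export",
        headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
        json={"type": ["dashboard", "visualization", "index-pattern", "search"]},
        timeout=60,
    )
    resp.raise_for_status()
    with open(path, "wb") as f:
        f.write(resp.content)


if __name__ == "__main__":
    stamp = datetime.date.today().isoformat()
    export_saved_objects(f"kibana-backup-{stamp}.ndjson")
```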

Action items

