...

  1. Analysed CPU and memory utilisation at a granular level (cluster, nodes and pods), submitted the report to the customer, and requested approval to increase memory by resizing the cluster nodes from m4.large to m4.xlarge, which doubles the available memory. Considering these facts, the customer agreed to proceed with the resizing activity and approved implementing it immediately to mitigate the latency.
  2. We also analysed the long-running DB queries across services using Jaeger (distributed tracing) and fixed them by adding indexes and optimising the queries (see the trace-query sketch after this list).
  3. Given the peak memory utilisation of all the existing nodes in the Punjab cluster and the implementation team's demand to match the load, we had to resize the cluster by adding bigger nodes, in line with the customer's approval and demand.
    1. The Punjab cluster resizing activity caused the customer dashboard to lose its visualisations.
  4. During peak load, the DB became a bottleneck: long-running queries cascaded into the APIs and the whole system slowed down.
    1. We had to fix the queries.
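For illustration, a minimal sketch of how long-running spans can be pulled out of Jaeger, assuming the query service's HTTP endpoint on its default port; the host and service name below are placeholders, not our actual configuration.

```python
"""Hedged sketch: list slow traces from the Jaeger query service.

Assumes the Jaeger UI's HTTP endpoint (/api/traces) is reachable in-cluster;
host, service name and thresholds below are illustrative only.
"""
import requests

JAEGER_QUERY = "http://jaeger-query:16686/api/traces"   # assumed address

def slow_traces(service: str, min_duration: str = "5s", limit: int = 20):
    # minDuration filters traces whose duration exceeds the threshold.
    params = {"service": service, "minDuration": min_duration,
              "lookback": "1h", "limit": limit}
    resp = requests.get(JAEGER_QUERY, params=params, timeout=10)
    resp.raise_for_status()
    for trace in resp.json().get("data", []):
        # Span durations are reported in microseconds.
        worst = max(trace["spans"], key=lambda s: s["duration"])
        print(trace["traceID"], worst["operationName"],
              f'{worst["duration"] / 1_000_000:.1f}s')

if __name__ == "__main__":
    slow_traces("egov-property-services")   # hypothetical service name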

Retrospective

...

CLUSTER:

...

What did we do:

  1. Took approval from the customer for an hour-and-a-half downtime window to resize the cluster from m4.large to m4.xlarge instances. This activity required a rolling update of all the services, which is the default behaviour: all pods are drained from the existing nodes and rescheduled onto the newly provisioned nodes (a simplified drain sketch follows this list).
  2. Tested the DB query changes on UAT with a prod dump and, upon seeing the improvement, pushed the changes to production, which resulted in a large performance improvement.
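For context, a simplified sketch of what the drain step does, assuming the Kubernetes Python client; the real `kubectl drain` goes through the eviction API and honours PodDisruptionBudgets, and the node name below is a placeholder.

```python
"""Illustrative sketch of the drain behaviour described above: cordon an old
node and delete its pods so the scheduler places them on the new m4.xlarge
nodes. Simplified on purpose; not the exact procedure we ran."""
from kubernetes import client, config

def drain_node(node_name: str) -> None:
    config.load_kube_config()           # or load_incluster_config()
    v1 = client.CoreV1Api()

    # Cordon: mark the node unschedulable so nothing new lands on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Delete the pods running on the node; their Deployments/StatefulSets
    # recreate them on the remaining schedulable (new) nodes.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

if __name__ == "__main__":
    drain_node("ip-10-0-1-23.ec2.internal")   # hypothetical node name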

Impact of overall activity:

      What went well:

  • DB query execution time improved from ~30 sec to ~2 sec, which resulted in a large performance improvement at every service level and for the system as a whole.
  • The end result of solving the memory availability issue was successful: all the pods of our services had sufficient memory to handle the peak traffic.
  • During the cluster update, all egov services and infra services such as Kafka, Zookeeper and Redis were back up and running

...

  • successfully.

      What went wrong:

  • Since the ES cluster was not configured for true HA, it possibly entered a split-brain scenario during the rolling update and its pods were not starting; while we were trying to recover the critical data, the ES Kibana index eventually got corrupted, which resulted in losing the Kibana visualisations (a cluster health-check sketch follows this list).
  • We were able to recover the entire cluster without impact to any of the data indices, but the Kibana index hosting the dashboards and visualisations was beyond recovery.
  • These dashboards and visualisations were unfortunately different

...

  • from UAT for a simple promotion, nor were they stored in source control for an easy restore.
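As a reference for the HA redesign, a minimal health-check sketch, assuming the Python `elasticsearch` client and an illustrative in-cluster endpoint; a split-brain-resistant setup needs at least three master-eligible nodes (quorum of two).

```python
"""Sketch of a pre/post-update health check for the ES cluster.
The endpoint name is illustrative, not our actual service address."""
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-data:9200"])   # assumed service name

health = es.cluster.health()
print("status:", health["status"], "| nodes:", health["number_of_nodes"])

# List nodes and which one is the elected master; with fewer than three
# master-eligible nodes, a rolling restart can leave the cluster without quorum.
for node in es.cat.nodes(format="json", h="name,node.role,master"):
    print(node)
```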

DB:

...

  • We improved DB response times across all modules by identifying long-running queries via Jaeger, a distributed tracing system.
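A hedged sketch of how such a query fix can be validated on the UAT copy before promotion: time the plan with EXPLAIN ANALYZE, apply a candidate index, and compare. The table, column and index names are hypothetical and assume a PostgreSQL database with psycopg2.

```python
"""Hypothetical verification of a slow query and a candidate index on UAT."""
import psycopg2

SLOW_QUERY = """
    SELECT * FROM eg_pt_property            -- hypothetical table
    WHERE tenantid = %s AND status = 'ACTIVE'
"""

with psycopg2.connect("dbname=uat user=egov host=uat-db") as conn:
    with conn.cursor() as cur:
        cur.execute("EXPLAIN ANALYZE " + SLOW_QUERY, ("pb.amritsar",))
        print("\n".join(row[0] for row in cur.fetchall()))   # baseline plan

        # Candidate fix: index the filter columns, then re-check the plan.
        cur.execute("""CREATE INDEX IF NOT EXISTS idx_pt_tenant_status
                       ON eg_pt_property (tenantid, status)""")
        cur.execute("EXPLAIN ANALYZE " + SLOW_QUERY, ("pb.amritsar",))
        print("\n".join(row[0] for row in cur.fetchall()))   # plan after index
```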

...

 Retrospective

Info
  • Always keep UAT as an exact replica of prod in terms of all infrastructure configuration. Perform any production roll-out in UAT first and, upon success, perform it on prod.
  • Start source-controlling all the Kibana visualisation indexes (a minimal export sketch follows).
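A minimal export sketch, assuming a Kibana version that exposes the saved-objects export API; the host below is a placeholder. The NDJSON output can be committed to git and re-imported on UAT or prod.

```python
"""Sketch: export Kibana dashboards/visualisations for source control,
assuming POST /api/saved_objects/_export is available."""
import requests

KIBANA = "http://kibana:5601"   # assumed in-cluster address

resp = requests.post(
    f"{KIBANA}/api/saved_objects/_export",
    headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
    json={"type": ["dashboard", "visualization"], "includeReferencesDeep": True},
    timeout=30,
)
resp.raise_for_status()

with open("kibana-dashboards.ndjson", "w") as fh:
    fh.write(resp.text)          # NDJSON export, ready to commit to git
```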



Start doing

  • Publish strategy plan before the actual implementation.
  • Any production change should be tried in UAT first and preferably with a similar setup.
  • Store dashboards and visualisations in source control and have them on UAT.
  • Redesign our elastic search cluster to be HA (details) and fault tolerant.
  • Need to perform:
    • Chaos testing and resiliency testing on ES, Kafka
  • DB script review cadence.
  • UAT should be replicated exactly as prod.

Stop doing

  • DB query check-in without review.
  • Hurried changes to production.
  • Pushing changes to production without having them on UAT.

Keep doing

  • Distributed tracing, which has helped us resolve issues quickly.
  • Saying no to on-the-fly requests.
  • No to friendly requests; enforce Ops tickets.
  • Keep the 3-master quorum format for ES, Kafka in production.
  • Test infra services for resiliency.

...