...

Retrospective


CLUSTER:

  • Though the cluster resize task itself was successful, the way the stateful pods (es-cluster) are configured and discovered through the Kubernetes headless service resulted in the ES master pods not being able to connect to each other. By the time the DevOps team troubleshot and recovered the ES masters, Kibana had lost its visualisation and dashboard indices.
  • There was no impact to the ES data, but Kibana lost its dashboard and visualisation indices.
  • We had carried out the cluster upgrade activity on UAT without running into any issues; however, our ES setup differs between environments: UAT has a single master while PROD has 3 master nodes.
  • The cluster resize activity on PROD was largely successful; all eGov services and infra services like Kafka, Zookeeper, and Redis were back up and running. Our es-cluster, not being configured for true HA, had possibly entered a split-brain scenario, eventually corrupting the Kibana index in ES.
  • We were able to recover the entire cluster without impact to any of the data indices, but the Kibana index hosting the dashboards and visualisations was beyond recovery.
  • Unfortunately, these dashboards and visualisations differed from those on UAT, so a simple promotion was not possible, nor were they stored in source control for an easy restore (a backup sketch follows this list).
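
A lightweight way to act on the "store dashboards in source control" lesson is to export Kibana saved objects on a schedule and commit the output. The sketch below is a minimal example in Python using the requests library; the Kibana URL, credentials, and output directory are placeholder assumptions, and the Saved Objects export API shown is only available on Kibana 7.x and later (on older versions, snapshotting the .kibana index is the equivalent).

```python
"""
Minimal sketch: export Kibana dashboards and visualisations to an NDJSON
file that can be committed to source control (e.g. GitHub).

Assumptions (not taken from the retrospective):
  - Kibana 7.x+ with the Saved Objects export API available.
  - KIBANA_URL, AUTH and BACKUP_DIR are placeholders.
"""
import datetime
import pathlib

import requests  # third-party: pip install requests

KIBANA_URL = "http://kibana.example.internal:5601"  # placeholder endpoint
AUTH = None                                         # e.g. ("elastic", "changeme")
BACKUP_DIR = pathlib.Path("kibana-backups")


def export_saved_objects(types=("dashboard", "visualization")) -> str:
    """Return an NDJSON export of the requested saved-object types."""
    resp = requests.post(
        f"{KIBANA_URL}/api/saved_objects/_export",
        headers={"kbn-xsrf": "true"},
        json={"type": list(types), "includeReferencesDeep": True},
        auth=AUTH,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    BACKUP_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()
    out = BACKUP_DIR / f"saved-objects-{stamp}.ndjson"
    out.write_text(export_saved_objects())
    print(f"wrote {out} -- commit this file to source control")
```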

DB:

  • APIs were slow during peak load from the Punjab system, with various modules taking ~20-40s to respond.
  • Modules were synchronously waiting on the database for search queries to return, which eventually delayed even the persistence of data because of the added overhead.
  • All operations happening via Kafka, such as persistence, indexing, and notifications, are not horizontally scalable due to the current Kafka partitioning.
  • Reports were taking up much of the DB capacity due to inefficient joins, non-indexed data, nested-loop queries to the DB, etc.
    • We fixed the DB queries and brought down response times across all modules by identifying long-running queries via Jaeger, a distributed tracing system (a database-side check is sketched after this list).
  • The system was performing much better after these optimisations, bringing API response times down from 20-40s to ~1-2s.
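
The long-running queries were found through Jaeger traces; a complementary database-side check is to poll the server's own activity view for statements that exceed a threshold. The sketch below assumes the backing database is PostgreSQL (not stated in this retrospective) and uses psycopg2 with placeholder connection settings.

```python
"""
Minimal sketch: list currently running statements that have exceeded a
runtime threshold, as a database-side complement to the Jaeger traces.

Assumptions (not taken from the retrospective):
  - The backing database is PostgreSQL.
  - The DSN below is a placeholder.
"""
import psycopg2  # third-party: pip install psycopg2-binary

DSN = "host=db.example.internal dbname=egov user=report_ro password=changeme"
THRESHOLD = "5 seconds"

SLOW_QUERY_SQL = """
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 120)    AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND query_start IS NOT NULL
      AND now() - query_start > %s::interval
    ORDER BY runtime DESC;
"""


def list_slow_queries():
    """Return (pid, runtime, state, query) rows for long-running statements."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(SLOW_QUERY_SQL, (THRESHOLD,))
        return cur.fetchall()


if __name__ == "__main__":
    for pid, runtime, state, query in list_slow_queries():
        print(f"pid={pid} runtime={runtime} state={state}\n  {query}")
```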


Start doing

  • Publish the strategy plan before the actual implementation.
  • Try any production change in UAT first, preferably with a similar setup.
  • Take Kibana index backups periodically into GitHub.
  • Store dashboards and visualisations in source control and have them on UAT.
  • Redesign our Elasticsearch cluster to be truly HA and fault tolerant.
  • Perform chaos testing and resiliency testing on ES and Kafka.
  • Establish a DB script review cadence.

Stop doing

  • DB query check-ins without review.
  • Hurried changes to production.
  • Pushing changes to production without having them on UAT.

Keep doing

  • Distributed tracing, which has helped us resolve issues quickly.
  • Saying no to on-the-fly requests.
  • Saying no to friendly requests; enforce Ops tickets.
  • Keep the 3-master quorum format for ES and Kafka in production (see the health-check sketch after this list).
  • Test infra services for resiliency.
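
The "keep the 3-master quorum" and "test infra services for resiliency" items lend themselves to a routine smoke check after any cluster change. The sketch below is a minimal example against the standard Elasticsearch REST APIs: it verifies that the cluster reports green and that three master-eligible nodes are visible. The endpoint and expected master count are placeholder assumptions.

```python
"""
Minimal sketch: post-change smoke check that the ES cluster is healthy and
still has three master-eligible nodes in view.

Assumptions (not taken from the retrospective):
  - ES_URL is a placeholder; adjust for the real cluster endpoint and auth.
  - Uses the standard Elasticsearch REST APIs (6.x / 7.x).
"""
import sys

import requests  # third-party: pip install requests

ES_URL = "http://elasticsearch.example.internal:9200"  # placeholder endpoint
EXPECTED_MASTERS = 3


def check_cluster() -> bool:
    """Return True if the cluster is green and has the expected master quorum."""
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=30).json()
    nodes = requests.get(
        f"{ES_URL}/_cat/nodes",
        params={"h": "name,node.role", "format": "json"},
        timeout=30,
    ).json()
    # A node is master-eligible if its role string contains 'm'.
    masters = [n["name"] for n in nodes if "m" in n.get("node.role", "")]

    print(f"cluster status        : {health['status']}")
    print(f"master-eligible nodes : {masters}")
    return health["status"] == "green" and len(masters) == EXPECTED_MASTERS


if __name__ == "__main__":
    sys.exit(0 if check_cluster() else 1)
```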

Action items

...