...

  1. Memory utilisation peaked on all existing nodes in the Punjab cluster, and the implementation team asked for capacity to match the load, so we resized the cluster by adding higher-capacity nodes. This was done at the customer's request and with their approval.
    1. The Punjab cluster resizing activity caused the customer dashboard to lose its visualisations and dashboards.
  2. During peak load, the DB became the bottleneck: long-running queries cascaded into the APIs and the system was slow.
    1. We had to fix the queries.
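
One way to spot such queries before they cascade into the APIs is to flag any statement that has been running past a threshold. The sketch below is illustrative only. The source does not name the database, so the `RunningQuery` record, its fields, and the 30-second threshold are all assumptions standing in for whatever activity view the database actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningQuery:
    """Hypothetical snapshot of one in-flight query (not a real DB API)."""
    query_id: int
    sql: str
    elapsed_seconds: float


def find_long_running(queries, threshold_seconds=30.0):
    """Return the queries running longer than the threshold, slowest first."""
    slow = [q for q in queries if q.elapsed_seconds > threshold_seconds]
    return sorted(slow, key=lambda q: q.elapsed_seconds, reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    snapshot = [
        RunningQuery(1, "SELECT 1", 2.5),
        RunningQuery(2, "SELECT * FROM big_table", 95.0),
        RunningQuery(3, "UPDATE orders SET status = 'x'", 41.0),
    ]
    for q in find_long_running(snapshot):
        print(q.query_id, q.elapsed_seconds, q.sql)
```

A check like this could feed an alert or a review queue; the right threshold depends on the workload.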

...

Start doing
  • Publish the strategy plan before the actual implementation.
  • Try every production change in UAT first, preferably on a similar setup.
  • Store dashboards and visualisations in source control and have them on UAT.
  • Redesign our Elasticsearch cluster to be truly HA (details) and fault tolerant.
  • Perform chaos testing and resiliency testing on ES and Kafka.
  • Set up a DB script review cadence.
  • Replicate UAT exactly as Prod.

Stop doing
  • Checking in DB queries without review.
  • Hurried changes to production.
  • Pushing changes to production without having them on UAT.

Keep doing
  • Distributed tracing, which has helped us resolve issues quickly.
  • Saying no to on-the-fly requests.
  • Saying no to friendly requests; enforce Ops tickets.
  • Keep the three-master quorum setup for ES and Kafka in production.
  • Test infra services for resiliency.
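
The three-master quorum item rests on simple majority arithmetic: a majority-based quorum of n voting nodes needs floor(n/2) + 1 of them available. The sketch below is a generic illustration of that rule, not the configuration of any particular product, and the function names are made up for this example.

```python
def quorum_size(master_nodes: int) -> int:
    """Smallest majority among the given number of voting (master) nodes."""
    if master_nodes < 1:
        raise ValueError("need at least one master node")
    return master_nodes // 2 + 1


def tolerated_failures(master_nodes: int) -> int:
    """How many voting nodes can be lost while a majority still remains."""
    return master_nodes - quorum_size(master_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

With three masters the quorum is two, so one master can fail without losing the majority. Two masters tolerate no failures, and a fourth adds no tolerance over three.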

...

  •  Gajendran C, Abhishek Jain: Find out which service discovery strategy is being followed at EkStep (Mathews); need to explore the headless service discovery topic.
  •  Nithin DV: Partitioning strategy for tables (time- and tenant-based) and for Kafka topics.
  •  Ghanshyam Rawat: Strategy plan for indexing data based on table columns, and query optimisation.
  •  Ghanshyam Rawat: DB script review process plan.
  •  Abhishek Jain, Nithin DV: Review what data is being captured and logged by Jaeger tracing; logs should be masked (Important).
  •  Gajendran C: Add production infra alerts / dashboards (ES, Kafka, Zuul, DB) to the roadmap.
  •  Nithin DV, Tarun Lalwani: Come up with the proof point on the need for synchronous calls to solve the specific use case.
  •  Gajendran C: Set up the Platform PRDs prioritisation meeting with Abhishek Jain to keep it aligned with the operating plan strategy.
  •  Onboard implementation team
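
For the Jaeger review item above, masking could be applied to tag or log fields before they are recorded. The sketch below is a minimal illustration in plain Python and does not use the Jaeger client API. The set of sensitive keys and the `mask_tags` helper are assumptions; the real list should come out of the review.

```python
# Assumed set of sensitive keys; replace with whatever the review identifies.
SENSITIVE_KEYS = {"password", "authorization", "token", "mobile", "email"}
MASK = "****"


def mask_tags(tags):
    """Return a copy of a tag/log-field mapping with sensitive values masked.

    Keys are matched case-insensitively; nested dicts are masked recursively.
    The input mapping is left unchanged.
    """
    masked = {}
    for key, value in tags.items():
        if str(key).lower() in SENSITIVE_KEYS:
            masked[key] = MASK
        elif isinstance(value, dict):
            masked[key] = mask_tags(value)
        else:
            masked[key] = value
    return masked


if __name__ == "__main__":
    # Made-up example fields, for illustration only.
    span_tags = {
        "http.method": "POST",
        "Authorization": "Bearer abc123",
        "user": {"email": "someone@example.com", "id": 7},
    }
    print(mask_tags(span_tags))
```

Masking by key name only catches fields that are labelled as sensitive; values embedded in free-text log messages would need a separate check.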