

Background on the Problem:

  1. The Punjab cluster hit high memory utilisation (around 95%) on a daily basis, because its nodes (m4.large EC2 instances) had just enough memory to serve the daily traffic. During any peak load the system slowed down due to insufficient memory.
  2. During the March peak traffic the memory pressure became so severe that demand exceeded 100% of the available memory, which in turn increased the response latency of various services.
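
To make the headroom problem concrete, the arithmetic can be sketched as below. This is only an illustration: it assumes memory use grows roughly linearly with traffic, which is a simplification, and both function names are hypothetical rather than part of any tooling used on the cluster.

```python
def utilisation_after_resize(utilisation: float, capacity_factor: float = 1.0) -> float:
    """Utilisation for the same load once node memory is multiplied by capacity_factor."""
    if utilisation <= 0 or capacity_factor <= 0:
        raise ValueError("utilisation and capacity_factor must be positive")
    return utilisation / capacity_factor


def max_traffic_multiplier(utilisation: float) -> float:
    """Largest traffic multiplier that still fits in memory.

    Assumption: memory use scales linearly with traffic.
    """
    if utilisation <= 0:
        raise ValueError("utilisation must be positive")
    return 1.0 / utilisation


if __name__ == "__main__":
    daily = 0.95  # the ~95% daily utilisation reported above
    print(f"traffic growth that still fits: {max_traffic_multiplier(daily):.3f}x")
    doubled = utilisation_after_resize(daily, capacity_factor=2.0)
    print(f"same load with twice the memory: {doubled:.1%}")
```

Under that assumption, a cluster running at 95% can absorb only about 5% more traffic before it runs out of memory, while the same load on twice the memory sits just under 50%.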

How did we react:

  1. Analysed CPU and memory utilisation at a granular level (cluster, nodes and pods), submitted the report to the customer, and requested approval to resize the cluster nodes from m4.large to m4.xlarge, which doubles the memory per node. Given these findings, the customer agreed to the resizing and approved immediate implementation to mitigate the latency.
  2. We also analysed the long-running DB queries across services using Jaeger (distributed tracing) and fixed them by adding indexes and optimising the queries.
  3. With all existing nodes in the Punjab cluster at peak memory utilisation, and the implementation team asking for capacity to match the load, we resized the cluster by adding larger nodes in line with the customer's approval.
    1. The Punjab cluster resizing activity caused the customer dashboard to lose its visualisations and dashboards.
  4. During the peak load the DB was a bottleneck: long-running queries cascaded into the APIs and slowed the system down.
    1. We had to fix the queries.
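
The effect of adding an index to a slow query can be sketched as follows. This is an illustration, not the actual fix: the page does not name the database engine, so SQLite (from the Python standard library) is used as a stand-in, and the `payments` table, its columns and the index name are all hypothetical.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the textual plan SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for one of the slow tables.
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, tenant_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments (tenant_id, status) VALUES (?, ?)",
    [(f"tenant-{i % 50}", "PAID") for i in range(5000)],
)

sql = "SELECT id FROM payments WHERE tenant_id = ?"

# Without an index the filter forces a scan of the whole table.
before = query_plan(conn, sql, ("tenant-7",))

# With an index on the filtered column the engine can search instead of scan.
conn.execute("CREATE INDEX idx_payments_tenant ON payments (tenant_id)")
after = query_plan(conn, sql, ("tenant-7",))

print("before:", before)
print("after: ", after)
```

Comparing the plan before and after is a cheap way to confirm that a new index is actually used by the query it was meant to fix. The production engine will have its own `EXPLAIN` output and index options.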

...

  •  Gajendran C (Unlicensed), Abhishek Jain to find out which service discovery strategy is followed at EkStep (Mathews). Need to explore the headless service discovery topic.
  •  Nithin DV (Unlicensed) Partitioning strategy for tables (time and tenant based, and on Kafka topics)
  •  Ghanshyam Rawat Strategy plan for indexing data based on table columns, and query optimisation
  •  Ghanshyam Rawat DB script review process plan
  •  Abhishek Jain, Nithin DV (Unlicensed) To review what kind of data is being captured and logged by Jaeger tracing; logs should be masked (Important)
  •  Gajendran C (Unlicensed) Add production infra alerts / dashboards (ES, Kafka, Zuul, DB) to the roadmap
  •  Nithin DV (Unlicensed), Tarun Lalwani To come up with a proof point on the need for synchronous calls to solve the specific use case.
  •  Gajendran C (Unlicensed) Set up the Platform PRDs prioritisation meeting with Abhishek Jain to keep it aligned with the operating plan strategy.
  •  Onboard implementation team
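
For the action item on masking data captured by tracing, a minimal sketch of the idea is shown below. It assumes the values attached to a trace span are available as a plain dictionary before they are recorded. The key list, the email pattern and the `mask_tags` helper are all hypothetical and are not part of Jaeger or any client library.

```python
import re

# Hypothetical list of keys whose values must never reach logs or traces.
SENSITIVE_KEYS = {"password", "token", "mobile_number", "email"}

# Simple pattern for email-like strings embedded in other values.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_tags(tags: dict) -> dict:
    """Return a copy of tags that is safe to attach to a span or log line."""
    masked = {}
    for key, value in tags.items():
        if key.lower() in SENSITIVE_KEYS:
            # Drop the whole value for known-sensitive keys.
            masked[key] = "***"
        elif isinstance(value, str):
            # Scrub email-like substrings from free-text values such as URLs.
            masked[key] = _EMAIL.sub("***", value)
        else:
            masked[key] = value
    return masked


if __name__ == "__main__":
    print(mask_tags({"password": "secret", "http.url": "/users?u=a@b.com", "status": 200}))
```

Masking at the point where tags are built keeps the rule in one place. The review in the action item would decide which keys and patterns actually need to be covered.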