...
- Analysed CPU and memory utilisation at a granular level (cluster, nodes and pods), submitted the report to the customer, and requested approval to increase memory by resizing the cluster nodes from M4.large to M4.xlarge, which doubles the memory per node. Based on these findings, the customer agreed to the resize and approved immediate implementation to mitigate the latency.
- We also analysed long-running DB queries across services using Jaeger (distributed tracing) and fixed them by adding indexes and optimising the queries.
...
- DB query execution time improved from ~30 sec to ~2 sec, which significantly improved performance at every service level and for the system as a whole.
- The memory availability issue was resolved: all pods of our services had sufficient memory to handle peak traffic.
- Following the cluster update, all egov services and infra services such as kafka, zookeeper and redis came back up and ran successfully.
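
The effect of adding an index on a slow query can be illustrated with a minimal sketch. It uses Python's built-in SQLite as a stand-in for the production database, and the `bills` table, `tenant_id` column and index name are hypothetical, not taken from the actual schema. It only shows how the query plan changes from a full table scan to an index search once the filter column is indexed.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


# Hypothetical table standing in for a service table filtered by tenant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bills (id INTEGER PRIMARY KEY, tenant_id TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO bills (tenant_id, amount) VALUES (?, ?)",
    [(f"tenant-{i % 50}", float(i)) for i in range(5000)],
)

sql = "SELECT id, amount FROM bills WHERE tenant_id = ?"

# Without an index the filter forces a scan of the whole table.
plan_before = query_plan(conn, sql, ("tenant-7",))

# With an index on the filter column the planner can search the index.
conn.execute("CREATE INDEX idx_bills_tenant ON bills (tenant_id)")
plan_after = query_plan(conn, sql, ("tenant-7",))

print("before:", plan_before)
print("after: ", plan_after)
```

On a production database the equivalent check would be the engine's own `EXPLAIN` output for the queries that Jaeger flagged as slow.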
What went wrong:
- The es-cluster was not configured for true HA. During the cluster resize, the es-cluster could not connect to its master pods because it was unable to form a quorum in the 3-node HA setup. As a result, it possibly entered a split-brain scenario during the rolling update and the pods would not start. Our attempts to recover the critical data eventually corrupted the ES kibana index, which resulted in losing the kibana visualisations.
- We were able to recover the entire cluster without impact to any of the data indices, but the kibana index hosting the dashboards and visualisations was beyond recovery.
- Unfortunately, these dashboards and visualisations differed from those in UAT, so a simple promotion was not possible, and they were not stored in source control for an easy restore.
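
The quorum issue above comes down to a majority rule: a master can only be elected while more than half of the master-eligible nodes can see each other. The sketch below works through that arithmetic for a 3-node setup. The function names are illustrative. In Elasticsearch versions before 7.0 this majority value is what `discovery.zen.minimum_master_nodes` is expected to be set to; the notes do not record which ES version or setting was in use, so treat this as a general illustration rather than the actual cluster configuration.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(master_eligible_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable nodes are enough to form a quorum."""
    return reachable_nodes >= quorum(master_eligible_nodes)


# In a 3-node setup, 2 nodes must be up and connected at all times.
print(quorum(3))               # majority for 3 master-eligible nodes
print(can_elect_master(3, 2))  # one node restarting during a rolling update
print(can_elect_master(3, 1))  # two nodes down at once
```

A rolling update that takes down more than one master-eligible node at a time in a 3-node cluster leaves the remaining node without a majority, which matches the symptom of pods that could not start.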
...
Start doing | Stop doing | Keep doing |
---|---|---|
 | | |
...
- Gajendran C, Abhishek Jain: Find out which service discovery strategy is followed at EkStep (Mathews). Need to explore the Headless Service Discovery topic.
- Nithin DV: Partitioning strategy for tables (time and tenant based, and on kafka topics).
- Ghanshyam Rawat: Strategy plan for indexing data based on table columns, and query optimisation.
- Ghanshyam Rawat: DB script review process plan.
- Abhishek Jain, Nithin DV: Review what kind of data is being captured and logged by Jaeger tracing; logs should be masked (Important).
- Gajendran C: Add production infra alerts / dashboards (ES, Kafka, Zuul, DB) to the roadmap.
- Nithin DV, Tarun Lalwani: Come up with the proof point on the need for synchronous calls to solve the specific use case.
- Gajendran C: Set up the Platform PRDs prioritisation meeting with Abhishek Jain to keep it aligned with the Operating plan strategy.
- Onboard the implementation team to follow the Ops ticket and approval process.
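
For the action item on masking data captured by tracing, one possible starting point is a small filter applied to span tags or log fields before they are emitted. This is a rough sketch under stated assumptions: the sensitive key names and the 10-digit rule are hypothetical examples, and it is plain Python that does not call any Jaeger API.

```python
import re

# Hypothetical list of field names whose values must never be logged.
SENSITIVE_KEYS = {"password", "auth_token", "mobile_number", "email"}

# Assumption: runs of 10 or more digits are treated as phone or ID numbers.
LONG_DIGITS = re.compile(r"\d{10,}")


def mask_value(key, value):
    """Mask a single tag value based on its key and content."""
    if key.lower() in SENSITIVE_KEYS:
        return "***"
    if isinstance(value, str):
        return LONG_DIGITS.sub("***", value)
    return value


def mask_tags(tags):
    """Return a copy of the tags with sensitive data masked."""
    return {key: mask_value(key, value) for key, value in tags.items()}


masked = mask_tags(
    {"password": "secret", "http.url": "/user/9876543210", "status": 200}
)
print(masked)
```

The actual list of fields to mask would come out of the review of what Jaeger currently captures.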