...
Background of the problem:
- During the March 2019 traffic peak, PMIDC raised multiple issues related to long service response times and a few errors (caused by a combination of the asynchronous nature of our APIs and the UI not handling it well).
- DB CPU utilisation was being maxed out by long-running queries; alerts for this were received via AWS RDS SNS.
...
- Analyzed AWS RDS utilisation graphs to derive possible patterns of high utilization, which helped us narrow down the key hour wherein we had the highest utilization.
- Based on the metrics above, we dug deep into API response times and detailed service traces using our distributed tracing setup, Jaeger.
- Several APIs were taking ~20-40s, with the majority of the time spent querying the database; on analyzing the queries it was clear that they were not running optimally.
...
- Long-running queries were analyzed using Postgres tools and fixed by adding the necessary indices, first on UAT and then on PROD.
- Further, modules were analyzed and indices were added on all commonly searched columns.
- Monitoring was increased in an attempt to stay on top of such situations.
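
The effect of indexing a commonly searched column pair can be illustrated with a minimal, self-contained sketch. It is not the procedure used on UAT or PROD: it uses SQLite from the Python standard library as a stand-in for Postgres, and the `receipts` table, its columns, and the index name are hypothetical. On Postgres, the analogous tools would be `EXPLAIN ANALYZE` for the plan and `CREATE INDEX` for the fix.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); only the detail text is needed.
    return " | ".join(row[3] for row in rows)


# Hypothetical table standing in for a module's search table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, consumer_code TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO receipts (tenant_id, consumer_code, amount) VALUES (?, ?, ?)",
    [(f"tenant-{i % 50}", f"code-{i}", float(i)) for i in range(10000)],
)

# A typical search filtering on two commonly searched columns.
search = "SELECT id, amount FROM receipts WHERE tenant_id = ? AND consumer_code = ?"
args = ("tenant-7", "code-7")

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, search, args)

# Add an index covering the searched columns (hypothetical name).
conn.execute(
    "CREATE INDEX idx_receipts_tenant_code ON receipts (tenant_id, consumer_code)"
)

# With the index in place the planner can seek directly to matching rows.
plan_after = query_plan(conn, search, args)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two printed plans shows the query switching from a full table scan to an index search, which is the kind of change the index additions above were aiming for.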
Impact of the exercise:
What went well:
- API response times improved from ~20-40s to ~2s, which resulted in a large performance improvement for each service and for the system as a whole (graphs attached below).
- Random errors caused by slow asynchronous persistence of data were also resolved.
- AWS RDS utilisation was no longer hitting 100% and remained well within limits (graphs depicting max and average CPU usage attached below).
What went wrong:
- None
...
- Nithin DV: Partitioning strategy for tables (time- and tenant-based, and on Kafka topics).
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: Plan a strategy for indexing data based on table columns, and for query optimisation.
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: Plan a DB script review process.
- Abhishek Jain, Nithin DV: Review what kind of data is being captured and logged by Jaeger tracing; logs should be masked (Important).
- Gajendran C, Nithin DV: Add production infra alerting and dashboards (ES, Kafka, Zuul, Jaeger, DB) to the roadmap.
- Tarun Lalwani: Come up with a strategy for basic performance testing of services.
- Nithin DV, Tarun Lalwani: Come up with a proof point on the need for synchronous calls to solve the specific use case.