Background of the problem:
- During the March 2019 traffic peak, PMIDC raised multiple issues related to long service response times and a few errors (caused by a combination of the asynchronous nature of our APIs and the UI not handling that well).
- DB CPU utilisation was maxed out by long-running queries; alerts for this were received via AWS RDS SNS notifications.
How did we react:
- Analysed the AWS RDS utilisation graphs to identify patterns of high utilisation, which helped us narrow down to the hour with the highest load.
- Based on the metrics above, we dug deeper into API response times and detailed service traces using our distributed tracing setup, Jaeger.
- Several APIs were taking ~20-40s, with the majority of the time spent querying the database; analysing these queries made it clear that they were not running optimally (a minimal sketch of how such a query can be inspected follows this list).
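For reference, this is roughly how a suspect query can be inspected: a minimal sketch assuming psycopg2 is available, with a hypothetical table and search columns standing in for the real schema.

```python
# A minimal sketch, assuming psycopg2; the table and column names below are
# illustrative assumptions, not the actual schema.
import psycopg2

SLOW_QUERY = """
    SELECT * FROM eg_pt_property          -- hypothetical table
    WHERE tenantid = %s AND mobilenumber = %s
"""

with psycopg2.connect("dbname=egov user=egov host=localhost") as conn:
    with conn.cursor() as cur:
        # EXPLAIN (ANALYZE, BUFFERS) runs the query and reports the actual plan,
        # timings and buffer usage, which shows whether Postgres is falling back
        # to a sequential scan instead of using an index.
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SLOW_QUERY,
                    ("pb.amritsar", "9999999999"))
        for (line,) in cur.fetchall():
            print(line)
```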
What was done as part of the exercise:
- Long-running queries were analysed using Postgres tools and fixed by adding the necessary indices, first on UAT and then on PROD (see the sketch after this list).
- Further, the modules were analysed and indices were added on all commonly searched columns.
- Monitoring was increased in an attempt to stay on top of such situations.
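The sketch below illustrates the kind of analysis and fix described above, assuming the pg_stat_statements extension is enabled and psycopg2 is available; the column names match Postgres versions before 13, and the index, table and column names are purely hypothetical.

```python
# A minimal sketch, assuming pg_stat_statements is enabled and psycopg2 is
# installed; total_time / mean_time are the pre-Postgres-13 column names.
# The index at the end is illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=egov user=egov host=localhost")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Top 10 statements by total execution time: candidates for indexing or rewriting.
    cur.execute("""
        SELECT calls,
               round(total_time::numeric, 1) AS total_ms,
               round(mean_time::numeric, 1)  AS mean_ms,
               query
        FROM pg_stat_statements
        ORDER BY total_time DESC
        LIMIT 10
    """)
    for calls, total_ms, mean_ms, query in cur.fetchall():
        print(f"{calls:>8} calls  {total_ms:>12} ms total  {mean_ms:>8} ms avg  {query[:80]}")

    # Add an index on a commonly searched column without blocking writes
    # (hypothetical table/column, shown only to illustrate the UAT -> PROD rollout).
    cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_property_mobilenumber "
                "ON eg_pt_property (mobilenumber)")

conn.close()
```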
Impact of the exercise:
What went well:
- API response times improved from ~20-40s to ~2s, which resulted in a significant performance improvement for each service and for the system as a whole; graphs are attached below.
- The random errors that occurred due to slow asynchronous persistence of data were also resolved.
- AWS RDS CPU utilisation no longer hit 100% and remained well within limits; graphs depicting maximum and average CPU usage are attached below.
What went wrong:
- None
Retrospective
- Code-review queries thoroughly before moving them to production.
- Add indices on all commonly searched columns; automate this if possible.
- Invest in alerting so that we become aware of such situations ourselves and act on them, rather than depending on our users to bring them to our notice (a sketch of one possible alarm follows this list).
- Analyse the pros and cons of making critical APIs such as billing and collection synchronous.
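As one possible shape for the alerting item above, here is a hedged sketch of a CloudWatch alarm on RDS CPU utilisation using boto3; the instance identifier, SNS topic ARN, region and 80% threshold are all assumptions, not our actual configuration.

```python
# A minimal sketch, assuming boto3 and an existing SNS topic; the DB instance
# identifier, topic ARN and threshold are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

cloudwatch.put_metric_alarm(
    AlarmName="rds-egov-prod-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "egov-prod"}],  # hypothetical
    Statistic="Average",
    Period=300,                 # 5-minute windows
    EvaluationPeriods=2,        # alert only after two consecutive breaches
    Threshold=80.0,             # warn well before CPU hits 100%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-south-1:123456789012:egov-prod-alerts"],  # hypothetical
    AlarmDescription="RDS CPU above 80% for 10 minutes",
)
```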
| Start doing | Stop doing | Keep doing |
|---|---|---|
|  |  |  |
Action items
- Nithin DV: Partitioning strategy for tables (time and tenant based, and on Kafka topics); a minimal sketch of time-based partitioning is given after this list.
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: Strategy and plan to index data based on table columns; query optimisation.
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: DB script review process plan.
- Abhishek Jain, Nithin DV: Review what kind of data is captured and logged by Jaeger tracing; sensitive data in logs should be masked (important).
- Gajendran C, Nithin DV: Add production infra alerting and dashboards (ES, Kafka, Zuul, Jaeger, DB) to the roadmap.
- Tarun Lalwani: Come up with a strategy for basic performance testing of services.
- Nithin DV, Tarun Lalwani: Come up with a proof point on the need for synchronous calls to solve the specific use case.
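As a starting point for the partitioning action item, a minimal sketch of time-based declarative partitioning (Postgres 10+) via psycopg2 is shown below; the table, columns and monthly ranges are illustrative only, and tenant-based or Kafka-topic partitioning would need a separate design.

```python
# A minimal sketch, assuming Postgres 10+ declarative partitioning and psycopg2;
# the table name, partition key and monthly ranges are illustrative assumptions,
# not the final strategy.
import psycopg2

DDL = [
    # Parent table partitioned by a time column; a timestamptz column is used
    # here for readability even though audit columns are often epoch millis.
    """
    CREATE TABLE IF NOT EXISTS eg_payment_audit (
        id          bigserial,
        tenantid    text NOT NULL,
        createdtime timestamptz NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (createdtime)
    """,
    # One partition per month; older months can later be detached or archived.
    """
    CREATE TABLE IF NOT EXISTS eg_payment_audit_2019_03
        PARTITION OF eg_payment_audit
        FOR VALUES FROM ('2019-03-01') TO ('2019-04-01')
    """,
    """
    CREATE TABLE IF NOT EXISTS eg_payment_audit_2019_04
        PARTITION OF eg_payment_audit
        FOR VALUES FROM ('2019-04-01') TO ('2019-05-01')
    """,
]

with psycopg2.connect("dbname=egov user=egov host=localhost") as conn:
    with conn.cursor() as cur:
        for statement in DDL:
            cur.execute(statement)
```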