APIs Slowdown & Receipt creation issue (RCA)
Background of the problem:
- During the March 2019 peak traffic period, PMIDC raised multiple issues related to long service response times and a few errors (caused by a combination of the asynchronous nature of our APIs and the UI not handling this well).
- DB CPU utilisation was maxing out due to long-running queries; alerts for this were received via AWS RDS SNS.
How did we react:
- Analyzed AWS RDS utilisation graphs to identify patterns of high utilisation, which helped us narrow down the hour with the highest utilisation.
- Based on those metrics, we analyzed API response times and detailed service traces using our distributed tracing setup, Jaeger.
- Several APIs were taking ~20-40s, with the majority of the time spent querying the database; analysis of the queries showed that they were not running optimally.
What was done as part of the exercise:
- Long-running queries were analyzed using Postgres tools and fixed by adding the necessary indices, first on UAT and then on PROD.
- Further, modules were analyzed and indices were added on all commonly searched columns.
- Monitoring was increased in an attempt to stay on top of such situations.
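
As an illustration of the index fix described above, here is a minimal Python sketch that generates the kind of statement that could be applied on UAT and then on PROD. The function name, table name and column names are hypothetical, and the use of CONCURRENTLY and IF NOT EXISTS is an assumption about how such a change might be rolled out without blocking writes, not a record of what was actually run.

```python
import re

# Only allow simple lowercase identifiers so that the generated SQL
# cannot be altered by unexpected input.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def create_index_sql(table, columns):
    """Build a PostgreSQL CREATE INDEX statement for the given columns.

    The table and column names passed in are illustrative; the index
    naming scheme (idx_<table>_<columns>) is an assumption.
    """
    if not columns:
        raise ValueError("at least one column is required")
    for name in (table, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"idx_{table}_{'_'.join(columns)}"
    column_list = ", ".join(columns)
    # CONCURRENTLY avoids blocking writes while the index is built;
    # it cannot be run inside a transaction block.
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} ({column_list});"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(create_index_sql("receipt", ["tenant_id", "created_time"]))
```

Running the generated statement against UAT first, and comparing the query plan before and after with EXPLAIN, mirrors the UAT-then-PROD order described above.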
Impact of the exercise:
What went well:
- API response times improved from ~20-40s to ~2s, which improved performance for each service and for the system as a whole; graphs attached below.
- Random errors caused by slow asynchronous persistence of data were also resolved.
- AWS RDS utilisation no longer hit 100% and remained well within limits; graphs depicting max and average CPU usage attached below.
What went wrong:
- None
Retrospective
- Review queries thoroughly in code review before moving them to production.
- Add indices to all commonly searched columns; automate this if possible.
- Invest in alerting so that we are aware of such situations and can act on them, rather than depending on our users to bring them to our notice.
- Analyse the pros and cons of making critical APIs such as billing and collection synchronous.
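
To make the alerting point concrete, the following Python sketch shows one possible rule: raise an alert only when DB CPU utilisation stays above a threshold for several consecutive samples, so that short spikes are ignored. The function name, the 80% threshold and the three-sample window are assumptions chosen for illustration; they are not values taken from our RDS alarms.

```python
def sustained_breach(samples, threshold=80.0, min_consecutive=3):
    """Return True if `samples` contains at least `min_consecutive`
    consecutive readings at or above `threshold`.

    `samples` is an ordered sequence of CPU utilisation percentages,
    for example one reading per minute. Threshold and window size
    are illustrative defaults.
    """
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical readings: one sustained breach, one with isolated spikes.
    print(sustained_breach([50, 85, 90, 95, 60]))
    print(sustained_breach([85, 50, 90, 50, 95]))
```

Requiring consecutive breaches trades a slightly later alert for fewer false alarms; the right window depends on how quickly the team wants to react.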
Action items
- Nithin DV: Partitioning strategy for tables (time- and tenant-based, and on Kafka topics).
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: Strategy plan to index data based on table columns; query optimisation.
- Ghanshyam Rawat, Nithin DV, Tarun Lalwani: DB script review process plan.
- Abhishek Jain, Nithin DV: Review what kind of data is being captured and logged by Jaeger tracing; logs should be masked (Important).
- Gajendran C, Nithin DV: Add production infra alerting and dashboards (ES, Kafka, Zuul, Jaeger, DB) to the roadmap.
- Tarun Lalwani: Come up with a strategy for basic performance testing of services.
- Nithin DV, Tarun Lalwani: Come up with a proof point on the need for synchronous calls to solve the specific use case.