RCA - PB-PROD - Production Issues Mar 2, 2020


Date: Mar 2, 2020

Team: Delivery, DevOps

Participants: @Gajendran C, @Raju Singh, @Chandar Muthukrishnan, @Aayush Sharma, @Sammeer Rawat, @Satyam K Ashish, @Tarun Lalwani

Background

Lately, the report service has been used extensively by the ULBs during peak hours, and its heavy DB queries were consuming most of the master DB's capacity, making it a bottleneck for other application services that share the same DB. Although the report DB query optimisation task with PMIDC is still in progress, the eGov Infra team took a tactical decision to mitigate the application performance issues by introducing a read-replica DB, so that report query traffic is diverted away from the master DB. As part of this exercise the eGov Infra team had to make the changes below. These were not tested thoroughly on UAT, because the reaction window during the rebate period was short and the changes had to be made over the weekend, when impact-testing capacity was limited.

Changes Done:

  1. DB read replica: Why? To let the report service read data from a standalone replica that syncs from the master.

  2. Deployment tool migration to Helm: Why? UAT and Prod were using two different deployment configurations; many configs were missing and maintenance was becoming ambiguous. We had to unify and simplify them so the teams could maintain them better.

  3. Kafka went down: One of the three Kafka nodes unexpectedly lost network connectivity, the first time we have observed this infra behaviour in AWS, and it happened at peak transaction time. To recover we had to restart the node, which led to 45 minutes of downtime.
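The read-replica change in item 1 essentially amounts to pointing the report service's datasource at the replica endpoint while every other service keeps reading and writing through the master. A minimal sketch of what such a deployment config could look like (the hostnames, keys, and service names below are illustrative assumptions, not the actual eGov configuration):

```yaml
# Hypothetical Helm values fragment - names are illustrative only.
report-service:
  env:
    # Report queries are routed to the standalone read replica,
    # which streams changes from the master.
    DB_URL: jdbc:postgresql://egov-db-replica.internal:5432/egovdb

egov-services:
  env:
    # All other services continue to read/write against the master.
    DB_URL: jdbc:postgresql://egov-db-master.internal:5432/egovdb
```

Because the replica syncs asynchronously from the master, reports can lag slightly behind the latest transactions, which is usually an acceptable trade-off for taking the reporting load off the master.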

Impact:

  • Many ULBs immediately found that reports were not being generated. (MC)

  • Double receipt creation (an ambiguous config error pointing to the DIGIT v2 collection queue)

  • Payment gateway not working (missing configs)

  • Missing receipt numbers (impact of Kafka being down)

  • Reports not working (due to the read replica)

  • Firecess not working (the config flag to enable firecess was not added)

  • Unable to make any payments (impact of Kafka being down)

Current state:

  • All of the above issues have largely been addressed reactively.

Retrospective

Start doing

  • Take an adequate testing window and ensure QA is available to run sanity checks while making any such changes to production.

  • Prioritise building an efficient automated monitoring and alerting system to find issues proactively, ahead of the end user.

Stop doing

  • Making bulk or big changes without taking an adequate testing window and planning critical tests to cover the impact.

  • Tactical fixes on PROD.

Keep doing

  • Informing PMIDC in advance and including them in the process.

  • Proceeding with changes only with consent, despite the tactical situation.
