Deployment Best Practices Checklist
Daily Tasks:
- Monitor the status of the environment and ensure every service is running (https://<domain_monitoring>/monitoring)
- Monitor the resource usage of the environments https://central-dashboard.digit.org/d/gzIcCaiVz/kubernetes-cluster-ram-and-cpu-utilization
- Monitor the alerts-overview dashboard and take appropriate action on critical and warning alerts https://central-dashboard.digit.org/d/smo98XK4z/alerts-overview?orgId=1&refresh=30s
- Keep track of all tasks by creating tickets
- Watch the Prometheus alerts in the Slack channel
- Monitor the Kafka consumer group lags https://<domain_name>/monitoring/d/N9uZBy8Wz/1-kubernetes-cluster-overview-kubrnettes?viewPanel=137&orgId=1
- For Kafka-related issues, follow the troubleshooting guide https://core.digit.org/guides/operations-guide/kafka-troubleshooting-guide
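
As a rough illustration of what the consumer group lag panel reports, the sketch below computes lag per partition as the log end offset minus the committed offset and flags partitions above a threshold. The offset values, the `LAG_THRESHOLD` value, and the function names are hypothetical; in practice the offsets would come from the dashboard or from Kafka's own tooling.

```python
from typing import Dict, List, Tuple

# Hypothetical threshold; tune it to the normal throughput of the topic.
LAG_THRESHOLD = 1000


def partition_lags(end_offsets: Dict[int, int],
                   committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition: log end offset minus committed offset."""
    lags = {}
    for partition, end in end_offsets.items():
        # Simplifying assumption: a partition without a committed offset
        # is treated as entirely unconsumed.
        committed = committed_offsets.get(partition, 0)
        lags[partition] = max(end - committed, 0)
    return lags


def lagging_partitions(lags: Dict[int, int],
                       threshold: int = LAG_THRESHOLD) -> List[Tuple[int, int]]:
    """Return (partition, lag) pairs whose lag exceeds the threshold."""
    return sorted((p, lag) for p, lag in lags.items() if lag > threshold)


# Made-up sample offsets for a three-partition topic.
end_offsets = {0: 5200, 1: 4800, 2: 9100}
committed_offsets = {0: 5150, 1: 4800, 2: 7000}

lags = partition_lags(end_offsets, committed_offsets)
total_lag = sum(lags.values())
alerts = lagging_partitions(lags)
print(lags, total_lag, alerts)
```

A lag that keeps growing on one partition while the others stay near zero usually points at a stuck or slow consumer rather than a cluster-wide problem, which is a useful first check before opening the troubleshooting guide.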
Weekly Tasks:
- Monitor the resource usage of the environments https://central-dashboard.digit.org/d/gzIcCaiVz/kubernetes-cluster-ram-and-cpu-utilization
- Monitor the alerts-overview dashboard and take appropriate action on critical and warning alerts https://central-dashboard.digit.org/d/smo98XK4z/alerts-overview?orgId=1&refresh=30s
- Monitor the Kubecost Dashboard https://central-dashboard.digit.org/d/JOUdHGZZz/kubecost-dashboard-for-grafana-cloud?orgId=1
- Clean up logs
- Back up logs
- Take a weekly DB dump in case of SDC
- Back up ES data
- Publish the weekly summary report (agree on a format first if none exists)
- Publish the Jira board status
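
The log cleanup and backup items above can be combined into a single step. The following is a minimal sketch, assuming logs are plain `*.log` files in one directory and that a gzip tarball is an acceptable backup format; the function name, the paths, and the retention period are hypothetical and not part of any DIGIT tooling.

```python
import tarfile
import time
from pathlib import Path
from typing import List, Optional


def archive_old_logs(log_dir: Path, archive_path: Path, max_age_days: int,
                     now: Optional[float] = None) -> List[Path]:
    """Archive *.log files older than max_age_days, then delete them."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    old_logs = sorted(
        p for p in log_dir.glob("*.log") if p.stat().st_mtime < cutoff
    )
    if not old_logs:
        return []
    # Write the backup first; files are removed only after the archive
    # has been closed without raising an error.
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in old_logs:
            tar.add(path, arcname=path.name)
    for path in old_logs:
        path.unlink()
    return old_logs
```

A weekly job could call, for example, `archive_old_logs(Path("/var/log/myapp"), Path("/backups/logs-week.tar.gz"), max_age_days=7)`, where both paths are placeholders. Backing up before deleting means a failed archive leaves the original logs untouched.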
Monthly Tasks:
- Publish the Jira Board status report
- Create a new Jira sprint and maintain it
- Publish the environment resource status report
- Publish the environment cost report
- Publish troubleshooting documents for any new problem you have resolved