Deployment Best Practices Checklist
Daily Tasks:
Monitor the status of the environment and ensure every service is running (https://<domain_monitoring>/monitoring)
Monitor the resource usage of the environments https://central-dashboard.digit.org/d/gzIcCaiVz/kubernetes-cluster-ram-and-cpu-utilization
Monitor the alerts-overview dashboard and take appropriate action on critical and warning alerts https://central-dashboard.digit.org/d/smo98XK4z/alerts-overview?orgId=1&refresh=30s
Keep track of all tasks by creating tickets
Watch the Prometheus alerts in the Slack channel
Monitor the Kafka consumer group lags https://<domain_name>/monitoring/d/N9uZBy8Wz/1-kubernetes-cluster-overview-kubrnettes?viewPanel=137&orgId=1
Troubleshoot any Kafka-related issues using the Kafka troubleshooting guide https://core.digit.org/guides/operations-guide/kafka-troubleshooting-guide
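The first daily check (every service is running) can also be scripted alongside the dashboards. The following is a minimal sketch, assuming the environment is a Kubernetes cluster reachable through a configured `kubectl`. The function names `find_unhealthy_pods` and `fetch_pods` are illustrative and not part of any DIGIT tooling. The health rule used here (phase `Running` with all containers ready, or phase `Succeeded`) is one reasonable choice, not an official definition.

```python
import json
import subprocess


def find_unhealthy_pods(pod_list):
    """Return "namespace/name" for every pod that does not look healthy.

    `pod_list` is the parsed JSON produced by `kubectl get pods -o json`.
    Assumption: a pod is healthy if its phase is "Succeeded" (a finished job),
    or its phase is "Running" and every container reports ready.
    """
    unhealthy = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            unhealthy.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return unhealthy


def fetch_pods():
    """Read all pods from the current kubectl context (assumes cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Against a live cluster, `find_unhealthy_pods(fetch_pods())` returns the pods that need attention. An empty list means every pod passed the check. Keeping the parsing separate from the `kubectl` call lets the rule be tested on saved JSON without cluster access.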
Weekly Tasks:
Monitor the resource usage of the environments https://central-dashboard.digit.org/d/gzIcCaiVz/kubernetes-cluster-ram-and-cpu-utilization
Monitor the alerts-overview dashboard and take appropriate action on critical and warning alerts https://central-dashboard.digit.org/d/smo98XK4z/alerts-overview?orgId=1&refresh=30s
Monitor the Kubecost Dashboard https://central-dashboard.digit.org/d/JOUdHGZZz/kubecost-dashboard-for-grafana-cloud?orgId=1
Clean up logs
Back up logs
Take a weekly DB dump in case of SDC
Back up ES data
Publish the weekly summary report (agree on a format first if none exists)
Publish the Jira board status
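The log backup and cleanup items above are easiest to do together, so that nothing is deleted before it has been archived. The following is a minimal sketch, assuming logs are plain `*.log` files in a single directory on a host you can access. The function name `archive_and_remove_old_logs`, the 7-day retention default, and the archive naming are all hypothetical choices, not values prescribed by this checklist.

```python
import tarfile
import time
from pathlib import Path


def archive_and_remove_old_logs(log_dir, backup_dir, max_age_days=7, now=None):
    """Back up *.log files older than `max_age_days`, then delete the originals.

    Returns the path of the archive that was written, or None when no file
    was old enough. `now` (seconds since the epoch) can be passed for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old_logs = sorted(
        p for p in Path(log_dir).glob("*.log") if p.stat().st_mtime < cutoff
    )
    if not old_logs:
        return None

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    archive = backup_dir / f"logs-{stamp}.tar.gz"

    # Write the backup first; originals are removed only after the
    # archive has been fully written and closed.
    with tarfile.open(archive, "w:gz") as tar:
        for path in old_logs:
            tar.add(path, arcname=path.name)
    for path in old_logs:
        path.unlink()
    return archive
```

A weekly scheduled job could call, for example, `archive_and_remove_old_logs("/var/log/myapp", "/backups/logs")`, where both paths are placeholders. If the archive cannot be written, the exception is raised before any log file is removed.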
Monthly Tasks:
Publish the Jira board status report
Create a new Jira sprint and maintain it
Publish the environment resource status report
Publish the environment cost report
Publish a troubleshooting document for any new problem you have resolved