We grew the SRE team and onboarded 3 new members in Q1. Here’s the checklist we followed:
- Access to logging/monitoring tools (Grafana, Kibana, Prometheus)
- Intro to on-call playbooks and runbooks
- Sample incident postmortems
- Walkthrough of CI/CD pipelines
- Shadowing live deploys
Bonus: we created an SRE lab repo where new hires could break and recover toy services without fear.
Result: onboarding time dropped from 3 weeks to 5 days.
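
For anyone curious what a break/recover exercise in a lab repo could look like, here is a minimal sketch. It is not taken from our actual repo: `ToyService`, `inject_fault`, `restart`, and `run_drill` are hypothetical names chosen for illustration, and the "service" is just an in-memory object rather than a real process.

```python
class ToyService:
    """In-memory stand-in for a lab service (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.fault = None

    def inject_fault(self, fault):
        # Simulate an outage by marking the service unhealthy.
        self.healthy = False
        self.fault = fault

    def handle_request(self):
        # A broken service fails loudly so the trainee can observe it.
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down: {self.fault}")
        return "ok"

    def restart(self):
        # Recovery step: clear the fault and mark the service healthy again.
        self.fault = None
        self.healthy = True


def run_drill(service, fault):
    """Break the service, observe the failure, recover, and verify health.

    Returns a list of log lines describing what happened.
    """
    log = []
    service.inject_fault(fault)
    try:
        service.handle_request()
    except RuntimeError as exc:
        log.append(f"detected: {exc}")
    service.restart()
    log.append(f"recovered: {service.handle_request()}")
    return log


if __name__ == "__main__":
    for line in run_drill(ToyService("checkout"), "disk full"):
        print(line)
```

The point of the shape is that every drill ends with a verification step, so a trainee practices confirming recovery rather than assuming it.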