2026
Staff Site Reliability Engineer
Own platform reliability for a multi-region streaming network serving 90M MAUs. Lead a 22-person SRE org across edge, control plane, and data infra.
- Designed and rolled out a cell-based architecture that contained 11 control-plane incidents to single cells, with zero customer-facing outages in 18 months.
- Rewrote the autoscaling controller in Rust; reduced p99 scale-up latency from 38s to 4.2s and cut compute spend by $11M/yr.
- Established the org's first error-budget policy and quarterly reliability review; adopted by 9 product teams.
- Mentored 6 senior engineers to staff and 2 staff engineers to principal.