Job Description
Job Summary:
HappyRobot is a voice AI tool that automates phone operations in the logistics and fleet management sectors. The company is seeking a Site Reliability Engineer to lead the scaling of operational resilience, ensuring system stability and observability while improving developer focus and system uptime.
Responsibilities:
• Own the stability, observability, and debugging workflows that keep our systems running smoothly.
• Be the go-to person for untangling complex failures in real time.
• Design tools that turn chaos into clarity.
• Help shift from reactive to proactive operations.
• Reduce incident load, build internal tooling, and directly improve developer focus and system uptime.
Qualifications:
Required:
• 1+ years of hands-on experience debugging production systems (logs, traces, incidents, etc.)
• Strong problem-solving skills and ability to dive into unfamiliar backend codebases
• Comfort with Python and Go for reading code and writing small tools/utilities
• Familiarity with observability and monitoring tools (e.g., Datadog, Prometheus, Sentry)
• Clear, calm communication under pressure — especially during live incidents
Preferred:
• Experience working with distributed systems or services at scale
• Experience building or maintaining internal tooling for on-call teams or reliability workflows
• Familiarity with deployment pipelines, CI/CD, or infra-as-code
• Experience improving system observability (e.g., custom metrics, traces, log pipelines)
Company:
HappyRobot is a voice AI tool that automates phone operations in the logistics and fleet management sectors. Founded in 2022, the company is headquartered in San Francisco, California, USA, with a team of 11-50 employees. The company is currently early-stage. HappyRobot has a track record of offering H1B sponsorships.