Job Description
* **Release Management:** Coordinate and manage release cycles for observability platforms. Ensure smooth and timely releases with minimal disruption to services. Work with partners to migrate legacy monitoring to modern solutions. Work with the observability engineering team to address new requirements as they arise, either by leveraging existing solutions or by developing new ones.
* **Incident/Request Management:** Troubleshoot and resolve incidents related to observability platforms. Manage escalated customer issues and requests, ensuring timely and effective resolution. Document incident remediation activities and automate them where possible.
* **Performance Optimization:** Continuously monitor and enhance platform performance to support growing scale and complexity.
* **Collaboration and Communication:** Collaborate with cross-functional infrastructure, application, and business stakeholders to ensure observability solutions align with the broader IT strategy and infrastructure requirements. Communicate effectively with team members, management, and other stakeholders.
* **Continuous Improvement:** Identify opportunities for process optimization and efficiency gains. Stay current with industry trends and best practices to continuously improve observability operations.
* **Customer Focus:** Ensure high levels of customer satisfaction by effectively managing customer relationships. Provide excellent customer service and support for observability solutions.
* **Compliance and Security:** Ensure observability platforms comply with organizational policies and security standards. Implement tools and processes to detect and remediate configuration drifts and security risks.
* **Documentation and Reporting:** Maintain comprehensive documentation of the observability platform, Product DOU, processes, and procedures.
**Technical Expertise:**
* 5+ years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications.
* 4+ years in a Site Reliability Engineering (SRE) role managing modern observability solutions.
* 5+ years of development experience on enterprise-class applications: **JavaScript/Java, SQL, Spring Boot, and microservices**
* 5+ years managing and implementing observability and event management platforms **(e.g., AppDynamics, Splunk, Prometheus, Grafana).**
* **5+ years of experience with cloud computing platforms (GCP)** and with containers and container orchestration (e.g., Docker, Kubernetes).
* Familiarity with CI/CD pipelines and automation tools (e.g., Jenkins, GitLab, Argo CD).
* Experience developing and implementing monitoring and logging standards for infrastructure, platforms, and applications.
* Experience establishing and implementing event correlation policies and related rules to enrich event data and reduce TTD and TTR.
Job Type: Full-time
Pay: $90,000.00 – $110,000.00 per year
Application Question(s):
* What is your current work authorization status (e.g., U.S. Citizen, Green Card, GC EAD, H1B, etc.) for the location you applied for?
* Are you comfortable with the job location?
* Are you open to full-time employment opportunities?
* What is your current CTC or hourly rate?
* What is your expected CTC or hourly rate?
* What is your current location?
* Kindly share your LinkedIn profile link.
Experience:
* Release Operation Engineer: 6 years (Required)
* Site Reliability Engineer: 4 years (Required)
* Google Cloud Platform: 5 years (Required)
Work Location: In person