Site Reliability Engineer

November 13, 2025

Job Description

Dear Applicant,

**Role: Site Reliability Engineer (SRE) – Onsite**

**Location: Columbus, OH / Austin / Charlotte, NC**

**(Full Time) Visa Type: USC/GC preferred; H-1B/H4 EAD accepted**

**Experience: 7 to 9 years**

**Position Summary:**
As a Cloud Infrastructure Site Reliability Engineer (SRE) with expertise in multiple public cloud platforms, you will be responsible for operating infrastructure solutions, following the principles and practices pioneered by Google’s SRE model. Your work will ensure our cloud services meet uptime, reliability, and performance targets, and you will drive automation and continuous improvement across our production environments. This role involves collaborating with cross-functional teams to strengthen our cloud reliability posture and streamline processes through automation.

**Key Responsibilities:**

Design, build, and maintain highly available, scalable, and secure cloud infrastructure on platforms such as **AWS, GCP, or Azure**.

Develop and implement automation for provisioning, monitoring, scaling, and incident response using Infrastructure-as-Code tools (e.g., **Terraform, CloudFormation, Ansible**).

Monitor system reliability, capacity, and performance; proactively detect and address issues before they impact users. Strong experience implementing SRE monitoring and developing application reliability dashboards using **Splunk, Dynatrace, Grafana, AppDynamics, Datadog, and BigPanda**.

Collaborate with software engineering and security teams to ensure new services and features are production-ready and meet reliability standards.

Build and maintain tools for deployment, monitoring, and operations; automate manual processes to reduce toil. Experience with automation principles and tools (**Ansible, etc.**), including hands-on toil identification.

Document operational processes and system architectures to ensure knowledge sharing and repeatability.

**Qualifications:**

Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

3+ years of experience in software development with proficiency in at least one programming or scripting language (**e.g., Python, Go, Java, cURL scripting**).

Experience administering cloud platforms (**AWS, GCP, Azure**), including networking, security, containerization, storage, data management, and serverless technologies.

Solid understanding of **Unix/Linux systems, Windows Server, Oracle, MSSQL, MongoDB, networking** fundamentals, virtualized and distributed systems, and file systems. Deep understanding of observability (monitoring, alerting, and logging) tools in cloud environments. Ability to set up and maintain monitoring dashboards, alerts, and logs.

Experience with observability tools such as **AppDynamics, Geneos, Dynatrace, ECS-based internal tooling, Grafana, Prometheus, Splunk, and ThousandEyes**.

Familiarity with **Continuous Integration/Continuous Deployment** (CI/CD) tools for automated testing, deployments, provisioning, and observability.

Ability to manage and respond to incidents, perform root cause analysis, and conduct postmortem reviews.

Understanding of how to set, monitor, and maintain Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) for system reliability.

**Additional Qualifications (a Plus):**

5+ years of experience in SRE, DevOps, infrastructure, or cloud engineering roles, preferably supporting large-scale, distributed systems.

Excellent problem-solving, troubleshooting, and communication skills.

Experience leading technical projects or mentoring junior engineers.

Certifications: Certified Engineer, DevOps, SRE, CSREF
