This is a fully remote position, open to applicants in Brazil.

📋 Description

• We are looking for a Site Reliability Engineer (SRE) with extensive knowledge of monitoring, observability, and reliability engineering to improve systems across both on-premises infrastructure and Google Cloud Platform (GCP).

• This position is mainly responsible for the design, operation, and enhancement of monitoring, alerting, and observability platforms, particularly focusing on Grafana and Kubernetes environments.

• Additionally, this role will provide backup support for the Application Support team during resource shortages or significant incidents, delivering L2/L3 technical assistance as needed.

• Responsibilities include Monitoring & Observability (Core Focus):

• - Manage and operate the monitoring and observability stack across on-prem and GCP infrastructures.

• - Design, create, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications.

• - Define, fine-tune, and maintain alerts to keep a high signal-to-noise ratio.

• - Establish observability standards and best practices across teams.

• - Enhance visibility into system health, performance, and reliability.

• Site Reliability Engineering:

• - Implement SRE principles to boost availability, performance, and resilience.

• - Define and monitor SLIs, SLOs, and error budgets.

• - Participate in on-call rotations and respond to SEV incidents.

• - Lead or contribute to incident investigations and root cause analysis (RCA).

• - Implement preventive measures to reduce recurring incidents.

• Kubernetes & Platform Reliability:

• - Support and monitor Kubernetes environments (GKE and on-prem clusters).

• - Oversee cluster health, capacity, and resource utilization.

• - Resolve platform-level issues affecting application reliability.

• - Collaborate with Platform and Engineering teams to implement reliability enhancements.

• Secondary Responsibilities (Backup Application Support):

• - These duties apply only when needed and are not part of regular operations.

• - Offer L2/L3 application support during resource shortages in the support team, high-severity incidents (SEVs), and high-demand support periods or escalations.

• - Diagnose and troubleshoot application issues using existing runbooks and dashboards.

• - Work together with Application Support and Engineering teams during incidents.

• - Ensure all actions, findings, and resolutions are recorded in ServiceNow (SNOW).

⛳️ Requirements

• - Strong experience as a **Site Reliability Engineer or Reliability Engineer**.

• - Extensive hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting).

• - Solid background in monitoring and observability systems.

• - Production experience in managing **Kubernetes** environments.

• - Experience in supporting systems in both **GCP** and on-prem environments (mandatory).

• - Strong **Linux** systems knowledge and troubleshooting abilities.

• - Proficiency in **English** (both written and spoken).

• - Willingness to work in the **PST time zone**.

• - Availability to participate in an **on-call rotation** that includes covering one weekend day. Weekend time worked is compensated with one day off during the week, per the established work schedule.

• Technology Stack:

• - Observability: Grafana, Prometheus, logging platforms.

• - Containers: Kubernetes (GKE and on-prem).

• - Cloud: Google Cloud Platform (GCP).

• - Operations: Linux, networking, infrastructure monitoring.

• - Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents).

• Nice to have:

• - Experience supporting application teams during SEV incidents.

• - Knowledge of capacity planning and performance tuning.

• - Scripting skills (Python, Bash, etc.).

• - Experience with hybrid infrastructure environments.

🏝️ Benefits

• At Devsu, we are committed to creating an environment where you can excel both personally and professionally. By joining our team, you will benefit from:

• - A stable, long-term contract with opportunities for career advancement.

• - Private health insurance.

• - A remote-friendly culture that encourages work-life balance.

• - Ongoing training, mentorship, and learning programs to keep you at the forefront of the industry.

• - Complimentary access to AI training resources and cutting-edge AI tools to enhance your daily work.

• - A flexible Paid Time Off (PTO) policy as well as paid holidays.

• - Engaging, world-class software projects for clients in the US and LatAm.

• - Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment.

Join Devsu and explore a workplace that values your growth, supports your well-being, and empowers you to create a global impact.

Senior Site Reliability Engineer – GCP
