
Senior Site Reliability Engineer – GCP
Posted 6 days ago

This is a fully remote position, open to applicants in Brazil.
• We are looking for a Site Reliability Engineer (SRE) with extensive knowledge of monitoring, observability, and reliability engineering to improve systems across both on-premises infrastructure and Google Cloud Platform (GCP).
• This position is mainly responsible for the design, operation, and enhancement of monitoring, alerting, and observability platforms, with a particular focus on Grafana and Kubernetes environments.
• Additionally, this role will provide backup support for the Application Support team during resource shortages or significant incidents, delivering L2/L3 technical assistance as needed.
• Responsibilities:
• Monitoring & Observability (Core Focus):
• - Manage and operate the monitoring and observability stack across on-prem and GCP infrastructures.
• - Design, create, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications.
• - Define, fine-tune, and maintain alerts to keep a high signal-to-noise ratio.
• - Establish observability standards and best practices across teams.
• - Enhance visibility into system health, performance, and reliability.
• Site Reliability Engineering:
• - Implement SRE principles to boost availability, performance, and resilience.
• - Define and monitor SLIs, SLOs, and error budgets.
• - Engage in on-call rotations and respond to SEV incidents.
• - Lead or contribute to incident investigations and root cause analysis (RCA).
• - Drive preventive measures to reduce recurring incidents.
• Kubernetes & Platform Reliability:
• - Support and monitor Kubernetes environments (GKE and on-prem clusters).
• - Oversee cluster health, capacity, and resource utilization.
• - Resolve platform-level issues affecting application reliability.
• - Collaborate with Platform and Engineering teams to implement reliability enhancements.
• Secondary Responsibilities (Backup Application Support):
• - These duties apply only as needed and are not part of regular operations.
• - Offer L2/L3 application support during resource shortages in the support team, high-severity incidents (SEVs), and high-demand support periods or escalations.
• - Diagnose and troubleshoot application issues using existing runbooks and dashboards.
• - Work together with Application Support and Engineering teams during incidents.
• - Ensure all actions, findings, and resolutions are recorded in ServiceNow (SNOW).
• Requirements:
• - Strong experience as a **Site Reliability Engineer or Reliability Engineer**.
• - Extensive hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting).
• - Solid background in monitoring and observability systems.
• - Production experience in managing **Kubernetes** environments.
• - Experience in supporting systems in both **GCP** and on-prem environments (mandatory).
• - Strong **Linux** systems knowledge and troubleshooting abilities.
• - Proficiency in **English** (both written and spoken).
• - Willingness to work in the **PST time zone**.
• - Availability to participate in an **on-call rotation** that includes covering one weekend day. Time worked during weekends will be compensated with one day off during the week, per the established work schedule.
• Technology Stack:
• - Observability: Grafana, Prometheus, logging platforms.
• - Containers: Kubernetes (GKE and on-prem).
• - Cloud: Google Cloud Platform (GCP).
• - Operations: Linux, networking, infrastructure monitoring.
• - Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents).
• Nice to have:
• - Experience supporting application teams during SEV incidents.
• - Knowledge of capacity planning and performance tuning.
• - Scripting skills (Python, Bash, etc.).
• - Experience with hybrid infrastructure environments.
• At Devsu, we are committed to creating an environment where you can excel both personally and professionally. By joining our team, you will benefit from:
• - A stable, long-term contract with opportunities for career advancement.
• - Private health insurance.
• - A remote-friendly culture that encourages work-life balance.
• - Ongoing training, mentorship, and learning programs to keep you at the forefront of the industry.
• - Complimentary access to AI training resources and cutting-edge AI tools to enhance your daily work.
• - A flexible Paid Time Off (PTO) policy as well as paid holidays.
• - Engaging, world-class software projects for clients in the US and LatAm.
• - Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment.
Join Devsu and explore a workplace that values your growth, supports your well-being, and empowers you to create a global impact.