This is a fully remote position, open to applicants in Brazil.

Responsibilities:

• Ensure the availability and reliability of production systems;

• Implement and maintain monitoring, observability, and alerting solutions;

• Respond to incidents, conduct root cause analysis (RCA), and define remediation plans;

• Automate operational tasks and repetitive processes (Infrastructure as Code);

• Work with CI/CD pipelines to support secure, continuous deployment;

• Manage and enhance cloud environments (AWS and GCP);

• Apply best practices for resilience, scalability, and fault tolerance;

• Define and monitor SLIs, SLOs, and SLAs;

• Support development teams in creating more resilient applications;

• Conduct capacity planning and cost optimization (basic FinOps);

• Document processes, architectures, and operational playbooks.

Qualifications:

• Experience with cloud environments, particularly AWS and/or GCP;

• Proficiency in Linux/Unix systems;

• Experience with monitoring tools (e.g., Prometheus, Grafana, CloudWatch, Stackdriver);

• Understanding of containers and orchestration (Docker and Kubernetes);

• Experience with Infrastructure as Code (Terraform, CloudFormation, or similar);

• Automation expertise with languages such as Python, Bash, or Go;

• Familiarity with CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, etc.);

• Networking knowledge (VPC, DNS, load balancing);

• Awareness of cloud security (IAM, access policies, best practices);

• Experience with multi-cloud environments;

• Understanding of DevOps practices and agile culture;

• Proficiency with distributed observability tools (OpenTelemetry, Datadog, New Relic);

• Knowledge of service mesh (Istio, Linkerd);

• Experience with messaging systems (Kafka, Pub/Sub, SQS);

• Familiarity with Chaos Engineering practices;

• Understanding of FinOps (cloud cost management).

This position is also open to candidates with disabilities (PwD).

SRE – AWS/GCP
