This is a fully remote position, open to applicants in Brazil.

• Ensure the availability, scalability, and performance of applications and services;

• Implement and improve **observability** practices, including metrics, logs, and traces;

• Create and maintain **dashboards** for monitoring system health indicators;

• Define and manage **alerting systems**, focusing on efficient alerts and reducing noise;

• Identify and resolve incidents, and conduct root cause analysis (RCA);

• Collaborate with development teams for continuous improvement (DevOps);

• Automate operational routines and monitoring processes;

• Support the definition and tracking of SLIs, SLOs, and SLAs;

• Contribute to a culture of reliability and resilience engineering.

• Experience with **SRE/DevOps** practices;

• Strong knowledge of **observability** (monitoring, logging, and tracing);

• Experience in building **dashboards and visualizing operational data**;

• Experience with **alert management (alerting systems)**;

• Familiarity with monitoring and observability tools, such as Elastic Stack (Elasticsearch, Logstash, Kibana), Datadog, Splunk, and Dynatrace;

• Knowledge of cloud environments (AWS, Azure, or GCP);

• Experience with automation (Python, Shell Script, or similar);

• Understanding of Linux systems and networking;

• Experience with containers and orchestration (Docker/Kubernetes);

• Experience with APM (Application Performance Monitoring) tools;

• Knowledge of infrastructure as code (Terraform, CloudFormation);

• Experience with CI/CD pipelines;

• Familiarity with Chaos Engineering practices;

• Certifications in cloud or SRE;

• Experience with a business-oriented observability culture.

Site Reliability Engineer – Senior