This is a fully remote position, open to applicants in Brazil.

📋 Description

• Take charge of the technical strategy for our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit), establishing instrumentation standards for Java and Node.js services while promoting the use of tracing, metrics, and structured logging.

• Define meaningful SLIs, SLOs, and error budgets, collaborating with engineering and product teams so that they inform real engineering decisions.

• Serve as a senior incident commander during incident response, conducting blameless postmortems with thorough technical analysis and actionable follow-up.

• Improve our on-call program so that it is humane and sustainable, treating the reduction of toil and alert noise as a key engineering objective.

• Shape architectural decisions across the platform, delving deep into critical areas such as GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.

• Coach SREs and platform engineers, elevate the technical standards through design and incident reviews, and contribute to the growth of the SRE discipline at Digibee.

⛳️ Requirements

• A minimum of 8 years in SRE, infrastructure, or platform engineering, with considerable experience at the Specialist or Principal level managing large-scale production systems. This requirement is non-negotiable.

• Hands-on production experience with Kubernetes (preferably GKE), demonstrating proficiency in troubleshooting issues under pressure.

• Extensive observability knowledge with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Familiarity with Dash0 is highly advantageous.

• Practical experience managing at least two of the following stateful services in production: PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).

• Experience instrumenting and troubleshooting Java services (including JVM tuning, GC, and thread dumps); knowledge of Node.js runtime characteristics is a plus.

• Proven success leading incident response and SLO programs that have genuinely changed engineering practices, rather than producing dashboards that go unnoticed.

• Demonstrated capability to mentor senior engineers and shape technical direction across teams without formal authority.

• Strong communication skills in both English and Portuguese (written and verbal), with the ability to collaborate effectively in cross-functional, remote teams.

🏝️ Benefits

• Flexibility and autonomy at work

• Opportunity for growth and real impact

Site Reliability Engineer
