
Site Reliability Engineer
Posted 22 hours ago

• Take charge of the technical strategy for our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit), establishing instrumentation standards for Java and Node.js services while promoting the use of tracing, metrics, and structured logging.
• Define meaningful SLIs, SLOs, and error budgets, collaborating with engineering and product teams to make them part of real engineering decisions.
• Serve as a senior incident commander during incident response, conducting blameless postmortems with thorough technical analysis and actionable follow-up.
• Improve our on-call program so that it is humane and sustainable, treating the reduction of unnecessary work and alert noise as a key engineering objective.
• Shape architectural decisions across the platform, delving deep into critical areas such as GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
• Coach SREs and platform engineers, elevate the technical standards through design and incident reviews, and contribute to the growth of the SRE discipline at Digibee.
• A minimum of 8 years in SRE, infrastructure, or platform engineering, with considerable experience at the Specialist or Principal level managing large-scale production systems. This requirement is non-negotiable.
• Hands-on production experience with Kubernetes (preferably GKE), demonstrating proficiency in troubleshooting issues under pressure.
• Extensive observability knowledge with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Familiarity with Dash0 is highly advantageous.
• Practical experience managing stateful services in production, covering at least two of the following: PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).
• Experience in instrumenting and troubleshooting Java services (including JVM tuning, GC, thread dumps); knowledge of Node.js runtime characteristics is a plus.
• Proven success in leading incident response and SLO programs that have genuinely influenced engineering practices, not just produced dashboards that go unnoticed.
• Demonstrated capability to mentor senior engineers and shape technical direction across teams without formal authority.
• Strong written and verbal communication skills in both English and Portuguese, with the ability to collaborate effectively in cross-functional, remote teams.
• Flexibility and autonomy at work
• Opportunity for growth and real impact