Remotery

Site Reliability Engineer

Posted 22 hours ago

📋 Description

• Take charge of the technical strategy for our observability stack (Dash0, OpenTelemetry, Elasticsearch/Logstash/Fluent Bit), establishing instrumentation standards for Java and Node.js services while promoting the use of tracing, metrics, and structured logging.

• Define meaningful SLIs, SLOs, and error budgets, working with engineering and product teams so that they inform real engineering decisions.

• Serve as a senior incident commander during incident response, leading blameless postmortems with thorough technical analysis and actionable follow-up.

• Improve our on-call program so that it is humane and sustainable, treating the reduction of unnecessary work and alert noise as a key engineering objective.

• Shape architectural decisions across the platform, going deep on critical areas such as GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.

• Coach SREs and platform engineers, raise technical standards through design and incident reviews, and contribute to the growth of the SRE discipline at Digibee.


⛳️ Requirements

• A minimum of 8 years in SRE, infrastructure, or platform engineering, with substantial experience at the Specialist or Principal level managing large-scale production systems. This requirement is non-negotiable.

• Hands-on production experience with Kubernetes (preferably GKE), demonstrating proficiency in troubleshooting issues under pressure.

• Extensive observability knowledge with OpenTelemetry, Prometheus, distributed tracing, and centralized logging (Elasticsearch, Logstash, Fluent Bit, or similar). Familiarity with Dash0 is highly advantageous.

• Practical experience running at least two of the following stateful services in production: PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (MinIO/S3).

• Experience in instrumenting and troubleshooting Java services (including JVM tuning, GC, thread dumps); knowledge of Node.js runtime characteristics is a plus.

• Proven success in leading incident response and SLO programs that have genuinely changed engineering practices, not just produced dashboards that go unnoticed.

• Demonstrated capability to mentor senior engineers and shape technical direction across teams without formal authority.

• Strong communication skills in both English and Portuguese (written and verbal), with the ability to collaborate effectively in cross-functional, remote teams.


🏝️ Benefits

• Flexibility and autonomy at work

• Opportunity for growth and real impact
