Remotery

Site Reliability Engineer – SRE

Posted 6 days ago

This is a fully remote position, open to applicants in Brazil.

📋 Description

• Define and monitor reliability metrics (SLI, SLO, SLA) and operate within the error budget.

• Develop strategies for high availability, resilience, and disaster recovery (RTO/RPO).

• Perform capacity planning and analyze service performance.

• Enhance the reliability and performance of applications operating on Kubernetes.

• Design and improve system observability (logs, metrics, traces, and alerts).

• Create actionable dashboards and alerts aimed at increasing visibility while minimizing noise and false positives.

• Identify issues proactively through service instrumentation.

• Establish and manage the incident response process (classification, severity, on-call duties).

• Lead or assist in troubleshooting applications and distributed systems.

• Conduct root cause analyses (RCA) and post-mortems, and recommend preventative measures.

• Develop and maintain operational runbooks.

• Automate operational tasks and incident responses (self-healing) to eliminate repetitive manual processes.

• Utilize AI for log analysis, anomaly detection, troubleshooting, and optimization (AIOps).

• Consistently apply the principle “automate before repeating” to enhance operational maturity.

• Collaborate with development and platform teams to foster continuous improvement in reliability.

• Cultivate a culture of reliability and best practices across teams.

• Implement security best practices in production environments (secrets, access control, segregation).

• Ensure traceability through logs, auditing, and events.

• Support compliance with standards such as ISO 27001 and the adoption of DevSecOps practices.

• Integrate reliability and security (Security by Design).


⛳️ Requirements

• Experience or familiarity with observability tools (Grafana, Prometheus, Elastic, Dynatrace, or similar).

• Experience or familiarity with Kubernetes and container technologies (Docker).

• Knowledge of Linux and networking concepts (HTTP, DNS, TLS/SSL).

• Proficiency in scripting and automation (Shell, Python, or similar).

• Strong analytical skills with a focus on problem-solving.

• Regular use of AI in daily tasks, with an automation-oriented mindset ("automate before repeating").

• Organized and autonomous, with strong technical communication skills at a level comparable to a mid-level or senior Full Stack Developer working on production projects.

• Quick learner with a continuous desire to acquire new knowledge.

• Empathetic towards customer needs.

• Commitment to delivering an exceptional customer experience.

• Team-oriented mindset; willing to offer and seek assistance.

• Excellent communication skills for interaction across various teams.

• Proactive and well-organized.

• Alignment with our core values: Honesty and Ethics; Excellence and Care in Deliverables; Recognition; Respect and Courtesy.

• Experience with SLI, SLO, and Error Budget frameworks.

• Expertise in troubleshooting distributed systems.

• Background in critical, high-availability environments.

• Familiarity with APM tools (Dynatrace, Datadog).

• Knowledge of OpenTelemetry and service instrumentation.

• Understanding of Kafka, Elasticsearch, or Redis.

• Experience with incident automation (self-healing) and Infrastructure as Code (IaC) using tools like Terraform and Ansible.

• Knowledge of Chaos Engineering and service mesh technologies.

• Experience applying AI to operational processes (AIOps, technical copilots).

• Background in regulated environments (government, legal, or financial sectors).


🏝️ Benefits

• Health plan: Comprehensive coverage for your health needs.

• Life insurance: Providing security and peace of mind for you and your family.

• Partner discounts: Access to services from pharmacies, nutritionists, and psychologists at special rates.

• Well-being app (Clude): Encouragement for physical activities and overall well-being.

• Total Pass: Access to a wide network of local gyms.

• Workplace exercise: Active breaks to maintain your physical health during work hours.

• Meal allowance: Available for CLT employment contracts.

• Caju Card: A special gift to celebrate your birthday month.

• Home office allowance: Financial support for establishing a comfortable and efficient workspace.

• Education assistance: Support for your academic and professional growth.

• Book allowance: Encouragement to enhance your knowledge through reading.

• Continuous development: Programs and initiatives designed to advance your career.

• Innovation program: A platform for you to share ideas and make a meaningful impact.

• Dual screen: Providing the right tools to boost your productivity.

• 100% remote position: Freedom to work from your preferred location.

• FreeDay: Enjoy a day off to recharge.

• Moment Off: We promote breaks for relaxation and disconnection.

• Time off for your graduation: We celebrate your educational achievements with you.

• Gift for new children of employees: A gesture to commemorate the arrival of a new family member.

• Welcome-back gift after paternity leave: Support for your transition back to work.

• Supportive and collaborative environment: A team that thrives together.

• Eco-friendly welcome kit: Start your journey with us in an environmentally conscious way.

• Sustainable culture: Engage in practical initiatives, such as promoting composting.

• Virtual social gatherings: Opportunities to celebrate and connect with the team.

• Ongoing engagement campaigns throughout the year.
