This is a fully remote position, open to applicants in Taiwan.

• Oversee and maintain container orchestration systems and containerized applications.

• Monitor and diagnose production environments, taking part in on-call rotations to guarantee reliability.

• Improve observability by refining monitoring, logging, and alerting across systems and data platforms.

• Manage and optimize cloud environments across several providers.

• Support and manage distributed data platforms and real-time processing systems.

• Build and maintain continuous integration and delivery pipelines for efficient, reliable deployments.

• Lead and apply Infrastructure as Code (IaC) methodologies to ensure uniformity and scalability.

• Automate and manage infrastructure using programming and scripting languages.

• Perform system administration and networking tasks in support of both internal and external environments.

• Collaborate effectively with engineers and stakeholders across time zones.

• A minimum of 5 years in Site Reliability Engineering, DevOps, or Platform Engineering roles.

• Proven track record of managing large-scale production systems in cloud environments (AWS, GCP, Azure, or OCI).

• Demonstrated ability to lead incident response, apply on-call best practices, and foster a reliability-oriented culture.

• Strong background in production on-call operations and incident management.

• Advanced expertise in Kubernetes administration and troubleshooting.

• Practical experience with observability tools such as Prometheus, Grafana, Loki, and Alertmanager.

• Familiarity with chat-based operational interfaces and/or auto-remediation controllers built on agentic AI frameworks.

• Understanding of how AI agents can auto-triage alerts, correlate signals, and propose root-cause hypotheses.

• Proficiency in operating data platforms (Elasticsearch, MongoDB, Spark, Kafka, Redis).

• Competence with public cloud services (AWS, Azure, GCP, or OCI).

• Strong programming and automation skills in Python and Bash.

• Deep knowledge of Infrastructure as Code (Terraform, Helm).

• Experience with CI/CD pipelines (GitHub Actions, Bitbucket, ArgoCD).

• Solid technical foundation in distributed systems, databases, networking, and Linux administration.

• Excellent problem-solving, communication, and leadership skills.

• Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.

• Certifications in AWS, GCP, Observability, Linux, or Kubernetes are advantageous.

• Competitive salary and performance-based bonuses.

• Comprehensive health, dental, and vision insurance.

• Flexible working hours and remote work options.

• Opportunities for professional development and continuous learning.

• Collaborative and inclusive work environment.

Senior SRE Engineer
