DevOps / Site Reliability Engineer (SRE)

Bayraktar Technologies

Bac Lieu, Vietnam

Fresher

Posted 3 months ago

Job Description

Summary

We are looking for DevOps / SRE engineers to join our Server Systems Management and DevOps team and take an active role in managing our highly available, scalable infrastructure.

In this role, you will be responsible for the continuity of critical systems running in multi-cloud environments, primarily AWS and Huawei Cloud (HWC), and you will help improve automation, monitoring, and incident management processes.

Education

Bachelor's degree in Computer Engineering, Software Engineering, or a related engineering discipline, or equivalent practical experience.

Responsibilities

Infrastructure management and optimization in AWS and Huawei Cloud environments
Setup, management, and monitoring of Kubernetes clusters (production / pre-production)
Setup and improvement of CI/CD pipelines
Management of High Availability (HA) and Disaster Recovery (DR) scenarios
Management of monitoring and alerting infrastructure using tools such as Prometheus, Grafana, and Loki
Performance, capacity, and cost optimization activities
Participation in on-call rotation and response to critical incidents
Improvement of security, logging, and backup processes
Close collaboration with development teams to resolve deployment and runtime issues

Qualifications

Proficiency in Linux system administration (preferably Ubuntu / CentOS)
Proficiency in Docker and container concepts
Proficiency with core Kubernetes resources (Pods, Services, Ingress, Deployments, HPA, etc.)
Proficiency in at least one CI/CD tool (GitLab CI, GitHub Actions, Jenkins, etc.)
Knowledge of networking fundamentals (TCP/IP, DNS, NAT, Load Balancers, Firewalls)
Proficiency in monitoring and logging concepts
Strong problem-solving and analytical thinking skills
Ability to remain calm and act systematically during critical situations
Strong attention to documentation
Strong teamwork skills
Openness to learning and self-improvement
Ability to adapt to rotational on-call duty when required
Strong sense of responsibility for minimizing downtime in critical systems
Ability to adapt to flexible working hours based on operational requirements

Preferred

Experience with AWS services (EC2, VPC, ALB/NLB, RDS, IAM, CloudWatch)
Experience with Huawei Cloud (HWC) or other cloud service providers
Knowledge of Infrastructure as Code tools (Terraform, Ansible, Helm)
Knowledge of Prometheus, Grafana, Loki, ELK
Experience working with systems such as OpenStack, Ceph, Couchbase, Elasticsearch
Experience with on-call and incident management