Summary
We are looking for DevOps / SRE team members who will take an active role in managing our highly available and scalable infrastructure within our Server Systems Management and DevOps team.
In this role, you will be responsible for ensuring the continuity of critical systems running in multi-cloud environments, primarily AWS and Huawei Cloud (HWC), and you will contribute to the improvement of automation, monitoring, and incident management processes.
Education
- Bachelor's degree in Computer Engineering, Software Engineering, or a related engineering discipline, or equivalent practical experience.
Responsibilities
- Infrastructure management and optimization in AWS and Huawei Cloud environments
- Setup, management, and monitoring of Kubernetes clusters (production / pre-production)
- Setup and improvement of CI/CD pipelines
- Management of High Availability (HA) and Disaster Recovery (DR) scenarios
- Management of monitoring and alerting infrastructure using tools such as Prometheus, Grafana, and Loki
- Performance, capacity, and cost optimization activities
- Participation in on-call rotation and response to critical incidents
- Improvement of security, logging, and backup processes
- Close collaboration with development teams to resolve deployment and runtime issues
Qualifications
- Proficiency in Linux system administration (preferably Ubuntu / CentOS)
- Proficiency in Docker and container concepts
- Proficiency in Kubernetes core components (Pods, Services, Ingress, Deployments, HPA, etc.)
- Proficiency in at least one CI/CD tool (GitLab CI, GitHub Actions, Jenkins, etc.)
- Knowledge of networking fundamentals (TCP/IP, DNS, NAT, Load Balancers, Firewalls)
- Proficiency in monitoring and logging concepts
- Strong problem-solving and analytical thinking skills
- Ability to remain calm and act systematically during critical situations
- Strong attention to documentation
- Strong teamwork skills
- Openness to learning and self-improvement
- Ability to adapt to rotational on-call duty when required
- Strong sense of responsibility for minimizing downtime in critical systems
- Ability to adapt to flexible working hours based on operational requirements
Preferred
- Experience with AWS services (EC2, VPC, ALB/NLB, RDS, IAM, CloudWatch)
- Experience with Huawei Cloud (HWC) or other cloud service providers
- Knowledge of Infrastructure as Code tools (Terraform, Ansible, Helm)
- Knowledge of Prometheus, Grafana, Loki, ELK
- Experience working with systems such as OpenStack, Ceph, Couchbase, and Elasticsearch
- Experience with on-call and incident management