Job Summary:
The Senior System Engineer is responsible for operating, troubleshooting, and optimizing large-scale cloud systems based on the OpenStack platform, with a strong focus on networking, the SDN data plane, kernel interaction, container/runtime behavior, automation, and system performance analysis.
Key Responsibilities:
- Operate and troubleshoot OpenStack components (such as Neutron, Nova, and load balancing) or equivalent cloud platforms, focusing on tenant networking, routing, NAT, security groups, and production issue resolution.
- Analyze end-to-end packet flows and debug connectivity issues, packet loss, high latency, or unstable system behavior using tools such as tcpdump, iproute2, flow inspection, logs, and system traces.
- Work with SDN or virtual networking technologies such as Open vSwitch (OVS), Open Virtual Network (OVN), Tungsten Fabric/Contrail, VMware NSX, or equivalent solutions, applying a strong understanding of overlay networking technologies such as VXLAN, MPLS, and EVPN.
- Preferred additional experience includes OpenStack Neutron, Tungsten Fabric/Contrail, EVPN/MPLS, VPN/IPSec, kernel tuning, Docker/containerd internals, or systems that process high packets-per-second (PPS) rates.
- Investigate performance bottlenecks, including PPS limitations, CPU saturation, NIC offload behavior, MTU mismatch, RSS, NUMA/CPU pinning, kernel network stack behavior, and feature compatibility across operating systems, kernels, drivers, and platform versions.
- Debug system-level issues related to the Linux kernel, Docker/container runtime behavior, differences between cgroup v1 and v2, kernel modules, driver interactions, and feature mismatches across distributions or kernel versions.
- Build or use automation to collect logs, inspect system configurations, validate runtime states, compare configurations across nodes, and support large-scale operational standardization using tools such as Ansible combined with shell or Python scripts.
- Handle production incidents, conduct root cause analysis, and coordinate with monitoring/logging systems to identify systemic issues and prevent recurrence.
Requirements:
- Strong Linux system troubleshooting skills, with a solid understanding of how the kernel interacts with system internals, the networking stack, process and resource management mechanisms, and container runtimes such as Docker or containerd.
- Strong networking fundamentals, including TCP/IP, routing, NAT, and L2/L3 operations in virtualization and overlay network environments.
- Hands-on experience with virtual networking, SDN, or cloud networking platforms such as OpenStack, Kubernetes networking, VMware, or equivalent systems.
- Ability to debug issues using packet-level and system-level tools, rather than relying solely on configurations, management interfaces, or vendor documentation.
- Experience using automation/configuration management tools such as Ansible to collect logs, inspect system parameters, validate configuration consistency, and safely deploy operational changes across multiple nodes.
- Strong programming mindset, with the ability to read, review, and troubleshoot code or logic in Python, Go, Shell, or C/C++, and analyze root causes beyond standard runbooks.