Senior DevOps Engineer
EPAM Systems, Chile
Posted more than 30 days ago
Job description
Responsibilities
Deploy, configure, and manage GPU-enabled Kubernetes clusters and standalone Linux compute environments to ensure optimal performance and workload scheduling
Implement and administer Volcano job scheduling, including queue configuration, Pod execution, GPU allocation, and namespace quota enforcement
Oversee end-to-end Kubernetes environments, including namespaces, RBAC, resource quotas, and workload isolation strategies
Develop and maintain automation scripts using Python and Shell to streamline job submission, resource provisioning, and system reporting
Collaborate with orchestration, optimization, and observability teams to enhance scheduling efficiency, capacity utilization, and researcher workflows
Monitor the health and resource utilization of infrastructure, providing insights and data to support optimization and reporting requirements
Identify and recommend improvements for infrastructure, tooling, and automation workflows to enhance scalability, usability, and performance
Ensure seamless operational processes to deliver efficient experiences for researchers working on diverse AI and computational workloads
Requirements
At least 3 years of experience in DevOps or infrastructure engineering roles within large-scale, complex environments
Advanced proficiency in Kubernetes administration, including namespaces, Pod scheduling and distribution, PVC, NFS, and resource quota management
Hands-on experience with Volcano scheduler, including GPU job execution, queue configuration, workload prioritization, and Kubernetes integration
Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes
Advanced Python scripting skills for automating infrastructure tasks, along with strong UNIX Shell scripting expertise (e.g., Bash)
Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
Solid understanding of infrastructure automation and orchestration concepts and tools
Fluent written and spoken English (B2+ level or higher)
Nice to have
Experience with Helm package management for Kubernetes applications
Knowledge of monitoring and observability tools, including Prometheus, Grafana, and Loki
Familiarity with Infrastructure as Code tools such as Terraform
Multi-cloud Kubernetes experience across platforms like Amazon EKS and Google GKE
Understanding of Azure networking concepts, including VPN, ExpressRoute, and network security
Experience with AI-assisted coding tools such as GitHub Copilot, ChatGPT, or Claude
Knowledge of hybrid environments combining cloud and on-premises resource scheduling and optimization
We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Job details
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering, Information Technology, and Business Development
Industries: Software Development, IT Services and IT Consulting, and Technology, Information and Internet