Talent.com
Senior DevOps Engineer

EPAM Systems, Chile
Posted more than 30 days ago
Job description

Responsibilities

  • Deploy, configure, and manage GPU-enabled Kubernetes clusters and standalone Linux compute environments to ensure optimal performance and workload scheduling
  • Implement and administer Volcano job scheduling, including queue configuration, Pod execution, GPU allocation, and namespace quota enforcement
  • Oversee end-to-end Kubernetes environments, including namespaces, RBAC, resource quotas, and workload isolation strategies
  • Develop and maintain automation scripts using Python and Shell to streamline job submission, resource provisioning, and system reporting
  • Collaborate with orchestration, optimization, and observability teams to enhance scheduling efficiency, capacity utilization, and researcher workflows
  • Monitor the health and resource utilization of infrastructure, providing insights and data to support optimization and reporting requirements
  • Identify and recommend improvements for infrastructure, tooling, and automation workflows to enhance scalability, usability, and performance
  • Ensure seamless operational processes to deliver efficient experiences for researchers working on diverse AI and computational workloads

Requirements

  • At least 3 years of experience in DevOps or infrastructure engineering roles within large-scale, complex environments
  • Advanced proficiency in Kubernetes administration, including namespaces, Pod scheduling and distribution, PVC, NFS, and resource quota management
  • Hands-on experience with Volcano scheduler, including GPU job execution, queue configuration, workload prioritization, and Kubernetes integration
  • Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes
  • Advanced Python scripting skills for automating infrastructure tasks, along with strong UNIX Shell scripting expertise (e.g., Bash)
  • Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
  • Solid understanding of infrastructure automation and orchestration concepts and tools
  • Fluent English skills, both written and spoken, at B2+ level or higher

Nice to have

  • Experience with Helm package management for Kubernetes applications
  • Knowledge of monitoring and observability tools, including Prometheus, Grafana, and Loki
  • Familiarity with Infrastructure as Code tools such as Terraform
  • Multi-cloud Kubernetes experience across platforms like Amazon EKS and Google GKE
  • Understanding of Azure networking concepts, including VPN, ExpressRoute, and network security
  • Experience with AI-assisted coding tools such as GitHub Copilot, ChatGPT, or Claude
  • Knowledge of hybrid environments combining cloud and on-premises resource scheduling and optimization

We offer

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Job details

  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Engineering, Information Technology, and Business Development
  • Industries: Software Development, IT Services and IT Consulting, and Technology, Information and Internet