We’re seeking an experienced Senior Systems Engineer (HPC) to administer and support our High Performance Computing cluster. You will work closely with engineering and data management teams on cutting-edge technologies.
This is an extraordinary opportunity to be part of a high-performing team and to pursue a life-changing mission with unique technical challenges!
You must have the right to work in the US to be eligible for this role. We will sponsor applicants who have existing visa needs.
Responsibilities
- Design, plan, test and implement innovative hardware designs for an HPC environment
- Implement, support, and provide technical guidance for engineering team initiatives and projects
- Build automation for infrastructure provisioning, configuration management, and account access (emphasis on SaltStack)
- Install, provision, and support a complex Cisco Nexus HPC switching environment (RoCE)
- Design, structure, and maintain Pure Storage and Qumulo enterprise network-attached storage (NAS) systems
- Regularly evaluate and recommend new tools and technologies for use in existing and future clusters
- Deploy patches and updates to operating systems and application software
Required Skills and Experience
- Master’s degree in computer science, engineering, information systems, or a related field, or equivalent years of experience
- 8+ years’ experience in a systems engineering role
- Deep knowledge of server components (CPU, SSD, GPU, networking)
- Deep knowledge of High Performance Computing (HPC) / cluster technologies with high-speed interconnect fabrics using Ethernet/RoCE and InfiniBand
- Expert knowledge of SAN and NAS services (iSCSI, NFS, CIFS)
- Expert knowledge of TCP/IP networking, network security, and DNS (BIND, Windows)
- Expert knowledge of Linux (Ubuntu, CentOS), common UNIX services, and shell scripting
- Strong understanding of high-speed HPC interconnects
- Strong knowledge of parallel GPU computing, MPI, and RDMA within containerized environments
- Strong knowledge of the NVIDIA software environment (NCCL, NGC, GPU tools)
- Strong experience operating and administering workload schedulers such as Slurm, LSF, or PBS
- Strong knowledge of virtualization technologies such as KVM/libvirt/QEMU
- Experience working with configuration management tools like SaltStack, Chef, or Puppet
- Working knowledge of Kubernetes and Docker containers within an on-prem HPC cluster
- Understanding of data pipelines, including ETL and streaming of log data or tool/sensor data to indexes (EMR)
- Understanding of cloud platforms and services, particularly AWS
- Understanding of Jupyter Notebook technology
- Understanding of CI/CD pipelines
- Understanding of Agile development methodologies