Research Computing SRE

New York City, NY 10019

Posted: 10/07/2019 Category: Cloud, Network/Systems, Opportunistic Job Number: 12354

Job Description


The Research Infrastructure Cloud HPC team is a group of experts solving computing problems in the critical path of Research. We work directly with Research and Model Implementation teams and provide them with tools and compute resources to take their ideas from inception to real tradable products. We are looking for an ambitious and operationally minded software engineer to join our team as we mature and scale our cloud HPC platform from a successful strategy-specific offering to the next iteration of our firm-wide Research platform.  

Responsibilities 

We are a small, flat team sitting at the intersection of research, implementation, and systems infrastructure. Our team responsibilities span many areas. Below is a sampling of the types of work you will be expected to do:
  • Design and implementation of cloud-based HPC systems. Our projects typically involve equal parts engineering and operations for success in our fast-moving environment. You will be expected to do both for projects small and large.
  • Running our HPC plant day-to-day. Our research environment is up 24/7, and we want to keep it that way. Everybody on the team contributes to the support of our plant, which thankfully is light because of our automation and quality work.
  • Implementing automation. We will always choose working smart over working hard. You will be responsible for the conception and implementation of automation, from CI/CD pipelines to production metrics and monitoring of our cloud HPC platform.
  • Capacity management and benchmark optimization. Our demand for compute is constant and involves challenging problems focused on scaling our compute and optimizing it for research-critical workloads.
  • Obsessive user focus. All members of the team are expected to partner with researchers and engineers to deliver high-quality cloud HPC systems that are efficient and reliable. This includes leading projects to evolve the platform as our needs change.

Qualifications 
  • 5+ years of software engineering and/or systems programming experience 
  • 2+ years of experience working with a public cloud, AWS preferred
  • Mastery of at least one programming language for building production systems, Python preferred
  • Experience with a production configuration management tool, Salt/SaltStack preferred
  • Experience with a cloud-based infrastructure-as-code tool, Terraform preferred 
  • Excellent written and verbal communication skills 
  • Past experience working with or supporting researchers and/or other developers is a plus
  • Knowledge of Slurm or similar HPC schedulers and resource managers is a plus

Education
  • Bachelor’s degree in computer science, engineering, or a related field from a strong academic program.

 

Meet Your Recruiter

Jordan Zmick
