Research Computing SRE
New York City, NY 10019
The Research Infrastructure Cloud HPC team is a group of experts solving computing problems in the critical path of Research. We work directly with Research and Model Implementation teams and provide them with tools and compute resources to take their ideas from inception to real tradable products. We are looking for an ambitious and operationally minded software engineer to join our team as we mature and scale our cloud HPC platform from a successful strategy-specific offering to the next iteration of our firm-wide Research platform.
We are a small, flat team sitting at the intersection of research, implementation, and systems infrastructure. Our responsibilities span many areas. Below is a sampling of the types of work you will be expected to do:
- Design and implementation of cloud-based HPC systems. Our projects typically involve equal parts engineering and operations for success in our fast-moving environment. You will be expected to do both for projects small and large.
- Running our HPC plant day-to-day. Our research environment is up 24/7, and we want to keep it that way. Everybody on the team contributes to supporting our plant; thankfully, the support load is light because of our automation and quality work.
- Implementing automation. We always choose working smart over working hard. You will be responsible for conceiving and implementing automation for our cloud HPC platform, from CI/CD pipelines to production metrics and monitoring.
- Capacity management and benchmark optimization. Our demand for compute is constant, and meeting it involves challenging problems in scaling our compute and optimizing it for research-critical workloads.
- Obsessive user focus. All members of the team are expected to partner with researchers and engineers to deliver high-quality cloud HPC systems that are efficient and reliable. This includes leading projects to evolve the platform as our needs change.
Qualifications:
- 5+ years of software engineering and/or systems programming experience
- 2+ years of experience working with a public cloud, AWS preferred
- Mastery of at least one programming language used to build production systems, Python preferred
- Experience with a production configuration management tool, Salt/SaltStack preferred
- Experience with a cloud-based infrastructure-as-code tool, Terraform preferred
- Excellent written and verbal communication skills
- Past experience working with or supporting researchers and/or other developers is a plus
- Knowledge of Slurm or similar HPC schedulers and resource managers is a plus
- Bachelor’s degree in computer science, engineering, or a related field from a strong academic program