Lead Reliability Engineer

Houston, TX 77002

Posted: 05/02/2017
Category: Java, Opportunistic, Python
Job Number: 8905

Lead Reliability Engineer (leading a small team of two; we need a player/coach).

At Our Firm, we operate at scale. Our data analysis, modeling, and trading systems all run in environments where growth is continuous and stable operation at scale is critical. Software engineers within our Reliability Engineering Technology team are charged with developing the tools that make scalable distributed systems at our firm possible. These tools include test frameworks and infrastructure; logging, monitoring, and metrics collection; dashboarding and alerting; and coordination and deployment for distributed systems. As a member of this group of versatile full-stack engineers, your remit will include:

  • Acting as the technology arm of Reliability Engineering, our core DevOps organization;
  • Developing or improving foundational technology used by our engineering teams to build distributed services;
  • Improving all aspects of software reliability, including better monitoring, alerting and documentation;
  • Engaging with our software engineering teams on improvements to our tools, processes and software;
  • Supporting some core services used by our engineering teams.

Requirements include:
  • A bachelor’s degree in computer science or another highly technical, scientific discipline.
  • Ability to program (structured and OO) in one or more high-level languages (such as Python, Java, C/C++).
  • Familiarity with open source tools used for deployment, logging, and monitoring (e.g. Ansible, Elasticsearch, InfluxDB, Prometheus).
  • Familiarity with resource management frameworks such as Mesos, Kubernetes, and YARN.
  • A proven track record of automation and an algorithmic approach to solving problems.

Additional skills preferred:

  • In-depth knowledge and experience in at least one of the following, along with a desire to learn more: host-based networking, Linux/UNIX administration, systems programming, distributed systems, databases, or cloud computing.
  • A proactive approach to spotting problems, areas for improvement, performance bottlenecks, etc.
  • An understanding of the operational concerns in a demanding environment; ideally, but not necessarily, finance.
  • The ability to understand the inherent trade-offs between various software architectures as they relate to performance, resiliency/fault tolerance, load balancing, and data consistency.
Python, Java, C/C++, Linux, UNIX

Jordan Zmick
