Site Reliability Engineer

Ctrl IQ is a post-Series A company focused on modernizing High Performance Computing (HPC) infrastructure and capabilities, not only for traditional HPC (e.g. simulation, pharmaceuticals/medicine, energy, aerospace, and financial services/trading) but also for enterprise-focused computing needs such as AI/ML training and inference, compute, and data analytics.

Ctrl IQ develops software infrastructure for the enterprise. We are the founding company behind Rocky Linux and have brought to market major advances within the traditional High Performance Computing (HPC) ecosystem (e.g. simulation, pharmaceuticals/medicine, energy, aerospace, and financial services/trading) as well as for enterprise-focused computing needs such as AI/ML training and inference, compute, and data analytics.

For this position, we are seeking a talented and experienced software/site reliability engineer to build and maintain the infrastructure for our Cloud 2.0 solution.

Successful candidates will have interest and experience in some of the following areas: containers (Singularity, Docker, OCI, etc.), orchestration (Kubernetes/Nomad/Mesosphere), distributed workloads, data movement, AI/ML training, DevOps, container registries, security, PKI, and encryption.

The SRE serves as the operational/reliability side of the team, providing highly available, hands-off deployment of our services. Both on-prem and cloud deployments are part of the game plan. For someone who wants to become a Lead or a Manager, there are growth opportunities as Ctrl IQ grows.

Responsibilities:

Work closely with the development team
Take part in architecture-level discussions, planning, and implementation (lines of Go & Terraform code)
Research to ensure what we are building is always the best path forward
Document each project to facilitate integration for users
Drive proofs of concept and minimum viable products for demonstration
Bring a release-fast, release-often software development mentality
Deliver Infrastructure as Code

Skills that will help in general:

Friendly, collaborative, humble, honest, and always striving to be better
Excellent communication skills
Ability to work independently as well as collaboratively in a remote team environment
Ability to identify, analyze, and resolve complex software design problems
Contributions to open source software projects
Experience with Kubernetes
Experience with Go
Experience with Terraform

Required for the SRE:

Excellent communication skills
Cloud Experience (AWS/Azure/GCP)
Linux fluency
3+ years of SRE/related experience: this shouldn’t be your first rodeo.
2+ years of programming experience (a prior role as a software engineer would be ideal)

We currently offer full benefits (medical, dental, and vision; medical coverage for both employees and their dependents is 80% employer/20% employee) to all of our regular full-time U.S.-based employees, along with bonuses, stock options, and a flexible hours/time-off policy.

Remote work; no travel required for most positions.