Site Reliability Engineer (SRE)

The Site Reliability Engineer will be responsible for keeping all user-facing services and other Parler production systems running smoothly. This SRE role blends operations and engineering, and it requires a more advanced skillset than a simple monitoring role.

Parler is a highly trafficked social media website and mobile app built on a host of technologies. Uptime is of the utmost importance, and the public nature of the system calls for an SRE with curiosity and creativity as we absorb ever-increasing traffic.

As SRE you will:

· Run our infrastructure with Ansible, Terraform, Chef, SaltStack, or Puppet, following DevOps best practices

· Make monitoring and alerting trigger on symptoms, not on outages

· Document every action so your findings turn into repeatable actions, and then into automation

· Design, build, and maintain core infrastructure pieces that allow Parler to scale to hundreds of thousands of concurrent users

· Improve the deployment process to make it as boring as possible

· Debug production issues across services and levels of the stack

· Be on a PagerDuty rotation to respond to Parler availability incidents and provide support for service engineers with customer incidents

· Use your on-call shift to prevent incidents from ever happening

· Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, ProxySQL or similar technologies

Projects you could work on:

· Coding infrastructure automation with Chef and Terraform

· Develop a relationship with a product group, define their SLAs, share Parler.com data on those SLAs and improve their reliability

Areas of expertise/contribution for Leveling

· Reasonably strong programming skills in PHP (creating tools, accessing data)

· Ability to use Chef and Ansible to efficiently manage our infrastructure

· Intermediate-level Unix knowledge

· Load balancing the application using proxying, image serving via Object Store and CDN, as well as containerizing our system for Kubernetes

· Backend storage management and scaling

· Disaster Recovery and High Availability strategy