Site Reliability Engineer

 

The Site Reliability Engineer (SRE) is responsible for keeping all user-facing services and other Parler production systems running smoothly. The role blends operations and engineering and requires a skill set well beyond that of a monitoring-only position.

Parler is a highly trafficked social media website and mobile app built on a host of technologies. Uptime is of the utmost importance, and the public nature of the system calls for an SRE with curiosity and creativity as we absorb ever-increasing traffic.

 

As an SRE, you will:

·        Run our infrastructure with Ansible, Terraform, Chef, SaltStack, or Puppet, following DevOps best practices

·        Make sure monitoring and alerting trigger on symptoms, not on outages

·        Document every action so your findings turn into repeatable actions, and then into automation

·        Design, build, and maintain core infrastructure pieces that allow Parler to scale to hundreds of thousands of concurrent users

·        Improve the deployment process to make it as boring as possible

·        Debug production issues across services and all levels of the stack

·        Join a PagerDuty rotation to respond to Parler availability incidents and support service engineers handling customer incidents

·        Use your on-call shift to prevent incidents from ever happening

·        Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, ProxySQL or similar technologies

·        Projects you could work on: coding infrastructure automation with Chef and Terraform

·        Develop a relationship with a product group, define their SLAs, share Parler.com data on those SLAs, and improve their reliability

Areas of expertise/contribution for Leveling

·        Reasonably strong programming skills in PHP (creating tools, accessing data)

·        Ability to use Chef and Ansible to efficiently manage our infrastructure

·        Intermediate-level Unix knowledge

·        Load balancing the application through proxying, serving images via object storage and a CDN, and containerizing our system for Kubernetes

·        Backend storage management and scaling

·        Disaster Recovery and High Availability strategy