Site Reliability Engineer
The Site Reliability Engineer will be responsible for keeping all user-facing services and other Parler production systems running smoothly. This SRE role blends operations and engineering and requires a skill set more advanced than that of a simple monitoring role.
Parler is a highly trafficked social media website and mobile app built on a host of technologies. Uptime is of the utmost importance, and the public nature of the system requires an SRE with curiosity and creativity as we absorb ever-increasing traffic.
As an SRE, you will:
· Run our infrastructure with Ansible, Terraform, Chef, SaltStack, or Puppet, following DevOps best practices
· Make monitoring and alerting alert on symptoms and not on outages.
· Document every action so your findings turn into repeatable actions, and then into automation.
· Design, build and maintain core infrastructure pieces that allow Parler to scale to support hundreds of thousands of concurrent users.
· Improve the deployment process to make it as boring as possible
· Debug production issues across services and levels of the stack
· Be on a PagerDuty rotation to respond to Parler availability incidents and provide support for service engineers with customer incidents
· Use your on-call shift to prevent incidents from ever happening
· Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, ProxySQL or similar technologies
Projects you could work on:
· Coding infrastructure automation with Chef and Terraform
· Develop a relationship with a product group, define their SLAs, share Parler.com data on those SLAs and improve their reliability
Areas of expertise/contribution for Leveling
· Have (decently) strong programming skills in PHP (creating tools / accessing data)
· Ability to use Chef and Ansible to efficiently manage our infrastructure
· Intermediate-level Unix knowledge
· Load balancing the application using proxying, image serving via Object Store and CDN, as well as containerizing our system for Kubernetes
· Backend storage management and scaling
· Disaster Recovery and High Availability strategy