The Resiliency team is part of the Production Engineering organization, which builds, operates, and improves the heart of Shopify’s technical platform and unlocks the power of planet-scale infrastructure for all of Shopify’s merchants, buyers, and developers.
Shopify has many critical components, and sometimes they fail. Members of our Resiliency team are the ones ensuring we can get back to green as fast as possible when that happens. Resiliency sets the foundation for building and running resilient systems at Shopify. This is a team of engineers who have in-depth operational knowledge of the entire Shopify stack and who act as first responders and leaders during an incident.
Our job is to get to a resolution as quickly as possible, and guide teams to build a more resilient Shopify. We build whatever is necessary to quickly resolve incidents, and seek out ways to automate away the manual toil.
Commerce happens 24/7, and we are building out a globally distributed team that can respond whenever necessary. Our team hires across 4 different regions (APAC, North America West, North America East, and EMEA) in a follow-the-sun support model that also provides 24/7 coverage for incident management.
For this Lead / Staff Production Engineer role, we welcome remote candidates based anywhere in the Hawaii or Pacific time zones. Working hours skew toward Hawaii Standard Time (UTC-10:00). Relocation is possible for the right candidate 🏄🏾‍♀️
What we can offer you:
- The opportunity to run Shopify’s planet-scale systems by enabling engineering teams to create resilient systems.
- Work focusing on a unique set of interesting and challenging problems that can’t be easily found elsewhere.
- The flexibility to define what Resiliency and Site Reliability Engineering mean for Shopify.
- The means to grow the capacity of our worldwide distributed site reliability engineering teams, and consult with other engineering groups on how to build low-latency, highly resilient systems.
- A direct impact on our millions of merchants’ ability to generate revenue for their livelihood, their families, and their employees through the business they’ve built from the ground up on our platform.
- Potential relocation assistance to one of the regions the team operates in.
You’ll work on things like:
- Collaborating with high-calibre engineering teams across Shopify to help them create resilient systems.
- Acting as a force multiplier across and within engineering departments.
- Managing ongoing incidents, using your understanding of Shopify to involve the right teams and resolve them as quickly as possible.
- Cleaning up the noise in our signals, ensuring we can get an understanding of the system and debug a problem easily.
- Responding to automated alerts and executing playbooks.
- Setting standards with teams for building resilient, debuggable systems.
- Ensuring we never fail for the same reason twice.
- Following up on each meaningful incident to ensure the appropriate learnings are extracted and teams know what to do next.
- Helping teams build tools to automate the toil of on-call duties.