Site Reliability Engineering Manager

Engineering Team | San Francisco, CA

Site Reliability Engineering (SRE) is a hybrid software/systems group that works with traditional software engineering, capacity engineering, and infrastructure teams to ensure that Dropbox runs smoothly. Managing an SRE team requires a high degree of technical mastery, the ability to brutally prioritize and execute, and a focus on growing teams through both recruiting and mentorship.

Responsibilities

Manage engineers working with infrastructure and product engineering teams. Example services may include our metadata storage infrastructure running on MySQL, Go-based processes serving as RPC systems for frontend components, or frontend components themselves, such as the Photos tab.
Understand technical architectures, failure domains, tooling/automation, product launch plans, disaster recovery/business continuity plans, and other issues. You will be asked to create plans for prioritizing technical and resourcing challenges within the infrastructure organization.
Partner with product management, network engineering, product engineering, and other related groups.
Help engineers develop their careers, assigning them to projects tailored to their skill levels, long-term skill development, personalities, and work styles.
Work closely with a dedicated recruiting staff to drive recruiting. This will include sourcing candidates, interviewing candidates, organizing Dropbox participation in conferences/events, and onboarding new employees.
Balance the need to "keep things running" with allocating time to long-term, high-impact projects.
Assess employee performance frequently by providing feedback on an ongoing basis, address under-performance, and recognize excellent performance.

Requirements

BS or MS in Computer Science, Engineering, or a related technical discipline, or equivalent experience
At least three years of direct management experience in a technology company
Previous experience with hiring and performance management, including working with under-performers
Sound knowledge of Linux and TCP/IP networks
Ability to code well in at least one language
Above-average knowledge of basic large-scale internet service architectures (such as load balancing, LAMP, CDNs)
Good understanding of data durability (think backups, maximum time to recovery, and generally how to avoid losing data at all costs)
Good communication skills
Lastly, a very healthy understanding of what “We not I” means