Lead Site Reliability Engineer

Posted 5 hours 32 minutes ago by Cancer Research UK

£100,000 - £125,000 Annual

Permanent

Full Time

Other

Warwickshire, Stratford-upon-Avon, United Kingdom, CV370

Job Description

Modern tech-stack. Hybrid infrastructure. Reliability for 4,000+ users.

Lead Site Reliability Engineer

£64,000 - £74,000 (+ )
Grade: P3MP
Reports to: Senior Manager, Platform Engineering
Contract: Permanent
Hours: Full time 35 hours per week
Location: Stratford, London. Office-based with high flexibility (1-2 days per week in the office)
Visa sponsorship: Cancer Research UK can consider visa sponsorship for this vacancy. If this applies to you, please ensure that this is clearly marked on your application.
Closing date: 16 February :55

This vacancy may close earlier if a high volume of applications is received or once a suitable candidate is found, so we strongly recommend that you apply early to avoid disappointment. If you require more time to apply as part of a reasonable adjustment, please contact as soon as possible.

Recruitment process: Telephone interview followed by two competency-based interviews
Interview date: From the week commencing 23 February 2026

We operate an anonymised shortlisting process in our commitment to equality, diversity, and inclusion. CVs are required for all applications, but we won't be able to view them until we invite you for an interview. Instead, we ask you to fully complete the work history section of the online application form so that we can assess you quickly, fairly, and objectively.

At Cancer Research UK, we exist to beat cancer. We are professionals with purpose, beating cancer every day. But we need to go much further and much faster. That's why we're looking for someone talented, someone who wants to develop their skills, someone like you.

Cancer Research UK has an ambitious Engineering Strategy supported by a modern, complex hybrid infrastructure spanning on-premise and multi-cloud environments.

As a Lead Site Reliability Engineer, you'll play a vital role in shaping and advancing SRE practices across the charity.
You'll lead incident response, drive automation to reduce operational toil, and act as the escalation point for complex production issues. You'll define meaningful Service Level Objectives, strengthen observability, and help foster a blameless, learning-focused culture that continually improves reliability.

You'll also lead and develop a team of Site Reliability Engineers, balancing day-to-day operational needs with engineering work that delivers long-term improvements. Working closely with development teams and Platform Engineering colleagues, you'll embed SRE principles across our services, coaching engineers and influencing technical direction to ensure reliability is built in from the start.

If you're an SRE leader who has strengthened large-scale production systems across complex on-premise and AWS environments, and you're passionate about developing and leading teams to drive meaningful change, we would love for you to join our mission.

Ensuring the reliability, availability, and performance of Cancer Research UK's production services across AWS, on-premise, and data centre environments. This includes:
+ Defining and monitoring Service Level Objectives (SLOs), error budgets, and reliability metrics.
+ Reducing incidents and operational toil through automation, engineering improvements, and continuous optimisation.

Leading incident response, promoting a blameless culture, coordinating cross-team response, and ensuring post-mortems and follow-up actions drive long-term improvement.

Building and maintaining comprehensive monitoring, logging, alerting, and tracing capabilities.
+ Creating tools and dashboards that give teams clear visibility into system health, performance, and reliability and help them proactively identify issues.

Collaborating closely with development teams, architects, and Platform Engineering colleagues to embed reliability, observability, and operability into service design.
Advising on scalability, performance, capacity planning, and production readiness at scale.

Driving automation and toil reduction through infrastructure as code, robust CI/CD pipelines, self-service tooling, and the removal of manual operational tasks.

Collaborating with the Head of Platform Engineering and peers to shape SRE strategy and practices across the organisation.

Championing the adoption of SRE principles (including SLOs, error budgets, capacity planning, and the balance between reliability work and feature development).

Using modern platform approaches (IaaS, PaaS, FaaS, containers, serverless) to balance reliability, agility, and cost effectiveness.

Producing and maintaining high-quality documentation, ensuring production systems can be understood, debugged, and operated by the team, and promoting knowledge sharing.

Defining and championing best practices for reliability, observability, incident management, and operational excellence across the organisation.

Line Management:

Line-managing and leading the SRE team (c.5 direct reports), coaching them to develop their skills and careers.

Creating an inclusive and high-performing team culture that recognises success and retains talent within the team and wider function.

Setting clear objectives and KPIs for the team.

Balancing operational demands with engineering work to ensure the team can invest in automation, reliability improvements, and skills development.

Mentoring engineers across Platform Engineering and development teams to strengthen operational capability and adopt SRE best practices.

Supporting self-service initiatives while ensuring strong governance around reliability, security, and cost management.

Proven experience as a Lead Site Reliability Engineer, operating and improving large-scale production systems across complex on-premise and AWS cloud environments.
+ This includes troubleshooting performance issues, managing incidents, conducting post-mortems, and implementing lasting solutions that prevent recurrence.

Expert in SRE best practices with strong AWS experience and a proven record of improving reliability and reducing toil through engineering solutions across networking, storage, databases, and platform services.

Experience automating operational tasks and delivering self-service capabilities using infrastructure as code and CI/CD tooling (e.g., Terraform, AWS CDK, Ansible, CloudFormation, GitHub Actions, GitLab CI).

Effective troubleshooting and debugging of Linux/Unix systems using Python, in line with security best practices.

Strong observability experience (including Prometheus, Grafana, ELK/Splunk, Datadog, or CloudWatch), with the ability to design effective monitoring, alerting, and dashboards.

Proficiency with containerisation and orchestration (Docker, Kubernetes, ECS/Fargate) and a solid understanding of microservices, distributed systems, and service mesh technologies.

Background in leading engineering teams, with strong management and coaching skills and the ability to drive change and guide people through ambiguity and evolving business needs.
A record of building credible and collaborative relationships with technical and non-technical stakeholders, with the ability to explain complex technical issues, balance competing priorities, and influence technical decisions.

Our organisation values are designed to guide all that we do.

Bold: Act with ambition, courage and determination
Credible: Act with rigour and professionalism
Human: Act to have a positive impact on people
Together: Act inclusively and collaboratively

We're looking for people who can believe in and embody these organisation values and can use them to drive forward progress against our mission to beat cancer.

If you're interested in applying and excited about working with us but are unsure if you have the right skills and experience, we'd still love to hear from you.

We create a working environment that supports your wellbeing and provide a generous benefits package,
