Team Lead - Site Reliability Engineering

Posted 1 hour 15 minutes ago by Arbuthnot Latham

Permanent

Full Time

Other

London, United Kingdom

Job Description

Arbuthnot Latham has been associated with banking since 1833. We combine private and commercial banking, wealth planning and investment management. We believe in traditional relationship and service-led banking powered by modern technology.

Job purpose

The Team Lead - Site Reliability Engineering is responsible for ensuring the effective and efficient running of the current NOC team with a view to transition to an SRE function over time.

The team is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. It collaborates with the business or product owners to prioritise operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) to monitor and optimise the customer journey and experience. Its goal is to design and operate scalable, resilient systems using software engineering principles. It brings skills and expertise in automating manual tasks (TOIL) in areas such as incident management, problem management, change management and release management, provides operational insights through monitoring and observability, and supports other aspects of preparing and optimising automated delivery solutions.

To place the interests of customers at the centre of all activities, to act in a way that is consistent with achieving good outcomes for consumers, and to comply with the FCA and PRA's Conduct Rules.

Key Responsibilities:

Lead, manage and motivate the team.
Ensure the team are following best practice across all disciplines.
Have oversight of team tasks including investigation, troubleshooting, diagnosis, resolution and recovery to minimise impact to services.
Audit the Engineers' calls and tickets for quality assurance and provide feedback and coaching as required.
Drive a culture of Customer Excellence and Continual Service Improvement within the team.
Identify, develop, communicate, and implement process changes within the team.
Act as a point of escalation for the team.

SRE responsibilities:

Help define the SRE practice for the organisation: collaborate with other stakeholders to select the relevant SRE principles and define the objectives and the measurements of outcomes.
Collaborate with stakeholders, such as product and platform owners, to define service-level objectives (SLOs) and service-level indicators (SLIs) for system operations, focused on the critical features of the customer's journey and experience.
Track and manage reliability performance against agreed SLOs, in partnership with other IT teams or other stakeholders, and ensure systems continue to meet SLOs over time.
Ensure key stakeholders, product owners, and platform owners are informed of reliability concerns and their potential impact to the customer experience.
Provide expert knowledge on reliability approaches, to ensure our organisation achieves its goals and roadmap for reliability.
Champion reliability being treated as a feature in products and platforms and promote the concept across all phases of the software development life cycle.
Create dashboards and reports to communicate key metrics, to product owners and key stakeholders.
Design, code, test and deliver solutions to automate manual operation (i.e., "TOIL").
Participate in operations support and on-call rotation shifts, for SRE supported systems and products.
Participate in or lead problem management activities, including post-mortem incident analysis, and provision of technical insight, documented findings, outcomes and recommendations as part of a root cause analysis to troubleshoot priority incidents.
Implement automation to reduce the probability and/or impact of problems recurring. Possible options include automated incident response, enhanced monitoring, observability initiatives, and automation of change and release management.
Identify, evaluate, and recommend monitoring and observability tools and diagnostic techniques to improve system observability and insights, including identification of requirements, nonfunctional requirements, design, implementation and operationalisation.
Participate in system design, platform management, capacity planning at launch reviews and sprint planning sessions, or product and platform architecture discussions. Ensure all operational requirements including availability, performance and disaster recovery are met.
Collaborate and share lessons learned regarding reliability, performance and incidents with all stakeholders.
Participate and exert influence in organisational learning initiatives such as communities of practice to share knowledge and foster a continuous learning and improvement mindset.
Support architects working on new solutions, including analysing requirements, supporting technical architecture activities, prototyping, designing and developing reusable infrastructure artifacts, testing, implementing, and preparing for ongoing support.
Train and mentor the team to ensure SRE best practices evolve and scale successfully in the organisation.
Shift working pattern - there is a requirement to work shifts and on-call hours.

Risk:

Responsible for managing risks inherent to the role by diligently observing internal policies and procedures.
Act as a point of escalation for customers and internal stakeholders as required.

Key Interfaces:

IT Infrastructure team
Heads of departments
Vertical teams
Change Management team
All business areas across the Group
3rd party suppliers
IT Service Desk

Person Specification

Knowledge/Experience/Skills:

Line management/team leader experience
Understanding of software engineering principles (source control, versioning, code reviews, etc.)
Working in an environment that complies with ISO27001, NIST, CIS Benchmarks and PCI DSS, amongst others
Leading root cause analysis and blameless postmortems in complex environments
Experience of communicating complex issues to senior stakeholders and technical teams.
Implementation of highly available and reliable systems, using multi-AZ and multiregional approaches
Expertise with monitoring and observability tools (e.g. SolarWinds, Datadog, Azure/AWS native tools)
Expertise with SLI/SLO management tools (e.g. ServiceNow)
Expertise with incident ticketing and change management systems (e.g. ServiceNow, Ivanti)
Expertise with automated incident response tools (e.g. PagerDuty, ServiceNow)
Expertise with software development frameworks/languages (e.g., Java, PHP, Python, PowerShell)
Extensive knowledge of cloud ecosystems (e.g. AWS, Red Hat OpenShift, Oracle Cloud Infrastructure, Microsoft Azure)
Knowledge of DevOps tools, such as CI/CD tools (e.g., Azure DevOps, GitHub, GitLab, Jira, Harness, Jenkins)
Knowledge of infrastructure-as-code approaches, role-specific automation tools and associated programming languages (e.g., AWS CloudFormation, Azure ARM, HashiCorp Terraform, Progress Chef, Perforce Puppet)
Knowledge of Orchestration tools (e.g., Cloudify, env0, Morpheus Data, Pliant, RackN, Scalr, Spacelift, Terraform for Cloud) desirable
Cloud provider services (e.g., AWS, Azure, Oracle, regional providers)
Operating systems (e.g., Windows and Linux, including scripting experience)
Knowledge of scalable architectures, including APIs, microservices and PaaS desirable
Knowledge of architecting for resilience (e.g., HA, multi-AZ, multiregional, backup and recovery tools) desirable

Qualifications:

Bachelor's or master's degree in computer science, information systems or a related field, or equivalent work experience
SRE foundation course completed, and qualification gained
Automation provider certifications
Team Working
Influencing Others
Performance Focus
Change Focus
Working Proactively
Problem Solving and Judgement

About Us Life, Work and Benefits

Arbuthnot Latham is committed to equal opportunities for all staff and candidates. We embrace inclusion & diversity and understand why they are critical for the success of our business and people.

Agile working (3 days in the London office per week)
Competitive salary, pension & holiday allowance
BUPA Health cover
4x Life Assurance
Discretionary bonus
Market-leading maternity/paternity and menopause policies

Data Privacy and Reasonable adjustments

We take keeping your data secure seriously. For more detail on how we may keep your data, please refer to our Privacy Notice.

Reasonable adjustments: Please let us know of any adjustments or arrangements that you may need to help you apply for this role or that will help you during the recruitment process. If you wish to discuss any particular requirements or concerns you have because of a disability or medical condition, please contact us. Information you provide about any disability or medical condition will remain confidential unless it is necessary to disclose it to other members of staff or outside agencies to ensure the health and safety of yourself and others, or to implement the adjustments you require.
