Team Lead - Site Reliability Engineering

Posted 1 hour 15 minutes ago by Arbuthnot Latham

Permanent
Full Time
Other
London, United Kingdom
Job Description
Arbuthnot Latham has been associated with banking since 1833. We combine private and commercial banking, wealth planning and investment management. We believe in traditional relationship and service-led banking powered by modern technology.

Job purpose

The Team Lead - Site Reliability Engineering is responsible for ensuring the effective and efficient running of the current NOC team, with a view to transitioning it to an SRE function over time.

The team is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. It collaborates with the business or product owners to prioritise operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) to monitor and optimise the customer journey and experience. Its goal is to design and operate scalable, resilient systems using software engineering principles. It brings skills and expertise to automating manual tasks (TOIL) in areas such as incident management, problem management, change management and release management, provides operational insights through monitoring and observability, and supports other aspects of preparing and optimising automated delivery solutions.

To place the interests of customers at the centre of all activities, act in a way that is consistent with achieving good outcomes for consumers, and comply with the FCA and PRA's Conduct Rules.

Key Responsibilities:
  • Lead, manage and motivate the team.
  • Ensure the team are following best practice across all disciplines.
  • Have oversight of team tasks including investigation, troubleshooting, diagnosis, resolution and recovery to minimise impact to services.
  • Audit the Engineers' calls and tickets for quality assurance and provide feedback and coaching as required.
  • Drive a culture of Customer Excellence and Continual Service Improvement within the team.
  • Identify, develop, communicate, and implement process changes within the team.
  • Act as a point of escalation for the team.
SRE responsibilities:
  • Help define the SRE practice for the organisation, collaborate with other stakeholders to select the relevant SRE principles, define the objectives and measurements of the outcomes.
  • Collaborate with stakeholders, such as product and platform owners, to define service-level objectives (SLOs) and service-level indicators (SLIs) for system operations, focused on the critical features of the customer's journey and experience.
  • Track and manage reliability performance against agreed SLOs, in partnership with other IT teams or other stakeholders, and ensure systems continue to meet SLOs over time.
  • Ensure key stakeholders, product owners, and platform owners are informed of reliability concerns and their potential impact to the customer experience.
  • Provide expert knowledge on reliability approaches, to ensure our organisation achieves its goals and roadmap for reliability.
  • Champion reliability being treated as a feature in products and platforms and promote the concept across all phases of the software development life cycle.
  • Create dashboards and reports to communicate key metrics, to product owners and key stakeholders.
  • Design, code, test and deliver solutions to automate manual operation (i.e., "TOIL").
  • Participate in operations support and on-call rotation shifts, for SRE supported systems and products.
  • Participate in or lead problem management activities, including post-mortem incident analysis, and provision of technical insight, documented findings, outcomes and recommendations as part of a root cause analysis to troubleshoot priority incidents.
  • Implement automation to reduce the probability and/or impact of problems recurring. Possible options include automated incident response, enhanced monitoring, observability initiatives, and automation of change and release management.
  • Identify, evaluate, and recommend monitoring and observability tools and diagnostic techniques to improve system observability and insights, including identification of requirements, nonfunctional requirements, design, implementation and operationalisation.
  • Participate in system design, platform management, capacity planning at launch reviews and sprint planning sessions, or product and platform architecture discussions. Ensure all operational requirements including availability, performance and disaster recovery are met.
  • Collaborate and share lessons learned regarding reliability, performance and incidents with all stakeholders.
  • Participate and exert influence in organisational learning initiatives such as communities of practice to share knowledge and foster a continuous learning and improvement mindset.
  • Support architects working on new solutions, including analysing requirements, supporting technical architecture activities, prototyping, designing and developing reusable infrastructure artifacts, testing, implementing, and preparing for ongoing support.
  • Train and mentor the team to ensure SRE best practices evolve and scale successfully in the organisation.
  • Shift working pattern - there is a requirement to work shifts and on call hours.
Risk:
  • Responsible for managing risks inherent to the role by diligently observing internal policies and procedures.
  • Act as a point of escalation for customers and internal stakeholders as required.
Key Interfaces:
  • IT Infrastructure team
  • Heads of departments
  • Vertical teams
  • Change Management team
  • All business areas across the Group
  • 3rd party suppliers
  • IT Service Desk
Person Specification
Knowledge/Experience/Skills:
  • Line management/team leader experience
  • Understanding of software engineering principles (source control, versioning, code reviews, etc.)
  • Working in an environment that complies with ISO 27001, NIST, CIS Benchmarks and PCI DSS, amongst others
  • Leading root cause analysis and blameless postmortems in complex environments
  • Experience of communicating complex issues to senior stakeholders and technical teams.
  • Implementation of highly available and reliable systems, using multi-AZ and multiregional approaches
  • Expertise with monitoring and observability tools (e.g. SolarWinds, Datadog, Azure/AWS native tools)
  • Expertise with SLI/SLO management tools such as ServiceNow
  • Expertise with incident ticketing and change management systems such as ServiceNow and Ivanti
  • Expertise with automated incident response tools such as PagerDuty and ServiceNow
  • Expertise with software development frameworks/languages (e.g., Java, PHP, Python, PowerShell)
  • Extensive knowledge of cloud ecosystems (e.g. AWS, Red Hat OpenShift, Oracle Cloud Infrastructure, Microsoft Azure)
  • Knowledge of DevOps tools, such as CI/CD tools (e.g., Azure DevOps, GitHub, GitLab, Jira, Harness, Jenkins)
  • Knowledge of infrastructure-as-code approaches, role-specific automation tools and associated programming languages (e.g., AWS CloudFormation, Azure ARM, HashiCorp Terraform, Progress Chef, Perforce Puppet)
  • Knowledge of Orchestration tools (e.g., Cloudify, env0, Morpheus Data, Pliant, RackN, Scalr, Spacelift, Terraform for Cloud) desirable
  • Cloud provider services (e.g., AWS, Azure, Oracle, regional providers)
  • Operating systems (e.g., Windows and Linux, including scripting experience)
  • Knowledge of scalable architectures, including APIs, microservices and PaaS desirable
  • Knowledge of architecting for resilience (e.g., HA, multi-AZ, multiregional, backup and recovery tools) desirable
Qualifications:
  • Bachelor's or master's degree in computer science, information systems or a related field, or equivalent work experience
  • SRE foundation course completed, and qualification gained
  • Automation provider certifications
  • Team Working
  • Influencing Others
  • Performance Focus
  • Change Focus
  • Working Proactively
  • Problem Solving and Judgement
About Us Life, Work and Benefits

Arbuthnot Latham is committed to equal opportunities for all staff and candidates. We embrace inclusion and diversity and understand why they are critical for the success of our business and people.
  • Agile working (3 days in the London office per week)
  • Competitive salary, pension & holiday allowance
  • BUPA Health cover
  • 4x Life Assurance
  • Discretionary bonus
  • Market-leading maternity/paternity and menopause policies
Data Privacy and Reasonable adjustments

We take keeping your data secure seriously. For more detail on how we may keep your data, please refer to our Privacy Notice.

Reasonable adjustments: Please let us know of any adjustments or arrangements that you may need to help you apply to this role or that will help you during the recruitment process. If you wish to discuss any particular requirements or concerns you have because of a disability or medical condition, please contact us. Information you provide about any disability or medical condition will remain confidential unless it is necessary to disclose it to other members of staff or outside agencies to ensure the health and safety of yourself and others, or to implement the adjustments you require.