Site Reliability Engineering Specialist
Posted 7 days 1 hour ago by BT Group
Permanent
Full Time
Other
Staffordshire, Birmingham, United Kingdom, B19 1
Job Description
Location: Snowhill, Birmingham, United Kingdom
Recruiter: Djoice Silva
Hiring Manager: Laura O'Connor (TDKO11 R)

This role will be based in Birmingham - Hybrid working - 3 days in the office.

At BT International, our purpose is to keep the world connected. As part of BT, we build on almost 180 years of innovation and expertise to deliver secure connectivity and digital services to some of the world's leading multinational businesses and organisations. Our customers trust us to safeguard their data, drive their digital transformation and keep their businesses running.

With colleagues on the ground across the world and supporting customers wherever they need to operate, BT International offers a truly global experience. Whether it's about providing cloud connectivity, helping organisations collaborate, or enabling innovation in cybersecurity and digital services, you'll be part of a team that shapes how businesses succeed in a world that is being transformed by AI. If you have the drive and ambition to make an impact on a global stage, BT International is where it happens.

About this role - Site Reliability Engineering Specialist

Global BTI Professionals were established as a progressive step toward the convergence of multiple domains across BT. We pride ourselves on delivering expert third line support to an extensive range of services, ensuring the required levels of availability are maintained. The team are widely recognised for getting things done, while making transformational improvements along the way. We do this by ensuring we have the right people to achieve our high ambitions.

We are seeking a driven technical leader to join the unit. This role will specialise in system administration and server management with a curious approach towards automation. Candidates will be required to leverage their expertise in system administration and use software to transform the way we interact with networks.
A key responsibility will be investigating and learning new technologies, so a desire to learn is as valuable as experience.

What you'll be doing

• Network Delivery: Support the implementation of flawless change into the live network, utilising automation and CI/CD pipelines.
• Network Monitoring: Configure, maintain, and monitor systems and network infrastructure to ensure optimal health, performance, and reliability.
• Automation Tools: Utilise tools such as Ansible to provision and manage infrastructure resources in a scalable and efficient manner.
• Technical Acumen: Apply your understanding of network principles to troubleshoot network faults within our systems, optimise performance and enhance security across our infrastructure.
• Incident Management and Resolution: Be prepared to support a 24/7, 365-day callout rota, providing third line technical resolution covering an extensive range of technologies.
• Customer Focus: Be a technical expert who understands the end-to-end journey of our customers.
• Growth and Development: As a technically talented expert, you should enhance the brand of the team and support those around you to be accountable and perform at their best.

You'll have the following skills & experience

You must have

• Experience in an ISP Environment: Proven experience in a fast-paced ISP setting, managing and troubleshooting large-scale networks.
• Sysadmin/Server Management: Strong skills in system administration, server management, and compute resources, with experience in deploying and managing containerised applications using orchestration tools such as Kubernetes.
• Technical Proficiency: Strong understanding of network architecture, design, and implementation.
• Monitoring and Logging Solutions: Familiarity with monitoring and logging solutions such as Elasticsearch, Apache Kafka, and Prometheus.
• Programming Proficiency: Proficiency in at least one programming language, such as Python, Ansible or Go.
• Growth Mindset: Self-driven attitude towards learning new skills and aiding the development of others.

Desired

• Network Fundamentals: In-depth knowledge of network protocols, including BGP, IS-IS, and others.
• Vendor Hardware Expertise: Hands-on experience with hardware from leading vendors such as Cisco, Juniper, and Nokia.
• Continuous Application Deployments: Experience deploying new developments to live networks across a variety of infrastructure, including Virtual Machines, Routers, Switches, and Servers.

Benefits

• Competitive salary
• 10% on target bonus
• BT Pension scheme, minimum 5% Employee contribution, BT contribution 10%
• 25 days annual leave (not including bank holidays), increasing with service
• Huge range of flexible benefits including cycle to work, healthcare, season ticket loan
• World-class training and development opportunities
• Option to join BT Shares Saving schemes
• Discounted broadband, mobile and TV packages
• Access to 100s of retail discounts including the BT shop
• On call allowances and overtime

Why this job matters

The Site Reliability Engineering Specialist independently executes activities that help ensure BT is in the best position to deliver the service performance, reliability and availability that internal and external customers expect, by enabling cross-team engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.

What you'll be doing

1. Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines (continuous integration/continuous delivery pipelines) whilst executing best practices with a focus on the re-use of application code; demonstrates consistent software delivery practices and produces continuous integration/continuous delivery platform solutions using Amazon Web Services cloud, infrastructure as code (IaC), GitOps, and container technologies
2. Coordinates a diverse team and creates the initial test schedule to deliver all aspects of testing to time, budget and quality targets, producing outlines of solutions and defining the depth of testing required
3. Executes the implementation of automation technologies to ensure repeatability, eliminating toil, reducing mean time to detection and resolution and repair services
4. Proactively identifies and manages risk through regular assessment and diligent execution of controls and mitigations, proactively raising any concerns
5. Leads scale testing to measure, tune and optimise system performance
6. Executes metric/monitoring analysis that creates stability, security, and performance improvements
7. Designs, analyses, develops and troubleshoots highly-distributed large-scale production systems spanning on-prem and cloud-based hosting
8. Executes approaches that scale systems sustainably through mechanisms like automation and evolves systems by pushing for changes that improve reliability and velocity
9. Writes and delivers infrastructure as code software to improve the availability, scalability, latency, and efficiency of services
10. Implements robust monitoring and alerting systems and performs root cause analysis and post-mortems with an eye towards future prevention
12. Executes retrospective and preventive actions after each high severity production incident
13. Analyses complex systems from a reliability and resilience perspective and identifies sources of instability in distributed systems
14. Champions, continuously develops and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards
15. Mentors other site reliability engineers, helping to improve the team's abilities by acting as a technical resource

The skills you'll need

• Troubleshooting
• Infrastructure Configuration
• Service Assurance
• Application Performance Monitoring