Lead HPC & AI Infrastructure Engineer

Posted 2 hours 24 minutes ago by Hays Specialist Recruitment

£130,000 Annual

Permanent

Dorset, United Kingdom

Job Description

Your new company
Step into the future of computing with a trailblazing organisation at the intersection of AI innovation and High Performance Computing (HPC). This company is redefining scalable infrastructure, building GPU-optimised environments that power advanced research and enterprise workloads. With a strong commitment to ethical computing and technical excellence, they're shaping the next generation of AI platforms.

Your new role
This is a fully remote, hands-on technical leadership role where you'll architect and deliver large-scale HPC and AI infrastructure from the ground up. You'll be the driving force behind the design, deployment, and optimisation of high-performance clusters, collaborating with internal engineering teams, OEMs, and external suppliers to build robust, scalable systems.

Key responsibilities include:

Designing end-to-end infrastructure solutions across compute, storage, and networking
Producing detailed technical documentation: hardware specs, data centre layouts, cabling, power and cooling
Installing and tuning Linux-based operating systems and configuring SLURM job schedulers
Optimising high-speed networking technologies (InfiniBand, RoCE)
Automating deployments and maintenance using Ansible, Terraform, Bash, and Python
Troubleshooting complex distributed systems and mentoring junior engineers

This is a rare opportunity to lead infrastructure projects that directly support cutting-edge AI research and development. If you thrive in technically challenging environments and enjoy building systems that scale, this role is for you.

What you'll need to succeed

Proven experience designing and scaling large HPC clusters (hundreds to thousands of nodes)
Strong SLURM configuration skills - partitions, priorities, resource management
Advanced Linux administration and performance tuning
Expertise in high-performance networking (InfiniBand, RoCE, RDMA)
Experience with distributed file systems (Lustre, Ceph, WEKA, VAST)
Proficiency in automation and scripting (Ansible, Terraform, Bash, Python)
A solid understanding of monitoring, resilience, and security compliance
Excellent documentation skills and a passion for mentoring and knowledge sharing

Desirable Experience

Containerisation in HPC (Singularity, Docker, Apptainer)
Familiarity with AI/ML workflows, GPU-aware MPI, NVLink
Experience in cloud, academic, or research environments
Vendor hardware validation and data centre planning

What you'll get in return

Share options and long-term incentives
Unlimited holiday policy
100% remote working with flexible hours
A culture of internal promotion and career development
A collaborative, forward-thinking team
Enhanced family-friendly policies
A truly flexible and supportive workplace

What you need to do now
If you're interested in this role, click 'apply now' to forward an up-to-date copy of your CV, or call us now.

Hays Specialist Recruitment Limited acts as an employment agency for permanent recruitment and as an employment business for the supply of temporary workers. By applying for this job you accept the T&Cs, Privacy Policy and Disclaimers, which can be found on our website.
