Lead HPC & AI Infrastructure Engineer
Posted 2 hours 24 minutes ago by Hays Specialist Recruitment
Your new company
Step into the future of computing with a trailblazing organisation at the intersection of AI innovation and High Performance Computing (HPC). This company is redefining scalable infrastructure, building GPU-optimised environments that power advanced research and enterprise workloads. With a strong commitment to ethical computing and technical excellence, they're shaping the next generation of AI platforms.
Your new role
This is a fully remote, hands-on technical leadership role where you'll architect and deliver large-scale HPC and AI infrastructure from the ground up. You'll be the driving force behind the design, deployment, and optimisation of high-performance clusters - collaborating with internal engineering teams, OEMs, and external suppliers to build robust, scalable systems.
Key responsibilities include:
- Designing end-to-end infrastructure solutions across compute, storage, and networking
- Producing detailed technical documentation: hardware specs, data centre layouts, cabling, power and cooling
- Installing and tuning Linux-based operating systems and configuring SLURM job schedulers
- Optimising high-speed networking technologies (InfiniBand, RoCE)
- Automating deployments and maintenance using Ansible, Terraform, Bash, and Python
- Troubleshooting complex distributed systems and mentoring junior engineers
This is a rare opportunity to lead infrastructure projects that directly support cutting-edge AI research and development. If you thrive in technically challenging environments and enjoy building systems that scale, this role is for you.
What you'll need to succeed
- Proven experience designing and scaling large HPC clusters (hundreds to thousands of nodes)
- Strong SLURM configuration skills - partitions, priorities, resource management
- Advanced Linux administration and performance tuning
- Expertise in high-performance networking (InfiniBand, RoCE, RDMA)
- Experience with distributed file systems (Lustre, Ceph, WEKA, VAST)
- Proficiency in automation and scripting (Ansible, Terraform, Bash, Python)
- A solid understanding of monitoring, resilience, and security compliance
- Excellent documentation skills and a passion for mentoring and knowledge sharing
Desirable experience
- Containerisation in HPC (Singularity, Docker, Apptainer)
- Familiarity with AI/ML workflows, GPU-aware MPI, NVLink
- Experience in cloud, academic, or research environments
- Vendor hardware validation and data centre planning
What you'll get in return
- Share options and long-term incentives
- Unlimited holiday policy
- 100% remote working with flexible hours
- A culture of internal promotion and career development
- A collaborative, forward-thinking team
- Enhanced family-friendly policies
- A truly flexible and supportive workplace
What you need to do now
If you're interested in this role, click 'apply now' to forward an up-to-date copy of your CV, or call us now.
Hays Specialist Recruitment Limited acts as an employment agency for permanent recruitment and as an employment business for the supply of temporary workers. By applying for this job you accept the T&Cs, Privacy Policy and Disclaimers, which can be found on our website.