HPC Infrastructure and Support Engineer

3 weeks ago

London, United Kingdom asobbi Full time

About the Company

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.

Role Overview

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

Key Responsibilities

System Maintenance and Performance Optimization

• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).
• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support

• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring

• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.

Troubleshooting and Client Support

• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

Collaboration and Process Improvement

• Support the ongoing development of internal HPC test environments and customer POCs.
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.

HPC Engineer

2 days ago

Greater London, United Kingdom RED SAP Solutions Full time

Overview We are seeking an experienced and highly motivated High-Performance Computing (HPC) Engineer to join our team. The successful candidate will have a proven record of delivering robust HPC services and infrastructure, combined with the ability to work closely with the scientific and research community to optimise computational workflows. The role...
HPC Engineer

2 weeks ago

City Of London, United Kingdom RED SAP Solutions Full time

We are seeking an experienced and highly motivated High-Performance Computing (HPC) Engineer to join our team. The successful candidate will have a proven record of delivering robust HPC services and infrastructure, combined with the ability to work closely with the scientific and research community to optimise computational workflows. The role requires an...
Senior HPC Infrastructure Engineer

2 weeks ago

London, Greater London, United Kingdom Hays Full time £90,000 - £120,000 per year

Your new company Join a pioneering organisation at the forefront of AI and High Performance Computing (HPC) infrastructure. With a strong focus on innovation and ethical computing, this company is building scalable, GPU-optimised environments that support cutting-edge research and enterprise workloads. Your new role This is a fully remote, hands-on technical...
HPC Engineer

2 weeks ago

London, Greater London, United Kingdom RED Global Full time £60,000 - £90,000 per year

We are seeking an experienced and highly motivated High-Performance Computing (HPC) Engineer to join our team. The successful candidate will have a proven record of delivering robust HPC services and infrastructure, combined with the ability to work closely with the scientific and research community to optimise computational workflows. The role requires an...
Senior HPC Engineer

3 days ago

Greater London, United Kingdom Millennium Management Full time

Overview Senior HPC Engineer Millennium's Infrastructure organization is dedicated to designing, engineering, supporting, and managing a robust server estate, systems virtualization, and core enterprise services. We are seeking a Senior HPC Engineer for a hands-on technical leadership position to support Worldquant’s initiative of maintaining financial...
Senior HPC Engineer

1 week ago

London, Greater London, United Kingdom Millennium Full time £90,000 - £1,400,000 per year

Senior HPC Engineer Millennium's Infrastructure organization is dedicated to designing, engineering, supporting, and managing a robust server estate, systems virtualization, and core enterprise services. We are seeking a Senior HPC Engineer for a hands-on technical leadership position to support Worldquant's initiative of maintaining financial research...
HPC Platform Engineer

1 day ago

London Area, United Kingdom Paragon Alpha - Hedge Fund Talent Business Full time £100,000 - £120,000 per year

I'm working with a quant hedge fund, looking for an HPC Platform Engineer to join their infrastructure software team, to partner with quants and engineers to modernise and augment their HPC/Cloud infrastructure and optimise their trading and research tech stack. The infra is backed by HPC, FPGA and the finest silicon hardware, and they want an expert in the...
HPC Software Engineer

5 days ago

London Area, United Kingdom Paragon Alpha - Hedge Fund Talent Business Full time £80,000 - £120,000 per year

I'm working with a quant hedge fund, looking for a C++ Engineer to join their HPC software team, to partner with quants and engineers to modernise and augment their HPC infrastructure and optimise their trading and research tech stack. The infra is backed by HPC, FPGA and the finest silicon hardware, and they want an expert in the field of HPC to join the...
HPC Engineer

1 day ago

London, Greater London, United Kingdom Linux Recruit Full time £45,000 - £52,500 per year

Specialism: Linux Engineering. Job type: Permanent. Location: London. Salary: £45,000 - £52,500 per annum. Join an internationally renowned institute as it establishes a new High Performance Computing function to support world leading research. Your experience across HPC, Storage and GPUs will allow you to contribute to this innovative team building out a hybrid setup to...
Platform Engineer

4 weeks ago

London, United Kingdom Cloud People Full time

Platform Engineer – HPC, AI and ML Up to £80,000 plus benefits Onsite – Kensington, London Company and Role This is an opportunity to join a global technology and AI solutions provider delivering some of the most advanced computing platforms in the world. You will play a leading role in the design, build and long-term support of a next generation AI and...
