HPC Infrastructure and Support Engineer

4 weeks ago


United Kingdom asobbi Full time

About the Company

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.

Role Overview

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

Key Responsibilities

System Maintenance and Performance Optimization
• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).
• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support
• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring
• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.

Troubleshooting and Client Support
• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

Collaboration and Process Improvement
• Support the ongoing development of internal HPC test environments and customer POCs.
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.



  • United Kingdom Nscale Full time

    About Nscale Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business...


  • United Kingdom Nscale Full time

    About Nscale Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business...


  • United Kingdom Nscale Full time

    About Nscale Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business...


  • United Kingdom Nscale Full time

    About Nscale Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business...


  • United Kingdom Nscale Full time

    About Nscale Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business...


  • Manchester, Lancashire, United Kingdom Hewlett Packard Enterprise Full time £90,000 - £160,000 per year

    HPC & AI Sales Specialist This role has been designed as 'Hybrid' with an expectation that you will work on average 2 days per week from an HPE office. Who We Are: Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever...


  • United Kingdom Telent Full time

    Installation Engineer When you join our Engineering Team at Telent, you'll be empowered to innovate and drive common solutions, working closely with technical experts who are proud of the impact their work makes. Come join a high-performing team doing complex and critical work. Help build and keep the nation's critical infrastructure connected and protected...


  • Manchester, United Kingdom Anaplan Full time £40,000 - £80,000 per year

    At Anaplan, we are a team of innovators focused on optimizing business decision-making through our leading AI-infused scenario planning and analysis platform so our customers can outpace their competition and the market. What unites Anaplanners across teams and geographies is our collective commitment to our customers' success and to our Winning Culture. Our...


  • Salford, Lancashire, United Kingdom Langland Consultants Full time £30,000 - £50,000 per year

    Job Description Infrastructure Engineer / Analyst / to £40k + bens. Networking, Switches, NetApp, Cisco UCS, Firewalls, VMware, Hyper-V, SAN, Storage. 2nd / 3rd Line Infrastructure Engineer / Analyst: Do you have a background in supporting major infrastructure technology – around Networking, Storage and VMware? Do you have a broad technical skillset but...