Infrastructure Monitoring
4 weeks ago
About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.
As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.
System Maintenance and Performance Optimization
• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.
Networking and Infrastructure Support
• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.
Security, Automation, and Monitoring
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
Troubleshooting and Client Support
• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.
• Support the ongoing development of internal HPC test environments and customer POCs.
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.
-
Infrastructure Monitoring Engineer
1 week ago
Manchester, United Kingdom GBV Full time £35,000 - £55,000 per year Job Description My client in Greater Manchester has a contract requirement for an Infrastructure Monitoring Engineer. They are looking for someone with the ability to monitor multiple infrastructure monitoring systems. Are you someone who can help develop the operational procedures for triage and escalation of issues? If so, we are looking for a contract...
-
Infrastructure SCE
1 day ago
United Kingdom Ubique Systems Full time Infrastructure SCE with deep expertise in Google Cloud Platform (GCP) networking, specializing in designing and operating secure, scalable, and high-performance cloud network architectures. Proven experience with VPC networks, subnets, Cloud Router, Cloud NAT, Shared VPC, VPC Peering, Private Service Connect, and Cloud Interconnect. Skilled in implementing...
-
HPC Infrastructure and Support Engineer
4 weeks ago
United Kingdom asobbi Full time About the Company A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep...
-
Infrastructure Engineer
1 week ago
PT; MT; BG, United Kingdom Acuity Analytics Full time £60,000 - £80,000 per year The creative mind behind every project. Put your skills to the test to build solutions that continue to shape the world we live in. About Us Ascent has recently been acquired by Acuity Analytics. This is both a significant milestone for us and a tremendous opportunity for you. Acuity Analytics is a business with a strong global...
-
Consultant Infrastructure Developer
1 week ago
Manchester, United Kingdom ThoughtWorks Full time £45,000 - £90,000 per year Infrastructure Developers take a multifaceted approach to helping clients achieve technical excellence by assessing challenges from both a technical and operational perspective. As consummate 'bringers of knowledge,' they take extra care to ensure their team and client understand operational requirements and take a shared responsibility for designing and...
-
Senior Infrastructure Engineer
1 week ago
United Kingdom Remote Aircall Full time £60,000 - £100,000 per year Aircall is a unicorn AI-powered customer communications platform used by 22,000+ companies worldwide to drive revenue, faster resolutions, and scale. We're redefining what a customer communications platform can be by combining voice, SMS, WhatsApp, and AI into one seamless workspace. Our momentum comes from a simple but powerful idea: help every...
-
Infrastructure Engineer
1 week ago
Salford, Lancashire, United Kingdom Langland Consultants Full time £30,000 - £50,000 per year Job Description Infrastructure Engineer / Analyst, to £40k + bens. Networking, Switches, NetApp, Cisco UCS, Firewalls, VMWare, Hyper-V, SAN, Storage. 2nd / 3rd Line Infrastructure Engineer / Analyst: Do you have a background in supporting major Infrastructure technology – around Networking, Storage and VMWare? Do you have a broad technical skillset but...
-
Senior Infrastructure Engineer
4 weeks ago
United Kingdom Hamilton Barnes 🌳 Full time Senior Infrastructure Engineer | Azure, Entra ID, PowerShell/Terraform | £60,000 | Remote (UK) We’re working with one of the UK’s most respected managed service providers, a business known for its technical depth, collaborative culture, and commitment to cloud innovation. They’re looking for a hands-on Senior Microsoft Engineer to own and evolve...
-
Lead Infrastructure Architect
2 weeks ago
Remote, United Kingdom Tandem Bank Full time £60,000 - £75,000 per year Job Title: Lead Infrastructure Architect Working Pattern: Monday to Friday, 36.25 hours per week, with participation in a 24/7 on-call rota as required. Salary: £60,000 - £75,000 and up to 20% bonus and benefits Location: Remote but commutable into London per business requirements. Shape the Future of Our Digital Infrastructure At Tandem, we're not...
-
IT Infrastructure Manager
1 week ago
Nottingham, Hybrid ( in out), United Kingdom EMBS Digital Full time £60,000 - £80,000 per year Role Checklist: Location: Nottingham / Hybrid 3 days in 2 out Salary: £60,000 + benefits Grade: SLT Business: MSP This brand is well known for its exceptional customer service and its innovative approach to delivering managed IT services into key verticals. This organisation is growing steadily, challenging the bigger MSPs and winning. As Infrastructure...