Infrastructure Monitoring

4 weeks ago


United Kingdom asobbi Full time

About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

System Maintenance and Performance Optimization
• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support
• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.

Troubleshooting and Client Support
• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

• Support the ongoing development of internal HPC test environments and customer POCs.
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.



  • Manchester, United Kingdom GBV Full time £35,000 - £55,000 per year

    Job Description My client in Greater Manchester has a contract requirement for an Infrastructure Monitoring Engineer. They are looking for someone with the ability to monitor multiple infrastructure monitoring systems. Are you someone who can help develop the operational procedures for triage and escalation of issues? If so, we are looking for a contract...

  • Infrastructure SCE

    1 day ago


    United Kingdom Ubique Systems Full time

    Infrastructure SCE with deep expertise in Google Cloud Platform (GCP) networking, specializing in designing and operating secure, scalable, and high-performance cloud network architectures. Proven experience with VPC networks, subnets, Cloud Router, Cloud NAT, Shared VPC, VPC Peering, Private Service Connect, and Cloud Interconnect. Skilled in implementing...


  • PT; MT; BG, United Kingdom Acuity Analytics Full time £60,000 - £80,000 per year

    The creative mind behind every project. Put your skills to the test to build solutions that continue to shape the world we live in. About Us: Ascent has recently been acquired by Acuity Analytics. This is both a significant milestone for us and a tremendous opportunity for you. Acuity Analytics is a business with a strong global...


  • Manchester, United Kingdom ThoughtWorks Full time £45,000 - £90,000 per year

    Infrastructure Developers take a multifaceted approach to helping clients achieve technical excellence by assessing challenges from both a technical and operational perspective. As consummate 'bringers of knowledge,' they take extra care to ensure their team and client understand operational requirements and take a shared responsibility for designing and...


  • United Kingdom Remote Aircall Full time £60,000 - £100,000 per year

    Aircall is a unicorn AI-powered customer communications platform used by 22,000+ companies worldwide to drive revenue, faster resolutions, and scale. We're redefining what a customer communications platform can be—by combining voice, SMS, WhatsApp, and AI into one seamless workspace. Our momentum comes from a simple but powerful idea: help every...


  • Salford, Lancashire, United Kingdom Langland Consultants Full time £30,000 - £50,000 per year

    Job Description Infrastructure Engineer / Analyst, to £40k + benefits. Networking, Switches, NetApp, Cisco UCS, Firewalls, VMware, Hyper-V, SAN, Storage. 2nd / 3rd Line Infrastructure Engineer / Analyst: Do you have a background in supporting major infrastructure technology around networking, storage and VMware? Do you have a broad technical skillset but...


  • United Kingdom Hamilton Barnes 🌳 Full time

    Senior Infrastructure Engineer | Azure, Entra ID, PowerShell/Terraform | £60,000 | Remote (UK) We’re working with one of the UK’s most respected managed service providers — a business known for its technical depth, collaborative culture, and commitment to cloud innovation. They’re looking for a hands-on Senior Microsoft Engineer to own and evolve...


  • Remote, United Kingdom Tandem Bank Full time £60,000 - £75,000 per year

    Job Title: Lead Infrastructure Architect. Working Pattern: Monday to Friday, 36.25 hours per week, with participation in a 24/7 on-call rota as required. Salary: £60,000 - £75,000 and up to 20% bonus and benefits. Location: Remote but commutable into London per business requirements. Shape the Future of Our Digital Infrastructure. At Tandem, we're not...


  • Nottingham, Hybrid (3 in, 2 out), United Kingdom EMBS Digital Full time £60,000 - £80,000 per year

    Role Checklist: Location: Nottingham / Hybrid, 3 days in, 2 out. Salary: £60,000 + benefits. Grade: SLT. Business: MSP. This brand is well known for its exceptional customer service and its innovative approach to delivering managed IT services into key verticals. This organisation is growing steadily, challenging the bigger MSPs and winning. As Infrastructure...