Reliability Engineer
1 week ago
Where You Come In
You will operate at the jagged edge where software meets hardware. Standard cloud providers abstract away the complexity; we embrace it. You will be responsible for maximizing efficiency from our heterogeneous fleet of NVIDIA and AMD accelerators. This role is about precision, performance, and the relentless pursuit of system optimization in a multi-vendor supercomputing environment.
What You Will Build
The Bare Metal Stack: Manage and optimize the lifecycle of bare-metal servers, ensuring that our OS, drivers, and firmware are tuned for peak AI performance.
High-Throughput Interconnects: Engineer the software configurations for our InfiniBand and RoCE fabrics, solving the intricate data movement challenges that define modern distributed training.
Performance Diagnostics: Build the tooling to visualize what is happening inside the cluster, turning opaque hardware counters into actionable signals for debugging latency and throughput.
The Profile We Are Looking For
Low-Level Fluency: You are not afraid of the kernel. You understand interrupts, memory management, and how the OS interacts with peripheral devices.
Hardware Curiosity: You understand that software doesn't run in a vacuum. You are interested in the physical constraints of GPUs, networking cards, and storage subsystems.
First-Principles Reasoning: When a system behaves unexpectedly, you don't just restart it; you investigate the physics of the failure to ensure it is solved permanently.
-
Reliability Engineer
4 days ago
London, Greater London, United Kingdom JLL Full time
JLL empowers you to shape a brighter way. Our people at JLL and JLL Technologies are shaping the future of real estate for a better world by combining world class services, advisory and technology for our clients. We are committed to hiring the best, most talented people and empowering them to thrive, grow meaningful careers and to find a place where...
-
Site Reliability Engineer
2 weeks ago
London, Greater London, United Kingdom eMFusion Global Full time £60,000 - £120,000 per year
Job Opportunity: Freelance Site Reliability Engineer (Outside IR35) £ | Remote (UK-Based) | Occasional travel to Farnborough or Hammersmith
Contract until 2026
We're hiring two hands-on Site Reliability Engineers (SREs) to join a fast-moving platform team on a long-term contract. This role is ideal for engineers with strong coding skills who are comfortable...
-
Site Reliability Engineer
2 weeks ago
London, Greater London, United Kingdom La Fosse Full time £6,600 - £66,200 per year
Contract Opportunity: Site Reliability Engineer (Azure & AWS)
Location: UK (Hybrid/Remote)
Rate: £550/day (Inside IR35)
Contract Length: 12 Months Initially
The client is looking for a highly skilled Site Reliability Engineer (SRE) with deep experience across Azure and AWS to take a lead role in migrating an existing on-prem HPC solution into the Cloud. You'll be...
-
Product Reliability Engineer
1 week ago
London, Greater London, United Kingdom Pinpoint Full time
Product Reliability Engineer
Department: Engineering
Employment Type: Full Time
Location: Remote
Reporting To: VP of Engineering
Description
Hi, I'm Dom, VP of Engineering at Pinpoint. We're a high-growth HR tech startup building and selling software that helps in-house recruitment teams attract, hire, and onboard the right talent. Today, we have a strong...
-
Site Reliability Engineer
4 days ago
London, Greater London, United Kingdom Spait Infotech Private Limited Full time
Job Description: Site Reliability Engineer (Remote, UK, Permanent)
Job Title: Site Reliability Engineer (SRE)
Location: Remote (United Kingdom)
Experience: 0-10 years
Employment Type: Full-time, Permanent
Eligibility: Must be eligible to work full-time in the UK.
Key Responsibilities
Maintain and improve availability, performance, and reliability of production and...
-
Site Reliability Engineer
4 days ago
London, Greater London, United Kingdom -ea3a-4317-8f52-46b52766e55f Full time
Join us in redefining the creator economy with AI
Fanvue is the fastest-growing creator monetisation platform in the creator economy. We are the leading AI-powered creator-first platform, designed to empower creators worldwide to directly monetise their audience. We're on a mission to redefine the creator economy by empowering creators to connect, share, and...
-
Reliability Engineer
2 weeks ago
London, Greater London, United Kingdom Digital Realty Global Full time £60,000 - £120,000 per year
Description
Your role
The Engineer will provide a range of support which may include technical difficulties, working with vendors to overcome intrinsic issues, working with site operations teams to improve usage and efficiency aspects, and identifying any areas for improvement. This may include site specific improvements, region wide improvement programmes,...
-
Site Reliability Engineer
1 week ago
London, Greater London, United Kingdom Group Full time £40,000 - £80,000 per year
Site Reliability Engineer - UK
Optum is a global organisation that delivers care, aided by technology to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture...
-
Site Reliability Engineer
1 week ago
London, Greater London, United Kingdom Ditto Full time £60,000 - £120,000 per year
About Ditto:
Ditto is redefining how data moves at the edge. Our mission is to make it seamless for developers to build resilient, real-time applications, regardless of network conditions. Whether you're in a stadium, airplane, or remote military base, Ditto's peer-to-peer sync engine ensures devices stay connected and data stays consistent, even without...
-
Product Reliability Engineer
1 week ago
London, Greater London, United Kingdom Pinpoint Full time
Description
Hi, I'm Dom, VP of Engineering at Pinpoint. We're a high-growth HR tech startup building and selling software that helps in-house recruitment teams attract, hire, and onboard the right talent. Today, we have a strong foundation in place: a mature product, rapid growth, strong product-market fit, and happy customers. We're scaling fast - more...