Senior SRE Engineer

2 weeks ago


London, Greater London, United Kingdom ORI Full time

Join our team as an HPC SRE

  • Manage, optimize, and ensure the reliability of high-performance computing environments
  • Be the go-to expert for all technical aspects of HPC infrastructure
  • Collaborate with cross-functional teams to drive innovations aligning with business objectives
  • Provide 24/7 support to maintain high availability and performance for HPC systems
  • Set up HPC clusters with DGX or HGX platforms and GPUDirect, and optimize the network
  • Configure and troubleshoot routing and switching (R&S) networking hardware from Cisco, Juniper, or other relevant vendors
  • Write, execute, and debug Ansible Playbooks for Cumulus Linux automation
  • Lead investigations into high-priority incidents and prepare Root Cause Analysis
  • Monitor data centre health checks, licensing, and life-cycle management upgrades
  • Utilize observability metrics tools to monitor system health and performance

Performance:

  • Continuously optimize the performance of HPC systems
  • Set and meet clear Service Level Objectives (SLOs) for reliability and performance
  • Define and monitor Service Level Indicators (SLIs) to ensure service quality

Requirements:

  • Bachelor's or Master's degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field
  • 6+ years of proven experience in networking and data centre operations
  • Expertise in networking technologies including network protocols and topologies
  • Background in troubleshooting server hardware/firmware, Linux OS, and scripting
  • Experience with automated configuration management systems
  • Ability to handle high-pressure situations in HPC AI data centres

  • Senior SRE

    2 weeks ago


    London, Greater London, United Kingdom EF Education First Full time

    Job Description: EF is investing big in new software innovation products for the next generation of education experiences. We want to reinvent learning and drive new and engaging ways for students and teachers to get the best out of our platform. We're looking for like-minded individuals who love to grow and solve new and exciting problems. ROLE: We are...

  • Senior SRE

    2 weeks ago


    London, Greater London, United Kingdom StarRez Full time

    StarRez is a leading global proptech company with a strong, differentiated market position focused on transforming the resident experience by providing the engagement solutions and insights critical to successful residential communities. Our team is committed to building software that positively impacts the lives of millions of residents each year. We're a...


  • London, Greater London, United Kingdom Nominet Full time

    Engineering Manager - Site Reliability Engineering. Contract Type: Permanent. Location: London / Hybrid, GB (minimum 20% on-site in our London Shoreditch office). We're proud to be an Equal Opportunity and Affirmative Action Employer, and we're...


  • London, Greater London, United Kingdom Durlston Partners Full time £120,000 - £150,000

    Job Description: Senior SRE - Boutique HFT - Up to £150k + Bonus. Our client is a boutique HFT hiring a Senior SRE to work on their ULL infrastructure. Your role will consist of optimising the firm's core infrastructure to support ULL, 24/7 trading operations. You will spend the majority of the day coding rather than monitoring, and will implement reactive strategies to...

  • Senior Database SRE

    2 weeks ago


    London, Greater London, United Kingdom Sky Group Full time

    We believe in better. And we make it happen. Better content. Better products. And better careers. Working in Tech, Product or Data at Sky is about building the next and the new. From broadband to broadcast, streaming to mobile, SkyQ to Sky Glass, we never stand still. We optimise and innovate. We turn big ideas into the products, content and services...

  • SRE Engineer

    2 weeks ago


    London, Greater London, United Kingdom eFinancialCareers Full time

    TEKsystems is currently engaged with a financial services company to recruit a Site Reliability Engineer who will be responsible for delivering continuous improvement, automation and self-service offerings to operational teams across the company. Primary: Develop software to make infrastructure services self-managing and self-service. Deliver continuous service...

  • DevOps / SRE Lead

    2 weeks ago


    London, Greater London, United Kingdom LinuxRecruit Full time

    We have an opportunity to lead a team of SREs responsible for building a new Kubernetes product. You'll still be hands-on, you'll have a background in AWS, Kubernetes and Terraform, and you'll be able to code in Go. You'll enjoy working with a software engineering mindset, but you'll also enjoy building and maintaining platforms. It's a pure DevOps...


  • London, Greater London, United Kingdom Prism Digital Full time

    Site Reliability Engineer (SRE) | GCP or AWS & Kubernetes | SaaS HealthTech | 100% Remote. After successfully placing several Engineers into the cloud team here, I am now on the lookout for another SRE to join the growing cloud team. If you are passionate about Site Reliability and you are ready for your next challenge, this 'Greenfield' project and the future...


  • London, Greater London, United Kingdom NP Group Full time

    Start Date: ASAP. My client is one of the leading absolute return/hedge fund managers, overseeing assets on behalf of institutional investors from around the world, including pension funds, endowments, insurance companies, government agencies, private banks and fund of funds. At least 5 years professional experience in a DevOps / SRE role. Experience building...

  • SRE / DevOps Lead

    2 weeks ago


    London, Greater London, United Kingdom LinuxRecruit Full time

    Moving jobs can cause apprehension. It can also be worrying to think about who you might end up working with: are they good enough, do they follow the same principles as you, could you share a beer with them in an evening, or will they put your stapler in jelly.... In a unique turn of events, we're looking for two people, a Lead/Manager and a trusted...


  • London, Greater London, United Kingdom eFinancialCareers Full time

    The successful Network Site Reliability Engineer / Network SRE will be based in the heart of Mayfair. Incumbents will not only receive unrivalled compensation packages, including an above-market base salary and excellent annual bonus scheme, but our client also offers flexible working hours, extensive medical benefits for both you and your family, 25+ days...

  • SRE Manager

    2 weeks ago


    London, Greater London, United Kingdom Vodafone Full time

    Location: London OR Newbury + *Hybrid. Salary: Excellent basic salary plus bonus and Vodafone benefits. Working Hours: Full time hours per week – Mon to Fri *Hybrid. At Vodafone UK we believe that through collaboration and connection we can achieve great things. Our hybrid working approach allows our people to work both in the office and at home,...


  • London, Greater London, United Kingdom Experian Health Full time

    We're looking for a highly skilled and motivated Site Reliability Engineer (SRE) to join our Experian Data Quality team. As an SRE, you will be responsible for ensuring the reliability, performance, and scalability of our market leading suite of data management products, with an initial focus on observability to support incident resolution and drive...


  • London, Greater London, United Kingdom HOVER SENIOR LIVING COMMUNITY Full time

    Senior Site Reliability Engineer - Remote. ClickHouse. Published 10 Apr 2024. UK Remote. Role Highlights: Go, SQL, Data Governance, Computer Science, Distributed Systems, SRE, Site Reliability, Security, Operations, Automation, Database. Tools, Libraries and Frameworks: GCP, ClickHouse, AWS, Docker, Terraform, Cisco, Ansible. Description: As...


  • London, Greater London, United Kingdom Lloyds Banking Group Full time £85,255 - £127,300

    DevOps-SRE Lead Engineer at Lloyds Banking Group Location: London based, 2 days per week in the office and the rest from home Salary & Benefits: £85,255 to £127,300 per annum, plus annual personal bonus, 15% employer pension contribution, private medical insurance, 30 days holiday plus bank holidays About us: We are part of the Business and Client...


  • London, Greater London, United Kingdom Lloyds Banking Group Full time

    We support agile working. Hybrid Working. JOB TITLE: Site Reliability Engineer – Homes Platform. LOCATION(S): Halifax or Leeds. HOURS: Full-time. Our work style is hybrid, which involves spending at least two days per week currently, or 40% of our time, at...


  • London, Greater London, United Kingdom Tec Partners Full time

    Job Title: Site Reliability Engineer (Software Dev Background). Type: Permanent. Location: Fully remote. Salary: 55-65K. Our client are growing their team and are looking for a Site Reliability Engineer (ideally from a software development / software engineering background) to contribute to the development and maintenance of our cloud infrastructure, help...


  • London, Greater London, United Kingdom Qurated Network Full time

    Job Description: Site Engineering Manager | Cross-Border Payment Fintech. We are working with the leading cross-border payments provider that went through an IPO last year and is now completing an extensive digital transformation. You will be responsible for keeping their new technology platforms available 24/7/365 by monitoring the Performance, Reliability,...


  • London, Greater London, United Kingdom NearTech Search Full time

    Senior Site Reliability Engineer (GCP, AWS, K8s), UK (remote), £120,000 + bens An extremely well-funded and fast-growing AI-Driven Data company are in need of a new (GCP & AWS) Senior Site Reliability Engineer to join their growing tech team. They have cultivated an extremely innovative culture and working environment where the team are encouraged to...


  • London, Greater London, United Kingdom eFinancialCareers Full time

    Site Reliability Engineers in Market Data at Bloomberg fill the mission-critical role of ensuring our complex, real-time enterprise product is healthy, automated, observable, and designed for reliability. We work at enormous scale - billions of financial ticks are being processed every day - and we ingest, enrich, and deliver it to clients within...