Highly Available System Reliability Engineer

7 days ago


London, Greater London, United Kingdom xAI Full time
About the Role

We are seeking an experienced Site Reliability Engineer to join our dynamic team in London. The ideal candidate will have a strong background in software engineering and a passion for ensuring high system availability.

The main responsibilities of this role include:

  1. Improving Observability: Design and implement monitoring systems to provide real-time insights into system performance.
  2. Building Reliable Alerts: Develop automated alerting systems to notify teams of potential issues before they impact users.
  3. Enhancing Deployment Process: Collaborate with the team to design and implement efficient deployment processes that minimize downtime.

In terms of requirements, we are looking for someone with:

  1. Expert-level knowledge of at least one compiled programming language (Rust, C++, or Go).
  2. Familiarity with monitoring technologies such as Prometheus, Grafana, and PagerDuty.
  3. Experience with deployment tools like Pulumi or Terraform.
  4. Strong understanding of Kubernetes.

As a member of our team, you can expect a competitive cash-based compensation package, xAI equity, private health insurance, and unlimited time off subject to prior approval. We strive to maintain a dynamic work environment with opportunities for growth and development.

With a focus on large-scale distributed systems, we aim to build high-quality software that solves complex problems. If you're passionate about reliability engineering and excited about the prospect of joining a forward-thinking company, we encourage you to apply.



  • London, Greater London, United Kingdom Google Full time

    Job Description: As a System Reliability Engineer at Google, you will play a critical role in ensuring the reliability and scalability of our systems. You will work closely with cross-functional teams to design, deploy, and operate large-scale systems that are fault-tolerant and highly available. Your expertise will help us build and maintain infrastructure...


  • London, Greater London, United Kingdom Preply Inc. Full time

    Transforming Education with Technology: At Preply Inc., we're redefining the way people learn languages. Our platform connects learners and tutors from around the world, leveraging AI-powered tools and learning materials to create personalized experiences. About the Position: The Cloud Infrastructure Specialist will play a crucial role in building and maintaining...


  • London, Greater London, United Kingdom Shorterm Group Full time

    Role Overview: The Reliability and Availability Specialist will be responsible for ensuring the optimal performance of our Class 345 Crossrail fleet of trains. This role involves high-level electrical and mechanical fault finding, providing technical advice on train systems engineering, and managing warranty issues. If you have a strong background in...


  • London, Greater London, United Kingdom loveholidays Full time

    We are a rapidly growing online travel agency with technology at the heart of our success. In 2022, we sent millions of people on their dream holiday. With a million visitors a day, our 100+ services handle 8k requests per second, while maintaining p95 search latency of 150ms. You will contribute to building reliable, performant, auto-scalable, and highly...


  • London, Greater London, United Kingdom Zensar Technologies Full time

    Zensar Technologies is a leading digital solutions and technology services company that partners with global organizations across industries in their digital transformation journey. We are seeking an experienced Highly Available Platform Engineer to join our team and lead the design and implementation of highly available and scalable infrastructure...


  • London, Greater London, United Kingdom ZipRecruiter Full time

    Job Summary: System Reliability Engineer. We're seeking a highly skilled System Reliability Engineer to join our team as our first dedicated SRE/DevOps hire. This role offers an exciting opportunity to design, implement, and manage our infrastructure, CI/CD pipelines, and production operations from the ground up. You'll have autonomy in shaping our tech stack,...


  • London, Greater London, United Kingdom Randstad Staffing Full time

    Job Description: A highly skilled System Reliability Engineer with expertise in Java is required to join our team. This exciting role will see you play a critical part in ensuring the reliability, availability, and performance of applications or systems built using Java technologies. Key Responsibilities: Application Performance Monitoring & Optimization: Use...


  • London, Greater London, United Kingdom TRIA Full time £60,000 - £70,000

    TRIA is seeking a highly skilled System Reliability Engineer to join our team. Job Description: You will be responsible for designing, building, and maintaining scalable and reliable systems that meet the needs of our business. Develop and implement automation scripts using tools like Ansible or Terraform. Liaise with the Platform team to ensure alignment with...

  • Reliability Engineer

    4 weeks ago


    London, Greater London, United Kingdom loveholidays Full time

    About the Role: We are seeking a highly skilled Reliability Engineer to join our team at LoveHolidays. As a key member of our infrastructure team, you will be responsible for ensuring the reliability and performance of our systems, which handle millions of users and thousands of requests per second. Our runtime architecture is Service Based and hosted on cloud...


  • London, Greater London, United Kingdom Apple Inc. Full time

    At Apple Inc., we're looking for a seasoned Site Reliability Engineering (SRE) manager to join our iCloud Services team. About the Role: We're seeking an accomplished builder and leader of teams with a passion for SRE and a track record of delivering operational perfection at scale. As a key member of our SRE leadership team, you will shape the future of how we...


  • London, Greater London, United Kingdom Search Technology Full time £240,000 - £300,000

    Job Description: We are looking for a talented Site Reliability Engineer to join our team in London. As a Senior Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining high-performing, scalable, and highly available trading systems. About the Team: You will be part of our Global Platform Engineering team, which is...


  • London, Greater London, United Kingdom Laraveldaily Full time

    Reliability Engineering for Cybersecurity: We are seeking a highly skilled Cybersecurity SRE to join our team at Laraveldaily. As a senior member of our infrastructure team, you will be responsible for designing, implementing, and maintaining a reliable and secure cloud-based infrastructure. Your primary focus will be on ensuring the availability, scalability,...


  • London, Greater London, United Kingdom Oxford Knight Full time

    Oxford Knight is a leading player in the financial sector, leveraging advanced technology to inform our investment decisions. We're currently seeking an experienced Highly Available Systems Engineer to join our agile development environments team. This individual will be responsible for ensuring the smooth operation of our applications, from initial...


  • London, Greater London, United Kingdom Arrows Full time

    Arrows seeks a Cloud Reliability Engineer with expertise in modern DevOps practices. The ideal candidate will have extensive experience with Kubernetes, Azure Container Apps, and Azure networking. To ensure the reliability and efficiency of our systems, we require a highly skilled engineer with strong coding and scripting skills. We are looking for an expert who...


  • London, Greater London, United Kingdom Zensar Technologies Full time

    Job Description: As a Site Reliability Engineer at Zensar Technologies, you will be responsible for designing, implementing, and maintaining highly available and scalable infrastructure to support our business applications. You will work closely with cross-functional teams to identify and resolve technical issues, ensuring the smooth operation of our...


  • London, Greater London, United Kingdom Google Full time

    We are seeking an experienced Site Reliability Systems Engineer to join our Site Reliability Engineering team at Google. In this role, you will be responsible for designing, building, and maintaining large-scale distributed systems that support Google's product portfolio. As a Site Reliability Systems Engineer, you will work closely with cross-functional...


  • London, Greater London, United Kingdom AYS System Full time

    Role Overview: We are seeking an experienced Electrical Systems Design Engineer to join our team at AYS System. This is a fantastic opportunity for a motivated and skilled professional to take on new challenges and contribute to the success of our organization. Job Description: The successful candidate will be responsible for designing, developing, and...


  • London, Greater London, United Kingdom ENGINEERINGUK Full time

    Job Summary: The role of a Robotics Systems Engineer in the Reliability and Automation Engineering Team involves working with cross-functional teams to drive the implementation and continuous improvement of world-class maintenance, repair, and supportability solutions for the Amazon Robotics portfolio. You will analyze large-scale data from databases, PLCs,...


  • London, Greater London, United Kingdom Tyk Technologies Full time

    Senior Site Reliability Engineer Job Description: At Tyk Technologies, we're passionate about building software that solves real-world problems. Our Site Reliability Engineers (SREs) play a critical role in empowering users with a rich feature set, high availability, and exceptional performance levels to pursue their missions. About the Role: We're seeking an...


  • London, Greater London, United Kingdom Arcus Search Full time

    Senior Site Reliability Engineer: We are looking for a highly skilled Senior Site Reliability Engineer to join our team at Arcus Search. As a key member of our engineering team, you will be responsible for ensuring the reliability, scalability, and performance of our infrastructure and applications. The estimated salary for this position is $140,000 - $200,000...