Reliability Engineer

3 weeks ago


London, Greater London, United Kingdom xAI Full time

About xAI's Distributed Systems Team


The xAI London team consists of software engineers focused on building high-quality, scalable and reliable distributed systems. We work at various levels of the stack, from build systems and production backend infrastructure to frontend development. We focus on solving complex problems the right way and aren't afraid to delve into technically challenging topics to achieve high-quality software.


About the Role


We're looking for an experienced Site Reliability Engineer to join our dynamic team. The main responsibilities for this role are:



  • Improving our observability by adding or adjusting metrics
  • Building easily parsable dashboards using monitoring technologies such as Prometheus and Grafana
  • Designing and overseeing on-call rotations to ensure high system availability
  • Improving our deployment process to increase system reliability

An ideal candidate should have at least the following qualifications:



  • Expert knowledge of at least one programming language that compiles to machine code, such as Rust, C++ or Go
  • Expert knowledge of monitoring technologies
  • Expert knowledge of deployment technologies, such as Pulumi or Terraform
  • Expert knowledge of Kubernetes

Location


The role is based in our London office near Piccadilly Circus underground station. We work from the office 5 days a week, but allow for work-from-home days when needed. Candidates must be willing to attend late meetings to coordinate with our team in California and participate in semi-regular business trips to California.


Interview Process


After you submit your application, our team reviews your CV and statement of exceptional work. If your application passes this stage, you'll be invited to a 15-minute interview, followed by a series of technical interviews: a coding interview, a monitoring and deployment design interview, a distributed systems design interview, and a presentation on the most difficult technical problems you have solved.


Benefits



  • Competitive cash-based compensation
  • xAI equity
  • Private health and dental insurance
  • Unlimited time off subject to prior approval

