System Reliability Engineer

3 weeks ago


London, Greater London, United Kingdom ZipRecruiter Full time
Job Summary: System Reliability Engineer

We're seeking a highly skilled System Reliability Engineer to join our team as our first dedicated SRE/DevOps hire. This role offers an exciting opportunity to design, implement, and manage our infrastructure, CI/CD pipelines, and production operations from the ground up. You'll have autonomy in shaping our tech stack, defining best practices, and building scalable systems that will set the foundation for future engineering growth.

We're offering a competitive salary of $120,000 - $180,000 per year, depending on experience. If you thrive in startup environments and enjoy the blend of software engineering, operations, and infrastructure, we'd love to hear from you.

The ideal candidate has 4+ years of experience in a DevOps, SRE, or related role, with hands-on experience building and maintaining infrastructure. You should be proficient with Infrastructure as Code (IaC) and Kubernetes ecosystem tools such as Terraform and FluxCD, and have solid experience managing containers with Docker and Kubernetes.

Key Responsibilities:
- Set Up and Manage Infrastructure:
  - Design, build, and maintain a robust, cloud-based infrastructure on Azure
  - Develop and maintain infrastructure as code (IaC) using tools like Terraform
  - Own our system's reliability and scalability, laying a strong foundation for our engineering environment

- Deploy and Orchestrate Containers:
  - Use Kubernetes and Docker to manage containerised applications, ensuring high availability, scaling, and resource optimisation
  - Set up and manage Kubernetes clusters to support reliable and scalable infrastructure

- Develop CI/CD Pipelines:
  - Design and implement CI/CD pipelines to automate our build, test, and deployment processes
  - Collaborate with development teams to streamline code integration and ensure high-quality releases across the board

- Implement Monitoring and Incident Management:
  - Set up proactive monitoring, logging, and alerting systems to detect and resolve issues before they impact users
  - Develop and refine incident response protocols and conduct root cause analyses to continuously improve system reliability

Requirements:
- 4+ years of experience in a DevOps, SRE, or related role with hands-on experience building and maintaining infrastructure
- Expertise with a major cloud provider (preferably Azure)
- Proficiency with Infrastructure as Code (IaC) and Kubernetes ecosystem tools, such as Terraform and FluxCD
- Solid experience with Docker and Kubernetes for container management
- Knowledge of CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI, GitHub Actions) and experience setting up automated workflows
- Familiarity with monitoring and logging tools like Prometheus, Grafana, ELK stack, or DataDog
- Strong scripting skills (Python, Bash, or similar) for automation and tooling

Benefits:
- Competitive salary of $120,000 - $180,000 per year
- Opportunities for career growth and professional development
- Collaborative and dynamic work environment


  • London, Greater London, United Kingdom Apple Inc. Full time

    About the Role: We are seeking a highly skilled Reliable Systems Engineer to join our team at Apple Inc. This individual will be responsible for designing and developing scalable and reliable systems that meet the needs of our customers. Responsibilities: Design and develop scalable and reliable systems. Collaborate with development teams to identify system...


  • London, Greater London, United Kingdom Leap29 Full time

    Leap29 is a leading provider of cloud implementation, application development, and managed services, with a strong focus on reliability and efficiency. As a Reliability Systems Engineer, you will be responsible for ensuring the seamless operation of their systems, meeting the high expectations of their clients. Responsibilities: Ensuring system reliability and...


  • London, Greater London, United Kingdom TRIA Full time £60,000 - £70,000

    TRIA is seeking a highly skilled System Reliability Engineer to join our team. Job Description: You will be responsible for designing, building, and maintaining scalable and reliable systems that meet the needs of our business. Develop and implement automation scripts using tools like Ansible or Terraform. Liaise with the Platform team to ensure alignment with...


  • London, Greater London, United Kingdom Google Full time

    Job Description: As a System Reliability Engineer at Google, you will play a critical role in ensuring the reliability and scalability of our systems. You will work closely with cross-functional teams to design, deploy, and operate large-scale systems that are fault-tolerant and highly available. Your expertise will help us build and maintain infrastructure...


  • London, Greater London, United Kingdom Google Full time

    We are seeking an experienced Site Reliability Systems Engineer to join our Site Reliability Engineering team at Google. In this role, you will be responsible for designing, building, and maintaining large-scale distributed systems that support Google's product portfolio. As a Site Reliability Systems Engineer, you will work closely with cross-functional...


  • London, Greater London, United Kingdom AYS System Full time

    Role Overview: We are seeking an experienced Electrical Systems Design Engineer to join our team at AYS System. This is a fantastic opportunity for a motivated and skilled professional to take on new challenges and contribute to the success of our organization. Job Description: The successful candidate will be responsible for designing, developing, and...


  • London, Greater London, United Kingdom Proactive Appointments Full time

    System Reliability Engineer. Estimated Salary: $90,000 - $120,000 per year. About the Job: This System Reliability Engineer position involves overseeing daily operations, ensuring seamless network performance, and participating in disaster recovery and business continuity planning. Key Responsibilities: Prioritize and efficiently handle 'Keep-The-Lights-On' (KTLO)...


  • London, Greater London, United Kingdom AYS System Full time

    Role Summary: We are seeking an experienced Electrical Systems Specialist to join our team at AYS System. The ideal candidate will have a strong background in electrical engineering, with a focus on designing and developing high-voltage control systems. This is an excellent opportunity for a motivated professional to take their career to the next level and work...


  • London, Greater London, United Kingdom Arcus Search Full time

    About Arcus Search: Arcus Search is a leading organization in the financial services industry, based in London. Job Summary: We are seeking a Senior Site Reliability Engineer to join our established technology team. This role will provide an exciting opportunity to work on cutting-edge infrastructure, middleware, and CI/CD systems, driving performance,...


  • London, Greater London, United Kingdom AVT Reliability Ltd Full time

    About AVT Reliability Ltd: We are a leading company in the field of asset integrity and reliability. Our team is passionate about delivering high-quality services to our clients. Job Summary: This is an exciting opportunity for a talented engineering graduate to join our Asset Integrity Division as a specialist. You will be responsible for supporting a diverse...


  • London, Greater London, United Kingdom Bumble Full time

    Job Description: We are seeking a skilled System Reliability Architect to join our team at Bumble Inc. Estimated salary: $120,000 - $180,000 per year. About the Role: As a System Reliability Architect, you will be responsible for ensuring the reliability, scalability, and performance of our software systems. This involves proactively managing, automating, and...


  • London, Greater London, United Kingdom Promote Project Full time

    We are seeking an experienced Reliable System Architect to join our team. As a Reliable System Architect, you will be responsible for designing and implementing a reliable and scalable system architecture that meets the needs of Neon's growing customer base. About the Role: Design and implement a reliable and scalable system architecture that meets the needs of...


  • London, Greater London, United Kingdom EFG Full time

    EFG, a pioneer in the esports world, is looking for an exceptional System Reliability Specialist to enhance their technical capabilities. This position comes with an estimated annual salary of $150,000 - $220,000. About the Role: In this role, you will be responsible for ensuring the stability and performance of EFG's systems, including monitoring,...


  • London, Greater London, United Kingdom The Blackstone Group L.P. Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at The Blackstone Group L.P. in London. The successful candidate will be responsible for improving the reliability of systems and services across the firm, working closely with service owners to design, implement, and manage services for continuous improvements. Key...


  • London, Greater London, United Kingdom Google Full time

    About the Role: We are seeking a highly skilled Cloud Systems Reliability Engineer to join our team at Google. As a key member of our Technical Infrastructure team, you will be responsible for designing, building, and maintaining large-scale, fault-tolerant systems that power our cloud platforms. Your primary focus will be on optimizing existing systems,...


  • London, Greater London, United Kingdom Apple Inc. Full time

    Apple Inc. is seeking an experienced Payment System Reliability Engineer to join our Wallets & Payments team in London. As a key member of our team, you'll be responsible for designing, deploying, and maintaining our payment systems to ensure they meet the highest standards of scalability, reliability, and security. Key Responsibilities: Design and implement...


  • London, Greater London, United Kingdom BAE Systems Full time

    Job Title: We are seeking a highly skilled Reliability Engineer to join our team in Warton. As a Reliability Engineer, you will be responsible for undertaking reliability and maintainability analysis and engineering tasks to influence the design and future sustainment of platforms and systems. Your Key Responsibilities: Undertake reliability and maintainability...


  • London, Greater London, United Kingdom Anson McCade Full time £65,000 - £85,000

    System Reliability Expert. Anson McCade is seeking a seasoned System Reliability Expert to join our team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. About the Job: You will collaborate with engineers to identify and implement improvements in system architecture and...


  • London, Greater London, United Kingdom ENGINEERINGUK Full time

    Job Summary: The role of a Robotics Systems Engineer in the Reliability and Automation Engineering Team involves working with cross-functional teams to drive the implementation and continuous improvement of world-class maintenance, repair, and supportability solutions for Amazon Robotics portfolio. You will analyze large-scale data from databases, PLCs,...


  • London, Greater London, United Kingdom Jump Trading Full time

    Job Summary: We are seeking a highly skilled Reliability Engineer - Trading Systems to join our team at Jump Trading. As a key member of our operations team, you will be responsible for ensuring the smooth operation of our trading systems, providing technical support to our traders and developers, and contributing to the design and implementation of new...