Site Reliability Engineer

4 weeks ago


London, Greater London, United Kingdom | loveholidays | Full time

About us

We are a dynamic and rapidly growing online travel agency that places technology at the heart of our success. With millions of people trusting us for their dream holidays, our focus is on delivering exceptional customer experiences through cutting-edge technology.

We operate at scale, handling 100+ services and 8k requests per second while maintaining a p95 search latency of 150ms. Our observability captures and processes 1TB of logs a day and 350k metric samples a second.

We rely heavily on open source and give back to the community through contributions to public repositories and open sourcing our own projects.

Responsibilities

As our first Site Reliability Engineer, you will contribute to the evolution of SRE practices like incident management, blameless postmortems, SLOs and error budgets. You will also contribute to building reliable, performant, auto-scalable and highly available systems.

  • Apply our existing technology through an SRE lens.
  • Level up SRE practices across teams.
  • Improve reliability KPIs of the platform.
  • Help balance reliability with feature delivery using SLOs and error budgets.

Your responsibility will be to help engineering teams succeed at operations, not to run their services for them.

What you'll be working on

  • Kick-starting our SRE function by evangelising reliability best practices and processes.
  • Exposing slow-running code paths in critical applications using tools like Java Flight Recorder or Go's pprof.
  • Writing tools or modifying existing applications with reliability and performance in mind.
  • Ensuring our systems and their individual components can withstand 10x load by improving our scalability.
  • Shortening mean time to discovery and recovery with improvements to observability and alerting.
  • Exposing system weaknesses with regular monitoring.

Our runtime architecture is service-based and hosted on cloud infrastructure. Our engineering teams provision and manage their services' infrastructure.

We place a strong focus on observability, continually evolving our monitoring and alerting stack. Our service mesh provides uniform observability of all production services at 10s intervals.

Performance and scalability are integral to our software and infrastructure development process, achieved by combining Computer Science fundamentals and cutting-edge cloud technologies.

You should have a good understanding of

  • Software development principles.
  • Performance, scalability.
  • HTTP, web services, REST.
  • Containers, cloud.
  • Testing, reliability, monitoring.
  • Linux.
  • Low-level debugging and troubleshooting.

What we'll give back to you

  • Company pension contributions at 5%.
  • Training budget for you to learn on the job and level yourself up.
  • Discounted holidays for you, your family and friends.
  • 25 days of holiday per annum (plus 8 public holidays), increasing by 1 day for every second year of service, up to a maximum of 30 days per annum.
  • Ability to buy and sell annual leave.
  • Cycle to work scheme, season ticket loan and eye care vouchers.

