Lead Site Reliability Engineer

3 weeks ago


London, Greater London, United Kingdom | Department for Work and Pensions | Full time

Position Overview

Are you adept at managing stakeholder relationships effectively?

Do you enjoy diagnosing issues and creating automated solutions to prevent recurrence?

If this resonates with you, we invite you to explore this opportunity.

In the role of Lead Site Reliability Engineer, you will champion the implementation of SRE best practices throughout our cloud infrastructure.

Leveraging both your interpersonal skills and technical knowledge, you will collaborate with various teams to ensure compliance with standards and governance during the onboarding of our services into the cloud, facilitated by a structured assessment process. This will ensure that our applications serving citizens meet all operational and security requirements for production environments.

Your responsibilities will include executing deployments using established runbooks, investigating production incidents, and providing specialized support to teams in identifying root causes.

You will strive to minimize manual work and enhance automation by developing reliable processes, thereby reducing the time and costs associated with repetitive tasks.

Collaboration with development teams will be key, as you will provide guidance on best practices and ensure that application monitoring is effectively implemented.

Successful candidates will be expected to offer on-call support to assist in service restoration, utilizing runbooks or their technical expertise.

This role may require occasional travel to various digital hubs, with frequency discussed during the selection process.

Please note that this position requires passing a security clearance. For further details, refer to the 'Selection process details'.

Role Responsibilities

The SRE team will empower you to collaborate with application teams across the organization in developing reliable and secure solutions for citizens.

You will engage with development teams from the design phase, ensuring adherence to best practices and departmental standards in building application infrastructure.

Key responsibilities include:

  • Providing expert advice and guidance to internal and external stakeholders.
  • Designing and implementing strategies to enhance application reliability, including runbooks and knowledge transfer to the User Experience Command Centre (UXCC), as well as ongoing SRE strategy development.
  • Managing the error budget in collaboration with product owners and ensuring balanced workload alignment.
  • Acting as the primary contact for investigating and resolving major or complex incidents, ensuring the right expertise is available for effective response.
  • Evaluating the impact of change requests in consultation with stakeholders, providing technical insights and authorizing subsequent changes.
  • Overseeing on-call rotations to ensure all applications have adequate out-of-hours SRE coverage.
  • Coaching and mentoring application development and operations engineers in SRE practices and techniques.
  • Conducting retrospectives for high-priority and major incidents, ensuring timely publication of findings.
  • Regularly soliciting feedback and ideas from stakeholders and team members to foster improvement and innovation.
  • Facilitating interdepartmental discussions and meetings with various external organizations, leading community discussions on SRE best practices within Engineering.

Candidate Profile

When detailing your employment history and personal statement, please highlight your experience in relation to the essential criteria below:

  • LEAD CRITERIA: Proficiency in scripting to automate processes and eliminate manual tasks, including infrastructure and configuration as code.
  • Experience in building and enhancing CI/CD pipelines.
  • Track record of resolving complex technical incidents.
  • Expertise in reliability engineering, including capacity and performance management through monitoring, logging, and alerting.
  • Familiarity with orchestration platforms and tools for managing containerized applications.
  • Experience engaging with stakeholders at various levels to provide feedback and support.

An initial assessment may be conducted based on the lead criteria outlined above. Candidates who pass this initial assessment will proceed to a comprehensive evaluation.

Benefits

  • Employer pension contribution of up to a specified percentage.
  • Annual leave entitlement increasing up to 30 days, depending on your working pattern.
  • Flexible working arrangements, including hybrid working, job sharing, term-time working, flexi-time, and compressed hours.
  • Tailored learning and development opportunities, which may include industry-recognized qualifications, coaching, and mentoring.
  • An inclusive and diverse workplace with opportunities to join various staff networks.

Salary Information

Compensation for this role ranges from a specified minimum to a specified maximum.

The maximum salary for this grade is set at a specified amount, with a potential Digital Allowance available for exceptional candidates based on skills and experience.

Offers to successful candidates will be determined based on an assessment of skills and experience demonstrated during the interview process.

Current Civil Servants transitioning to a new role should maintain their existing salary, while those gaining promotion may move to the bottom of the next grade pay scale or receive a specified percentage increase, whichever is greater.


