Site Reliability Engineering Lead

1 week ago

London, Greater London, United Kingdom | LexisNexis | Full time | £120,000 - £150,000 per year

Are you passionate about building resilient systems and empowering teams to deliver reliable cloud solutions?
Do you thrive in designing and managing scalable platforms that keep services running smoothly?
*About Our Team:*
The LexisNexis Intellectual Property (IP) division provides international patent content and a suite of online and analytic tools that meet the evolving needs of the intellectual property market. We deliver data to support LexisNexis IP search and analytics applications, empowering our customers with actionable insights and metrics for critical business decisions.

Our corporate culture thrives on excellence, innovation, and a strong dedication to our customers, employees, and communities. Working here means joining a vibrant, diverse, and collaborative team where you are free to grow and contribute actively.

*About The Role:*
We are seeking a highly skilled and motivated Site Reliability Engineering Lead to lead a team responsible for ensuring the reliability, scalability, and resilience of mission-critical systems. This role is pivotal in managing senior engineers, driving operational excellence, and fostering a culture of continuous improvement.

You will collaborate closely with product, development, architecture, and security teams to implement best practices in site reliability engineering, cloud platform management, and environment support for internal development and customer systems. You will also lead initiatives around incident response, disaster recovery, automation, monitoring, FinOps cost optimisation, and customer support escalations.

*Skills & Experience:*

Cloud Platforms & Services: Azure and AWS (EKS, EC2, S3, RDS, Lambda, Azure VMs, Functions).
Infrastructure as Code: Terraform, ARM/Bicep.
Containerisation & Orchestration: Docker, Kubernetes (EKS/AKS), Helm, ArgoCD.
Monitoring & Observability: Datadog, Splunk, Coralogix, CloudWatch, Azure Monitor, along with an understanding of baseline metrics.
Scripting & Automation: Python, Bash, PowerShell, TypeScript, JavaScript.
Programming Knowledge: Java, .NET/C#, SQL, React (for integration with supported products).
Systems & Networking: Linux/UNIX/Windows administration, networking, and security best practices.
Specialised Knowledge: Databricks, FinOps cost management, disaster recovery planning.
Core Competencies: Incident management, troubleshooting, IT service management frameworks, and GitOps/DevOps practices.

*Soft Skills:*

Solid understanding of Site Reliability Engineering (SRE) principles and practices.
Strong understanding of incident management, monitoring tools, IT service management frameworks and automation processes.
Previous experience in customer-facing roles or managing customer support escalations.
Excellent technical problem-solving and troubleshooting abilities.
Strong communication and interpersonal skills, with the ability to collaborate across teams.
Leadership skills with a track record of mentoring and guiding technical teams.
Strong collaboration and advanced communication skills at the peer and senior management level.
Strong skills in setting, communicating, implementing, and achieving business objectives and goals through indirect leadership of and collaboration with others.
Strong organisation/project planning, time management, and change management skills across multiple functional groups and departments, and strong delegation skills involving prioritising and reprioritising projects and managing projects of various sizes and complexity.
Advanced problem-solving experience involving leading teams in identifying, researching, and coordinating the resources necessary to effectively troubleshoot/diagnose complex project issues; prior success extracting/translating findings into alternatives/solutions; and identifying risks/impacts and schedule adjustments to facilitate management decision-making.
Ability to manage multiple priorities and work effectively in a fast-paced environment.
Passion for continuous learning and staying up-to-date with industry trends and best practices.

*Responsibilities:*

Building & Leading the SRE Organisation
Hire, mentor, and lead a team of SRE and platform engineers to ensure the timely and accurate performance of all team activities.
Foster a culture of reliability, blameless post-mortems, and proactive incident prevention.
Define and implement SRE best practices for reliability, scalability, and performance.
Customer & Incident Management
Manage intake, prioritisation, and resolution of critical customer-reported issues.
Act as an escalation point for high-severity incidents and outages.
In collaboration with Product Support and Development Managers, ensure SLAs, performance benchmarks, and response protocols are met.
Live System Monitoring & Support
Design and maintain robust monitoring, alerting, and incident response systems.
In collaboration with the Product Support Manager, lead incident management from detection to resolution and post-incident analysis.
Ensure system high availability goals are met.
Oversee disaster recovery and business continuity planning within the IP Technology organisation.
Provide support for cloud resource management and workload capacity planning.
Drive automation to reduce manual intervention and improve efficiency.
Platform & Cloud Engineering
Support product development teams with infrastructure, non-functional requirements, and environment stability.
Manage Kubernetes deployments, Databricks environments, and other critical platforms.
Collaborate with cross-functional teams to deliver secure, reliable, and cost-effective platform and cloud solutions.
Ensure all systems comply with security patching and vulnerability management requirements.
In collaboration with architects, provide support for FinOps practices to monitor, optimise, and control cloud costs.
Leadership & Continuous Improvement
Provide clear direction, performance evaluations, and career growth for team members.
Ensure proper documentation, reporting, and compliance with security and regulatory standards.
Promote continuous learning, knowledge sharing, and operational excellence.
Write and review documentation for the management, improvement, and support of platforms/assets.
Complete complex bug fixes and root-cause investigations.
Work closely with development and platform teams to understand requirements and translate them into high-quality solutions.
Implement infrastructure management and deployment best practices, including code/solution reviews.
Operate in various development environments (Agile, Waterfall, etc.) while collaborating with key stakeholders.

Why Join Us?
Join our team and contribute to a culture of innovation, collaboration, and excellence. If you are ready to advance your career and make a significant impact, we encourage you to apply.

Work in a way that works for you
We promote a healthy work/life balance across the organisation and aim to offer our people an appealing way of working. With numerous wellbeing initiatives, shared parental leave, study assistance and sabbaticals, we will help you meet your immediate responsibilities and your long-term goals.

Working flexible hours - flexing the times when you work in the day to help you fit everything in and work when you are the most productive.

Working for you
We know that your well-being and happiness are key to a long and successful career. These are some of the benefits we are delighted to offer:

Dutch Share Purchase Plan
Annual Profit Share Bonus
Comprehensive Pension Plan
Home, office or commuting allowance
Generous vacation entitlement and option for sabbatical leave
Maternity, Paternity, Adoption and Family Care leave
Flexible working hours
Personal Choice budget
A variety of online training courses and career roadshows
Well-being programs and a gym facility in the office
Internal communities and networks
Various employee discounts
Recruitment introduction reward
Work from anywhere
Employee Assistance Program (global)
Annual Event

*About The Business*
A global leader in information and analytics, we help researchers and healthcare professionals advance science and improve health outcomes for the benefit of society. Building on our publishing heritage, we combine quality information and vast data sets with analytics to support visionary science and research, health education and interactive learning, as well as exceptional healthcare and clinical practice. At Elsevier, your work contributes to the world's grand challenges and a more sustainable future. We harness innovative technologies to support science and healthcare to partner for a better world.


