Site Reliability Engineer, Cloud Incident Response

2 weeks ago

London, Greater London, United Kingdom | SS&C | Full time | £90,000 - £120,000 per year

As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology.

Job Description

Get To Know Us:

SS&C is leading the way. We continue to look for today's and tomorrow's brightest talent: people who want to improve not only their own lives but also the lives of those around them. From college students to seasoned professionals, we encourage you to apply. SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation, to name a few.

Site Reliability Engineer

Location: London office, hybrid — 2 days per week onsite

About the Role

We're seeking a hands-on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You'll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You'll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You'll Do

Keep production healthy: Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post-incident actions.
Observability as a first‑class practice: Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, root cause issues, and create actionable alerts and dashboards.
Run Kubernetes at scale: Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
Build reliable cloud foundations: Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security-first mindset.
Automate with IaC: Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
Enable fast, safe delivery: Partner with teams to enhance CI/CD pipelines (e.g., GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
Own reliability metrics: Define and iterate on SLOs/SLIs/error budgets; champion blameless post‑mortems and reliability reviews.
Participate in on‑call: Join a fair, well‑documented on‑call rota; improve runbooks, automation, and alert quality to make on‑call sustainable.
Drive strategic improvements: Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, back‑pressure, circuit breakers, chaos engineering).

What You'll Bring

5+ years operating production systems as an SRE, DevOps engineer, or software engineer.
Observability: Hands‑on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
Kubernetes: Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm/Kustomize).
AWS: Solid practical knowledge of core services and best practices for security, cost, and reliability.
Terraform: Confident with module design, state management, DRY patterns, and CI for IaC.
On‑call experience: Demonstrated participation in a production on‑call rota, effective incident communication, and post‑incident follow‑through.
Scripting & engineering fundamentals: Proficiency in at least one of Python, Go, or Bash; strong Linux, networking (DNS, TLS, HTTP, TCP), and Git.
Collaboration & communication: Ability to work cross‑functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice‑to‑Have

EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
Prometheus, OpenTelemetry, Loki/Tempo, alert tuning and SLO burn‑rate alerts.
Argo CD/FluxCD, Helm chart authoring, Kustomize.
CD patterns (blue/green, canary, feature flags), GitOps workflows.
Database operations (Postgres/MySQL), caching (Redis), message queues (Kafka/SQS).
Security & compliance (CIS benchmarks, IAM boundaries, secrets management, Vault/Sealed Secrets).
Resilience testing/chaos engineering.
Relevant certs (AWS Solutions Architect/DevOps Engineer, CKA/CKAD, Terraform Associate).

How We Work

Hybrid: Two days per week in the office for collaboration and incident/architecture reviews; remote the rest.
Engineering excellence: Blameless culture, well-defined SLOs, automation‑first, and continuous learning.
Impact focus: Measure success via availability, latency, MTTR, change failure rate, toil reduction, and customer outcomes.

On‑Call Expectations

Participate in a rotating on-call schedule with clear escalation paths.
Improve alert signal-to-noise ratio and operational readiness (dashboards, runbooks, playbooks).
Post-incident reviews focused on learning and durable improvements—no blame.

Benefits

Competitive salary + bonus (DOE)
Pension and comprehensive benefits
Modern tooling and time allocated for reliability improvements

We encourage applications from people of all backgrounds to enable us to bring diverse perspectives to our thinking and conversation. It's important to us that we strive to have a workforce that is diverse in the widest sense.

Thank you for your interest in SS&C. To explore this opportunity further, please apply directly with us through the Careers page on our corporate website.

Unless explicitly requested or approached by SS&C Technologies, Inc. or any of its affiliated companies, the company will not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.

SS&C Technologies is an Equal Employment Opportunity employer and does not discriminate against any applicant for employment or employee on the basis of race, color, religious creed, gender, age, marital status, sexual orientation, national origin, disability, veteran status or any other classification protected by applicable discrimination laws.
