Current jobs related to Site Reliability Engineer, Cloud Incident Response - City Of London - SS&C Technologies Holdings


  • City Of London, United Kingdom SS&C Technologies Full time

    As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology. Job...


  • Greater London, United Kingdom SS&C Technologies Full time

    Site Reliability Engineer, Cloud Incident Response. As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely...


  • London, United Kingdom SS&C Full time

    As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology. Job...


  • London, Greater London, United Kingdom SS&C Full time £90,000 - £120,000 per year

    As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology. Job...


  • London, Greater London, United Kingdom SS&C Technologies Full time £90,000 - £120,000 per year

    As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology. Job...


  • City Of London, United Kingdom N Consulting Limited Full time

    Site Reliability Engineer at N Consulting Ltd. Location: London, United Kingdom. Salary: £70,000 - £75,000 per year. Job type: Contract. Date posted: September 22nd, 2025. Role: Site Reliability Engineer (SRE). Work mode: Hybrid. Contract role. Job description: A Site Reliability Engineer is responsible for transforming the...


  • City Of London, United Kingdom Google Inc. Full time

    Senior Software Engineer, Site Reliability Engineering, Cloud IRT. Google, London, UK. Bachelor’s degree in Computer Science, a related field, or equivalent practical experience. 5 years of experience with software development in one or more programming languages. 3 years of experience in designing, analyzing, and troubleshooting...


  • City Of London, United Kingdom SS&C Technologies Holdings Full time

    A global financial services firm in London is seeking a skilled Site Reliability Engineer to enhance production reliability. In this hybrid role, you will leverage your expertise in observability and cloud infrastructure to ensure system operability and scalability. Ideal candidates have over 5 years of experience and practical knowledge of Kubernetes and...


  • London Area, United Kingdom Response Informatics Full time £60,000 - £120,000 per year

    Job description: We're looking for a Lead Cloud Site Reliability Engineer (SRE) with strong expertise in Azure, Kubernetes, Terraform, and GitHub to lead large-scale projects and mentor a growing team. Key responsibilities: Lead SRE activities for large-scale cloud projects, providing technical guidance to engineers. Deliver solutions across VMs and Kubernetes,...


  • City Of London, United Kingdom Natobotics Full time

    Overview: Site Reliability Engineer role at Natobotics. Location: London. Work Mode: Hybrid. Contract Role. Experience Level: 15+ Years. A Site Reliability Engineer is responsible for transforming the SDLC environment in an engineering-focused role that emphasizes system reliability, automation, and performance in a non-production...

Site Reliability Engineer, Cloud Incident Response

2 weeks ago


City Of London, United Kingdom SS&C Technologies Holdings Full time

Job Description

SS&C is leading the way. We continually look for today’s and tomorrow’s brightest talent, those who embody a spirit to improve not only their lives but those around them. From college students to seasoned professionals, we encourage you to apply. SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation, to name a few.

Site Reliability Engineer
Location: London office, hybrid, 2 days per week onsite

About the Role

We’re seeking a hands-on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You’ll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You’ll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You’ll Do

  • Keep production healthy: Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post-incident actions.
  • Observability as a first-class practice: Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, root cause issues, and create actionable alerts and dashboards.
  • Run Kubernetes at scale: Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
  • Build reliable cloud foundations: Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security-first mindset.
  • Automate with IaC: Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
  • Enable fast, safe delivery: Partner with teams to enhance CI/CD pipelines (e.g., GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
  • Own reliability metrics: Define and iterate on SLOs/SLIs/error budgets; champion blameless post-mortems and reliability reviews.
  • Participate in on-call: Join a fair, well-documented on-call rota; improve runbooks, automation, and alert quality to make on-call sustainable.
  • Drive strategic improvements: Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, back-pressure, circuit breakers, chaos engineering).

What You Will Bring

  • 5+ years operating production systems as an SRE, DevOps engineer, or software engineer.
  • Observability: Hands-on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
  • Kubernetes: Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm).
  • AWS: Solid practical knowledge of core services and best practices for security, cost, and reliability.
  • Terraform: Confident with module design, state management, DRY patterns, and CI for IaC.
  • On-call experience: Demonstrated participation in a production on-call rota, effective incident communication, and post-incident follow-through.
  • Collaboration & communication: Ability to work cross-functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice-to-Have

  • EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
  • Prometheus, OpenTelemetry, Loki/Tempo, alert tuning and SLO burn-rate alerts.
  • Argo CD/FluxCD, Helm chart authoring, Kustomize.
  • CD patterns (blue/green, canary, feature flags), GitOps workflows.
  • Database operations (Postgres/MySQL), caching (Redis), message queues (Kafka/SQS).
  • Security & compliance (CIS benchmarks, IAM boundaries, secrets management, Vault/Sealed Secrets).
  • Resilience testing/chaos engineering.
  • Relevant certs (AWS Solutions Architect/DevOps Engineer, CKA/CKAD, Terraform Associate).

How We Work

  • Hybrid: Two days per week in the office for collaboration and incident/architecture reviews; remote the rest.
  • Engineering excellence: Blameless culture, well-defined SLOs, automation-first, and continuous learning.
  • Impact focus: Measure success via availability, latency, MTTR, change failure rate, toil reduction, and customer outcomes.

On-Call Expectations

  • Participate in a rotating on-call schedule with clear escalation paths.
  • Improve alert signal-to-noise ratio and operational readiness (dashboards, runbooks, playbooks).
  • Post-incident reviews focused on learning and durable improvements, with no blame.

Benefits

  • Competitive salary + bonus (DOE)
  • Pension and comprehensive benefits
  • Modern tooling and time allocated for reliability improvements

We encourage applications from people of all backgrounds to enable us to bring diverse perspectives to our thinking and conversation. It's important to us that we strive to have a workforce that is diverse in the widest sense.