Site Reliability Engineer

2 weeks ago


Letchworth Garden City, United Kingdom WALT Labs Full time

At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. We specialize in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), and we are seeking a dedicated Site Reliability Engineer (SRE) who is passionate about technology, excels in problem-solving, and is committed to providing unparalleled customer service. You will become the subject-matter expert (SME) on the scale, resiliency, and uptime of both our own environments and the customer environments we support.

Role Summary

As a critical member of our team, the SRE will provide technical support and expertise to our managed services clients. This role involves diagnosing and resolving complex issues across diverse cloud environments and technologies, ensuring high performance and reliability. The ideal candidate is a tech enthusiast, eager to expand their knowledge and skills daily, committed to problem-solving and delivering customer-focused solutions within defined Service Level Agreement (SLA) guidelines.

Key Responsibilities:
  • Ensure high availability and reliability of software systems and infrastructure by building out SLOs and SLAs and continuously improving system reliability.
  • Design, implement, and maintain monitoring and alerting systems to detect and address issues proactively, primarily using Datadog, GCP Cloud Monitoring, and PagerDuty/Incident.io.
  • Debug and troubleshoot production issues across various customer environments, technology stacks, and cloud providers, primarily focusing on GCP and AWS.
  • Participate in an on-call rotation to respond to and resolve production incidents, and conduct RCAs/postmortems to identify and address underlying issues.
  • Develop and maintain runbooks and playbooks for incident response and troubleshooting.
  • Proactively identify bottlenecks and areas for improvement, and optimize systems and application environments accordingly.
  • Conduct load testing and capacity planning to ensure systems can handle expected traffic and growth.
  • Develop and maintain IaC (Terraform) and configuration management tooling (for example, Ansible and Helm).
  • Work closely with development teams to understand system architecture, identify potential reliability risks, and implement solutions.
  • Collaborate with operations teams to ensure smooth deployment and operation of software systems.
  • Master a broad range of technologies, including but not limited to VMs, container orchestration, networking, security, databases, data warehouses, serverless technologies, and storage solutions.
  • Proficiently deploy applications into Kubernetes using Helm, and manage Kubernetes administration and troubleshooting.
  • Provide direct support to clients during production outages, offering expert assistance to swiftly rectify issues, adhering to SLA expectations.
  • Diligently document solutions and processes, constantly seeking to improve knowledge, skills, and operational efficiency.

Requirements

  • 3+ years of experience in an SRE role.
  • A deep-rooted understanding of how important SLOs, SLIs, and KPIs are to the systems you support, with observability as your grounding point on a daily basis.
  • Extensive knowledge of all major services in GCP (Cloud Run, BigQuery, GKE, etc.).
  • In-depth knowledge of all major services in AWS
  • Experience in setting up and managing monitoring solutions like Datadog, Google Cloud Operations Suite, CloudWatch, Nagios, and Zabbix.
  • Familiarity with various CI/CD systems (Jenkins, Codefresh, GitLab CI, GitHub Actions, Argo CD).
  • Exceptional problem-solving capabilities, the ability to work under pressure, and strong critical thinking skills.
  • The ability to act as the voice and commander of incidents, both those managed internally and those that are customer-facing.
  • A passion for technology and an unquenchable thirst for learning new skills.
  • A customer-focused mindset, dedicated to delivering the highest level of service.

Benefits

  • We cover 100% of your base medical plan
  • Dental, vision, disability, and life insurance available
  • Generous PTO policy that increases with longevity
  • 401(k)
  • Professional development and advancement opportunities
  • Bonus incentives


  • Letchworth Garden City, United Kingdom Circle Recruitment Full time

    Site Reliability Engineer - Letchworth (Hybrid) DevOps Engineer - Site Reliability Engineer - Terraform - Kubernetes - GCP - Azure - Cloud Engineering - AWS - CI/CD - Grafana - Ansible - Configuration Management - IT Support - Incident Management - Troubleshooting Are you a tech-savvy professional with a passion for cloud infrastructure and reliability? Do...

  • Site Engineer

    7 days ago


    Letchworth Garden City, United Kingdom RTL Group Ltd Full time

    My client is a leading sub-contractor who cover UK Wide. They are looking to on-board a site engineer to manage the engineering of a new contract that they have won in Hertfordshire. The scope of works you will be required to manage includes setting out of RC Frames. Site engineer responsibilities: * Site set up and setting out. * As-built surveys. ...

  • Site Engineer

    7 days ago


    Letchworth Garden City, United Kingdom RTL Group Ltd Full time

    My client is a leading sub-contractor who cover UK Wide. They are looking to on-board a site engineer to manage the engineering of a new contract that they have won in Hertfordshire. The scope of works you will be required to manage includes setting out of RC Frames. Site engineer responsibilities: * Site set up and setting out. * As-built surveys. * QA. *...


  • City of London, Greater London, United Kingdom Square One Resources Full time

    Site Reliability Engineer | Remote | Application Development City of London Posted 4 days ago Work Type Contract Remote Work Yes IR35 Status Inside IR35 Job Title: Infrastructure Site Reliability Engineer (infra SRE) Location: Fully remote Salary/Rate: up to £710 inside IR35/ Day Start Date: 06/06/2024 Job Type: 6 Month Initial Contract (2-3...


  • City of London, Greater London, United Kingdom Mondrian Alpha Full time

    A world-leading multi-strat systematic fund is seeking an automation-heavy (Python / PowerShell) infrastructure site reliability engineer who primarily has experience in Windows environments and a specialism in storage. You'd be joining an SRE...


  • City of London, Greater London, United Kingdom Square One Resources Full time

    Job Title: Infrastructure Site Reliability Engineer (infra SRE) Location: Fully remote Salary/Rate: up to £710 inside IR35/ Day Start Date: 06/06/2024 Job Type: 6 Month Initial Contract (2-3 year program) The Site reliability engineers (SREs) combine engineering experience and an innate drive to improve existing systems and processes, with the...


  • City of London, Greater London, United Kingdom Square One Resources Full time

    Fully remote. Job Type: 6 Month Initial Contract (2-3 year program). The Site reliability engineers (SREs) combine engineering experience and an innate drive to improve existing systems and processes, with the creativity to develop novel solutions to evolving challenges. RESTful services - RPC services) in one or more programming languages such as Go, Java, C...

  • Site Manager

    3 weeks ago


    Letchworth Garden City, United Kingdom Bennett & Game Recruitment Full time

    Job Profile for Site Manager - SW156329. Our client, a Regional House Builder based in Letchworth, is seeking a Site Manager to join them on a full-time, permanent basis. The initial site is in Welwyn Garden City with further sites across Hertfordshire. The Site Manager will be responsible for all day-to-day site activities, reporting into the Contracts...


  • City of London, Greater London, United Kingdom Bayside Solutions Full time £91,400 - £108,000

    Contract. London, England - Hybrid Role. We seek a Site Reliability Engineer to join our team and play a crucial role in ensuring the reliability, availability, and performance of our applications and services. This role requires a strong background in application support, monitoring, and cloud technologies, focusing on AWS, Azure, and Kubernetes. Java troubleshooting...


  • City of London, United Kingdom Investigo Full time

    SRE Contract, 6 Months, 3 days per week on site We are seeking a skilled Site Reliability Engineer (SRE) for a six-month contract for one of our consultancy clients. The role involves joining a project centred on developing applications to provide a cloud-based platform for client users. The ideal candidate should have a robust background in Agile,...

  • Reliability Engineer

    4 weeks ago


    Bristol (City Centre), United Kingdom MBDA Full time

    Bristol. MBDA is a leading defence organisation. We are proud of the role we play in supporting the Armed Forces who protect our nations. We partner with governments to work together towards a common goal, defending our freedom. Salary: Up to £60,000 depending on experience. What we can offer you: Company bonus of up to £2,500 (based on company performance and...


  • City of London, South East, United Kingdom Oliver Bernard Ltd Full time

    Site Reliability Engineer - Puppet Specialist. The experience expected from applicants, as well as additional skills and qualifications needed for this job, are listed below. A media client of ours is currently seeking a Site Reliability Engineer with expert Puppet experience to join their already well-established team. The current team consists of around 10...


  • City of London, Greater London, United Kingdom Mondrian Alpha Full time

    My client, a leading high-frequency trading firm, is seeking a database site reliability engineer to join their office in London. Apply fast, check the full description by scrolling below to find out the full requirements for this role. This is an opportunity to be the first individual to join a newly created team, and have the responsibility of setting the...


  • Welwyn Garden City, United Kingdom Premier Group Recruitment Full time

    JOB: Reliability / Compliance Lead. LOCATION: Welwyn Garden City. TERM: Permanent. SALARY: £50,000 - £63,000 per annum (dependent on experience). We are looking for a Compliance Engineer, Reliability Engineer or similar Engineer, Lead or Manager on a permanent basis in the Welwyn Garden City area with experience in the manufacturing...


  • City of London, Greater London, United Kingdom Investigo Full time

    SRE Contract, 6 Months, 3 days per week on site We are seeking a skilled Site Reliability Engineer (SRE) for a six-month contract for one of our consultancy clients. The role involves joining a project centred on developing applications to provide a cloud-based platform for client users. The ideal candidate should have a robust background in Agile,...


  • Welwyn Garden City, Hertfordshire, United Kingdom Premier Group Recruitment Full time

    JOB: Reliability / Compliance Lead. LOCATION: Welwyn Garden City. TERM: Permanent. SALARY: £50,000 - £63,000 per annum (dependent on experience). We are looking for a Compliance Engineer, Reliability Engineer or similar Engineer, Lead or Manager on a permanent basis in the Welwyn Garden City area with experience in the manufacturing industry. Your main duty will...


  • Welwyn Garden City, United Kingdom Premier Engineering Full time

    JOB: Reliability / Compliance Lead. LOCATION: Welwyn Garden City. TERM: Permanent. SALARY: £48,000 - £63,000 per annum (dependent on experience). We are looking for a Compliance Engineer, Reliability Engineer or similar Engineer, Lead or Manager on a permanent basis in the Welwyn Garden City area with experience in the manufacturing industry. Your main duty will be...


  • City of London, Greater London, United Kingdom Kioni Talent Full time £90,000 - £140,000

    60 second overview Company | Global FinTech Areas | SRE, DevOps, Software Engineering, Software Deployment Skills | Python, Java, Kubernetes, Terraform, SQL Based | London with option to work remotely 1 day per week E + bonus + benefits Kioni are partnering with a global FinTech who have established themselves as a household name within the capital...

  • Site engineer

    3 weeks ago


    Welwyn Garden City, United Kingdom RTL Group Ltd Full time

    My client is a leading sub-contractor who cover UK-Wide. They are looking to on-board a site engineer to manage the engineering of a number of new contracts that they have won. The scope of works you will be required to manage includes setting out of Groundworks, Drainage & RC Frame. Site engineer responsibilities: * Site set up and setting out. *...