AWS Site Reliability Engineer

2 weeks ago


London, United Kingdom Techruiter Full time

Site Reliability Engineer (SRE) - LLM and Machine Learning
London/Remote
Roles we're searching for now: – Software Engineering /
We are a pioneering technology company specialising in cutting-edge Large Language Model (LLM) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.
As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms.
Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.
Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies.
Experience with configuration management tools.
Knowledge of monitoring and observability tools.
Proficiency in scripting languages (e.g., Python, Bash).



  • London, United Kingdom Salt Full time

    Site Reliability Engineer – Hybrid – London Day rate: £500 - £700 (inside IR35) Start: ASAP My new client is looking for a Site Reliability Engineer to join the team on a contract basis. This is a hybrid role so 2 days per week in the London office. Over 4 years solid SRE experience (No DevOps engineers) AWS experience Monitoring Python,...


  • London, United Kingdom Salt Full time

    Job Description Site Reliability Engineer – Hybrid – London Day rate: £500 - £700 (inside IR35) Start: ASAP My new client is looking for a Site Reliability Engineer to join the team on a contract basis. This is a hybrid role so 2 days per week in the London office. Over 4 years solid SRE experience (No DevOps engineers) AWS experience ...


  • City of London, Greater London, United Kingdom Salt Full time

    Site Reliability Engineer - Hybrid - London Day rate: £500 - £700 (inside IR35) Start: ASAP My new client is looking for a Site Reliability Engineer to join the team on a contract basis. This is a hybrid role so 2 days per week in the London office. Over 4 years solid SRE experience (No DevOps engineers) • AWS experience • Monitoring • Python,...


  • London, United Kingdom Reed Full time

    **SRE | SITE RELIABILITY ENGINEER | DEVOPS | AWS | AMAZON WEB SERVICES | CLOUDFORMATION | KINESIS | CODEPIPELINE | FARGATE | BATCH | PYTHON | GOLANG | DJANGO | REACT | UK | FULLY REMOTE** **Site Reliability Engineer - £80k** A renowned SEO business is looking for a Senior Site Reliability Engineer to build and improve a rapidly evolving infrastructure...


  • London, United Kingdom Redefined Ltd Full time

    Tesla is seeking a Site Reliability Engineer to build, improve, and scale the infrastructure that powers our Energy IoT applications. These applications provide real-time monitoring, optimization, and control for our flagship Tesla Energy products including Powerwall, Megapack, Solar Roof, Supercharger, Autobidder and Virtual Power Plants. You must enjoy...


  • London, United Kingdom Tesla Full time

    Tesla is seeking a Site Reliability Engineer to build, improve, and scale the infrastructure that powers our Energy IoT applications. These applications provide real-time monitoring, optimization, and control for our flagship Tesla Energy products including Powerwall, Megapack, Solar Roof, Supercharger, Autobidder and Virtual Power Plants. You must enjoy...


  • London, United Kingdom Marsh McLennan Companies Full time

    Description: Mercer IT Systems Engineering is seeking candidates for an experienced Site Reliability Engineering Manager for AWS Cloud, based in our London office. We have ambitious and exciting plans to expand further into AWS. Here, you will have the opportunity to share your depth of technical AWS expertise with our great global SRE Cloud...


  • London, United Kingdom Henderson Scott Full time

    **Site Reliability Engineer - AWS - London/Hybrid - £ Negotiable** One of my enterprise consultancy clients is looking for an SRE who has great skills around AWS, monitoring tools and operational expertise. **The Role**: - Triage production support issues. You will effectively monitor a wide range of systems, triage & trouble-shoot bugs - Gain valuable...


  • London, United Kingdom LinuxRecruit Full time

    Fancy working with Python in the Amazon? I'm talking DevOps, not the South American jungle… One of London's top-ranked start-ups is looking to expand its existing platform team in a Lead capacity, mixing a great blend of hands-on work with mentoring and management. Technically this is a DevOps role with an emphasis on the DEV, primarily in Python,...


  • London, United Kingdom Prism Digital Full time

    **Site Reliability Engineer (SRE) | GCP or AWS & Kubernetes | SaaS HealthTech** **100% Remote** After successfully placing several Engineers into the cloud team here, I am now on the lookout for another SRE to join the growing cloud team. If you are passionate about Site Reliability and you are ready for your next challenge, this 'Greenfield' project and...


  • London, United Kingdom Prism Digital Full time

    **Senior Site Reliability Engineer (SRE) | GCP/AWS | Market Intelligence Leaders** We have an exciting opportunity for a Senior Site Reliability Engineer (SRE) to join a global organisation involved in the market intelligence space. Our client's AI-powered platform provides businesses with world-class and real-time consumer analytics. They are looking for...


  • London, United Kingdom Cameron Connect Ltd Full time

    Join Our Client's Dynamic Mortgages Team at the Heart of Technological Innovation! Are you an experienced Java or C# engineer with a passion for building and maintaining reliable, high-performing systems? Do you thrive in roles where you can make a significant impact on the availability, performance, and efficiency of critical services? These opportunities...


  • London, Greater London, United Kingdom MMC Corporate Full time

    Mercer IT Systems Engineering is seeking candidates for an experienced Site Reliability Engineering Manager for AWS Cloud, based in our London office. We have ambitious and exciting plans to expand further into AWS. Here, you will have the opportunity to share your depth of technical AWS expertise with our great global SRE Cloud Engineering team plus wider...


  • London, United Kingdom MMC Corporate Full time

    Mercer IT Systems Engineering is seeking candidates for an experienced Site Reliability Engineering Manager for AWS Cloud, based in our London office. We have ambitious and exciting plans to expand further into AWS. Here, you will have the opportunity to share your depth of technical AWS expertise with our great global SRE Cloud Engineering team plus...


  • London, United Kingdom Lorien Full time

    Site Reliability Engineer Location: London (hybrid remote working) **Salary**: Up to £100,000 + Very Generous Benefits Package One of the fastest growing software development organisations requires a Site Reliability Engineer to help be the glue between the company's Dev, QA and Product teams - enabling the smooth Continuous Build and Integration of new...


  • London, United Kingdom Lorien Full time

    Site Reliability Engineer Location: London (hybrid remote working) **Salary**: Up to £100,000 + Very Generous Benefits Package One of the fastest growing ecommerce organisations requires a Site Reliability Engineer to help be the glue between the company's Dev, QA and Product teams - enabling the smooth Continuous Build and Integration of new instances of...