Platform Engineer

3 days ago


London, Greater London, United Kingdom | Carbon3 - Building the UK's AI Solution Platform | Full time | £40,000 - £80,000 per year

We are building the UK's next generation AI platform, powered by renewable energy, rooted in sovereign capability, and designed to give enterprises and innovators the compute they need.

AI Platform Operations
We are looking for a Support Engineer / Cluster Administrator to provide Level 1 (L1) and Level 2 (L2) support for the AI platform. The role is customer facing and involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.

Key Responsibilities

  • L1 support for customer-reported issues and requests
  • L2 support: diagnose, replicate, and troubleshoot issues across platform and infrastructure
  • Escalate complex issues (L3) to vendor product/engineering teams, coordinate their resolution, and manage vendor responses
  • Monitor system health, alerts, and customer usage patterns
  • Document solutions, workarounds, and support procedures; create and maintain knowledge base content
  • Automate common tasks and fixes
  • Configure and integrate tooling to support optimal operation of the platform, and support tool selection
  • Assist customers with platform configuration, onboarding, and usage best practices
  • Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues
  • Ensure SLAs and customer satisfaction targets are met
  • Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow and billing

Technical Responsibilities

  • Cluster infrastructure management: Manage the Nvidia GPU cluster.
  • High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime.
  • Resource allocation and optimisation: Resource partitioning (GPU resources), workload scheduling, capacity planning.
  • Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with available Nvidia and HPE tools.
  • Incident response: Manage node failures, network issues and driver issues; troubleshoot common problems and work with vendor support to resolve critical ones.
  • Security and access control: Manage user permissions, RBAC, security hardening, data protection.

Required Skills & Experience

  • Extensive experience in technical support, system engineering, or platform operations.
  • Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting).
  • Familiarity with cloud-based platforms, APIs, and distributed systems.
  • Understanding of AI/ML concepts and tooling (basics of model training, inference and data pipelines).
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk).
  • Excellent communication skills to interface with both customers and internal / vendor teams.
  • Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimise their experience.

Core Technical Skills

  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
  • Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes and container orchestration.
  • Understanding of automation, monitoring and security for GPU-as-a-service platforms.

Preferred Experience

  • Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms.
  • Experience with GPU resource allocation (across instances, GPU count and time).
  • Advanced networking skills, including high-performance networking, troubleshooting and fine-tuning.
  • Background in DevOps or SRE practices.
  • ITIL familiarity.

Success Metrics

  • Customers receive timely, effective support with minimal escalations.
  • Issues are resolved or routed correctly with high-quality documentation.
  • The platform maintains strong uptime and customer satisfaction.
