Evals Software Engineer

3 days ago

London, United Kingdom Apollo Research Full time

Applications deadline: Our hiring cycle for 2025 has concluded for now. New applications will be considered from 2026 onwards.

ABOUT APOLLO RESEARCH

The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the deployment of sophisticated yet misaligned models. At Apollo Research, our primary concern lies with deceptive alignment, a phenomenon where a model appears to be aligned but is in fact misaligned and capable of evading human oversight.

Our approach focuses on behavioral model evaluations, which we then use to audit real-world models. We also combine black-box approaches with applied [...]. In our evaluations, we focus on LM agents, i.e. LLMs with agentic scaffolding similar to AIDE or SWE agent. We also study model organisms in controlled environments (see our security policies), e.g. to better understand capabilities related to scheming.

At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you're interested in more details about what it's like working at Apollo, you can find more information here.

THE OPPORTUNITY

We're seeking a Software Engineer who will enhance our capability to evaluate Large Language Models (LLMs) through building critical tools and libraries for our Evals team. Your work will directly impact our mission to make AI systems safer and more aligned.

What You'll Accomplish in Your First Year
1. Accelerate our frontier LLM evaluations research by leading the design and implementation of software libraries and tools that underpin our end-to-end research workflows
2. Ensure the reliability of our experimental results by building tools that identify subtle changes in LLM behavior and maintain integrity across our research
3. Shape the vision for our internal software platform, leading key decisions about how researchers will run workloads, interact with data, analyze results, and share insights
4. Increase team productivity by providing design guidance, debugging, and technical support to unblock researchers and enable them to focus on their core research
5. Build expertise working with state-of-the-art (SOTA) AI systems and tackling the unique challenges posed when building software around them

Key Responsibilities
- Rapidly prototype and iterate on internal tools and libraries for building and running frontier language model evaluations
- Lead the development of major features from ideation to implementation
- Collaboratively define and shape the software roadmap and priorities
- Establish and advocate for good software design practices and codebase health
- Establish design patterns for new types of evaluations
- Build LLM agents that automate our internal software development and research
- Work closely with researchers to understand what challenges they face
- Assist researchers with implementation and debugging of research code
- Communicate clearly about technical decisions and tradeoffs

Job Requirements
You must have experience writing production-quality Python code. We are looking for strong generalist software engineers with a track record of taking ownership. Candidates may demonstrate these skills in different ways. For example, you might have one or more of these:
- Led the development of a successful software tool or product over an extended period (e.g. 1 year or more)
- Started and built the tech stack for a company
- Worked your way up in a large organisation, repeatedly gaining more responsibility and influencing a large part of the codebase
- Authored and/or maintained a popular open-source tool or library
- 5 years of professional software engineering experience

The following experience would be a bonus:
- Experience working with LLM agents or LLM evaluations
- Infosecurity / cybersecurity experience
- Experience working with AWS
- Interest in AI Safety

We want to emphasize that people who feel they don't fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.

Representative projects
- Implement an internal job orchestration tool which allows researchers to run evals on remote machines.
- Build out an eval runs database which stores all historical results in a queryable format.
- Implement LLM agents to automate internal software engineering and research tasks.
- Design and implement research tools for loading, viewing, and interacting with transcripts from eval runs.
- Establish internal patterns and conventions for building new types of evaluations within the Inspect framework.
- Optimize the CI pipeline to reduce execution time and eliminate flaky tests.

ABOUT THE TEAM

The current evals team consists of Mikita Balesni, Jérémy Scheurer, Alex Meinke, Rusheb Shah, Bronson Schoen, Andrei Matveiakin, Felix Hofstätter, and Axel Højmark. Marius Hobbhahn manages and advises the team, though team members lead individual projects. You would work closely with Rusheb and Andrei, who are the full-time software engineers on the evals team, but you would also interact a lot with everyone else. You can find our full team here.

EVALS TEAM WORK

We have recently switched to Inspect as our primary evals framework. If you want to prepare for the SWE role, we recommend playing around with Inspect.

The evals team focuses on the following efforts:
- Conceptual work on safety cases for scheming, for example our work on evaluation-based safety cases for scheming.
- Building evaluations for scheming-related properties, such as situational awareness or deceptive reasoning.
- Conducting evaluations on frontier models and publishing the results either to the general public or a target audience such as AI developers or governments, for example our work in OpenAI's o1-preview system card.
- Creating model organisms and demonstrations of behavior related to deceptive alignment, e.g. exploring the influence of goal-directedness on scheming.
- Designing and evaluating AI control protocols. We have not started these efforts yet but intend to work on them starting Q2 2025.

LOGISTICS

- Start Date: Target of 2-3 months after the first interview.
- Time Allocation: Full-time.
- Location: The office is in London, and the building is shared with the London Initiative for Safe AI (LISA) offices. This is an in-person role. In rare situations, we may consider partially remote arrangements on a case-by-case basis.
- Work Visas: We can sponsor UK visas.

BENEFITS

- Salary: a competitive UK-based salary.
- Flexible work hours and schedule.
- Unlimited vacation.
- Unlimited sick leave.
- Lunch, dinner, and snacks are provided for all employees on workdays.
- Paid work trips, including staff retreats, business trips, and relevant conferences.
- A yearly $1000 (USD) professional development budget.

Equality Statement: Apollo Research is an Equal Opportunity Employer. We value diversity and are committed to providing equal opportunities to all, regardless of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, or sexual orientation.

How to apply: Please complete the application form with your CV. A cover letter is optional. Please also feel free to share links to relevant work samples.

About the interview process: Our multi-stage process includes a screening interview, a take-home test (approx. 2 hours), 3 technical interviews, and a final interview with Marius (CEO). The technical interviews will be closely related to tasks the candidate would do on the job. There are no leetcode-style general coding interviews. If you want to prepare for the interviews, we suggest working on hands-on LLM evals projects (e.g. as suggested in our starter guide), such as building LM agent evaluations in Inspect.

Your Privacy and Fairness in Our Recruitment Process
We are committed to protecting your data, ensuring fairness, and adhering to workplace fairness principles in our recruitment process. To enhance hiring efficiency, we use AI-powered tools to assist with tasks such as resume screening. These tools are designed and deployed in compliance with internationally recognized AI governance frameworks. Your personal data is handled securely and transparently. We adopt a human-centred approach: all resumes are screened by a human, and final hiring decisions are made by our team. If you have questions about how your data is processed or wish to report concerns about fairness, please contact us.

Thank you very much for applying to Apollo Research.

Key Skills: Spring, .NET, C/C++, Go, React, OOP, C#, Data Structures, JavaScript, Software Development, Java, Distributed Systems
Employment Type: Full-Time
Experience: years
Vacancy: 1

Evals Software Engineer

5 days ago

London, Greater London, United Kingdom Apollo Research Full time £60,000 - £100,000 per year

Applications deadline: Our hiring cycle for 2025 has concluded for now. New applications will be considered from 2026 onwards. ABOUT APOLLO RESEARCH The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse...
Evals Software Engineer

15 hours ago

Greater London, United Kingdom Apollo Research Full time

Applications deadline: Our hiring cycle for 2025 has concluded for now. New applications will be considered from 2026 onwards. ABOUT APOLLO RESEARCH The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the...
Evals Research Scientist

1 week ago

City Of London, United Kingdom Apollo Research Full time

Evals Research Scientist / Engineer at Apollo Research Application Deadline: We’re accepting applications until 31 October 2025. Applications are considered on a rolling basis and may take multiple weeks for a response. About The Opportunity We’re looking for Research Scientists and Research Engineers who are excited to work on safety evaluations, the...
Evals Research Scientist / Engineer

3 days ago

London, United Kingdom Apollo Research Full time

Application deadline: We're accepting applications until 31 October 2025. We encourage early submissions and will start interviews in early October. ABOUT THE OPPORTUNITY We're looking for Research Scientists and Research Engineers who are excited to work on safety evaluations, the science of scheming, or control/monitoring for frontier models. YOU WILL HAVE THE...
Evals Research Scientist

5 days ago

London, Greater London, United Kingdom Apollo Research Full time £100,000 - £200,000 per year

Application deadline: We're currently considering applications on a rolling basis. It can take multiple weeks until we respond, even if you are a great fit. ABOUT THE OPPORTUNITY We're looking for Research Scientists and Research Engineers who are excited to work on safety evaluations, the science of scheming, or control/monitoring for frontier...
Staff Software Development Engineer

12 hours ago

London, United Kingdom Google Full time

Bachelor’s degree or equivalent practical experience. ~8 years of experience with ML design and architecture; and testing/launching software products. ~5 years of experience in software development. ~5 years of experience in leading technical project strategy, machine learning (ML) design, and optimizing ML infrastructure (e.g., model deployment, model...
Staff Software Development Engineer

11 hours ago

London, United Kingdom Google Full time

Bachelor's degree or equivalent practical experience. ~8 years of experience with ML design and architecture; and testing/launching software products. ~5 years of experience in software development. ~5 years of experience in leading technical project strategy, machine learning (ML) design, and optimizing ML infrastructure (e.g., model deployment, model...
Staff Software Engineer, AI Data, Evaluation Platform

1 day ago

London, Greater London, United Kingdom Google Full time £80,000 - £150,000 per year

Minimum qualifications: Bachelor's degree or equivalent practical experience. 8 years of experience with ML design and architecture; and testing/launching software products. 5 years of experience in software development. 5 years of experience in leading technical project strategy, machine learning (ML) design, and optimizing ML infrastructure (e.g., model deployment, model...
Staff Software Engineer, AI Data, Evaluation Platform

12 hours ago

London, United Kingdom Google Full time

Minimum qualifications: Bachelor’s degree or equivalent practical experience. 8 years of experience with ML design and architecture; and testing/launching software products. 5 years of experience in software development. 5 years of experience in leading technical project strategy, machine learning (ML) design, and optimizing ML infrastructure (e.g., model...
Staff Software Engineer, AI Data, Evaluation Platform

20 hours ago

London, United Kingdom Google Full time

Minimum qualifications: Bachelor’s degree or equivalent practical experience. 8 years of experience with ML design and architecture; and testing/launching software products. 5 years of experience in software development. 5 years of experience in leading technical project strategy, machine learning (ML) design, and optimizing ML infrastructure (e.g., model deployment, model...
