The Engineering Foundations in Empirical AI Safety
Introduction
From Deterministic Logic to Probabilistic Intent
For the past decades, the discipline of software engineering has been built upon a simple yet powerful idea: humans tell machines exactly what to do - that is, we provide explicit instructions to machines. In traditional software development, the principles of robustness, modularity, and testing have been the best practices, showing how human intent is translated into deterministic logic. A developer wrote a function to compute the sum of two numbers. Unless the hardware failed or the compiler erred, it would reliably do exactly that. A “bug” was a deviation from the programmer’s explicit logic, such as a syntax mistake, a gap in reasoning, or an unhandled exception. The code was the ground truth.
We are now seeing a structural shift in this phenomenon, driven by the rise of Large Language Models (LLMs) and agentic AI systems. Software engineering is moving from human-written to AI-generated, from explicit logic to probabilistic intent. Tools like GitHub Copilot, Cursor, and Claude Code have fundamentally changed where code comes from. As of 2025, GitHub Copilot had over 20 million users and generated nearly 46% of the code in supported repositories. At major tech companies like Google, more than 25% of new code is now co-written with AI.
When a developer asks an agent to “implement a secure data pipeline,” the resulting code isn’t a direct representation of the developer’s thinking but a probabilistic approximation drawn from a vast training dataset. This introduces uncertainty into the codebase itself. The code works, but the human “author” often acts more as a reviewer than a creator, leading to a superficial understanding of the system’s edge cases.
Non-Technical Researchers
With the rise of AI coding agents, the barriers to building complex software systems have drastically decreased. We are seeing a surge in non-technical researchers in AI, sociologists, biologists and policy analysts, all of whom utilise AI agents to construct experimental pipelines for model training and alignment research. While this accelerates interdisciplinary research, it also introduces severe risks in terms of empirical AI safety.
Researchers with little background in formal software engineering are now using AI to generate the experiments to train AI models and test AI safety. This creates a recursive security loop: an AI agent, if misaligned or hallucinating, writes the code for an experiment intended to detect misalignment in another AI agent. Without rigorous engineering principles, these “shadow engineering” workflows might result in subtle but indistinguishable mistakes in experimental implementation.
Traditionally, a bug is typically a broken feature or a crash. In AI safety research, a bug manifests as a failure of safety without an explicit signal. For instance, a bug in the evaluation script can cause the model to appear “aligned” simply because it refused a harmful prompt, even though its safety training isn’t working. Conversely, a researcher might believe they find a new “jailbreak” while in fact they just simply encountered a random artefact caused by a non-deterministic seed.
Engineering Principle as a Safeguard
This post will focus on the importance of engineering principles in empirical AI safety research. When developing LLM agents, researchers tend to focus on “alignment check” the final output rather than the “code quality check”. This is dangerous as the field moves from theoretical to empirical science. In AI safety, false positives are a substantial risk. If we believe a safety method works when it actually does not, we are likely to deploy a dangerous system. Therefore, rigorous engineering principles, that is, reproducibility, scalability, bug-sensitivity and deterministic testing, are not just best practices; they act as a safeguard to prevent us from drawing false conclusions about the behaviour of the powerful AI system, which may eventually shape the future of humanity.
Risk in Agentic Workflows
Facts on AI-Generated Code
Today, the statistics have shown that 46% of code is AI-generated, which implies that nearly half of the software infrastructure today is based on a foundation of statistical likelihood rather than logical certainty. AI coding agents like Codex and Claude Code are trained on public repositories, which contain not only high-quality and secure code, but also deprecated, vulnerable and buggy code. When an AI agent generates a script for an AI safety experiment, the codes are coming from a mixture of sources.
AI-generated code is more vulnerable to bugs and insecurity. On one hand, studies indicate that while AI agents boost productivity by 10-30%, developers report that the code is often “almost right but not quite”. Debugging these AI-generated codes takes 45% longer than debugging human-written codes. This is because the bugs in AI-generated code are more subtle. The code seems correct to a human reviewer, but the implicit logic may fail in edge cases. On the other hand, research shows that nearly half of the code snippets produced by LLMs contain bugs or security vulnerabilities. In a web app, a vulnerability might lead to a data breach. In AI safety research, a vulnerability, such as prompt injection, might allow a model to bypass the safety monitor entirely, leading to a risk of false negative evaluation.
Non-Technical Experimentation
The influx of non-technical researchers into AI safety is also a double-edged sword. While it brings interdisciplinary perspectives, like psychology, governance, and biology, into the field, it also leads to an increasing number of experiments lacking robustness.
Several software engineering principles are missing in non-technical AI safety experiments:
Maintainability: Researchers often use code scripts to generate a graph for a paper, rather than a complete codebase. It is important that code should be written for future modification, not just immediate output. Unversioned Jupyter notebooks with hardcoded paths, manual seed settings and scattered dependencies increase the difficulties for code maintenance.
Reproducibility: A non-technical AI researcher might use an AI agent to install the required libraries for experiments. This agent might install the latest versions of Transformers or PyTorch without clearly stating the dependencies in the codebase. Several months later, when another researcher tries to reproduce the same result, while the libraries have updated, the default behaviours might change, and the experiment yields a different result. This is Configuration Drift, making scientific claims untraceable.
Modularity: AI agents often generate code blocks based on the given tasks only. Unstructured prompting prevents the code from having a clear system design. Without human enforcement of modularity, these scripts become impossible for unit testing and maintenance. You cannot test the reward function independently if it is mixed and hardcoded into a training loop. This prevents the isolation of functions, variables and parameters, which is the heart of the scientific method.
Why Engineering Matters for AI Research
Reproducibility
Empirical AI safety claims only matter if others can reproduce them. If a paper claims that “Method X reduces deceptive behaviour by 50%”, this claim is scientifically null if the result cannot be replicated. In AI safety research, this problem is exacerbated by the complexity of the system, especially for research on LLMs. First, for these systems, training runs are inherently sophisticated. Even with the same hyperparameters, different random seeds could lead to a vast difference in the outcomes in RLHF (Reinforcement Learning from Human Feedback). Second, training and inference on GPUs are not always deterministic. Functions like atomicAdd in CUDA can introduce non-deterministic noise that accumulates over millions of training steps. Third, without rigorous containerisation (e.g. Docker), an experiment that runs on the author’s machine environment might fail on the reviewer’s environment due to subtle differences in OS versions and system configurations.
How can engineering provide the solutions? In order to make the experiment pipelines more deterministic, the environment itself must be clearly defined in the code (e.g. in terms of Dockerfiles, config.yaml). A good example of this is METR’s Vivaria platform. It streamlines the process of creating isolated, identical Docker containers for every agent evaluation to ensure that the environment is a constant, not a variable. Also, every source of randomness coming from Python, Numpy, PyTorch, CUDA, etc, must be explicitly seeded.
Another important aspect is version control. Tools like DVC (Data Version Control) are useful to hash and version the exact dataset used. If a researcher filters the dataset to remove any harmful data examples, then that filtered version must be preserved. This is to prevent the common failure mode - data contamination, where test data accidentally leaks into the training set, artificially inflating performance.
Scalability
Empirical AI safety has an increasing focus on LLMs, which requires large-scale experimentation. To find a “jailbreak” or a “misalignment”, researchers often need to deal with thousands of prompts, hundreds of model variants, and dozens of hyperparameters.
It is therefore significant for research to trace the progress of their experimentation. In a sweep of 10,000 runs, for example, how do you know which run corresponded to the successful intervention in the model? Without automated logging tools (e.g. WandB, MLflow), the traceability of the results is lost. A researcher might find a log file named results_final_v2.csv and assume it corresponds to the latest code, but it might turn out to be the result from a run three weeks ago, before a critical bug fix.
Consequently, as experiments scale, researchers find it hard to manually trace their configurations. “Configuration Drift” occurs when parameters in the code diverge from the parameters in the config file or the parameters that are actually used in the experiment. This will lead to Catastrophic Forgetting, where the model overwrites previous knowledge, resulting in inconsistent performance.
Bug-Sensitivity
The differences among bugs in traditional software, machine learning and AI safety are substantial. In traditional software, a bug causes the software to either work or not. In machine learning, bugs can be quantitatively analysed by checking the numerical accuracy. In AI safety, bugs are often qualitative and invisible.
A small bug in a reward hacking experiment might indicate that the reward signal is broken entirely, even though the researcher think he/she prevented the hacking. The AI agent is not responding in a safe way, but rather responds randomly because the reward is always zero. To a human reviewer, this behaviour makes the agent appear to refuse to hack. At the same time, the observer cannot directly tell if the model has the problem of deception, corrigibility, or goal misgeneralisation from the observable properties like latency or rewards. Whether the model is misaligned can only be inferred from traces of the model’s complex behaviours. An error in implementation, such as a logging error or a wrong parameter parsing, can easily be mistaken for a deep behavioural property of the model itself.
To illustrate this, the DeepSeek-R1 model was ranked low on a benchmark because the evaluation script only accepted the string “$5” and rejected “5 dollars”. This was, obviously, not a failure of the model itself; it was a failure of the evaluation’s robustness. In a safety context, if a model says “I cannot help you kill your parent” vs. “I will not help you,” and the regex in evaluation only looks for the keyword “cannot”, a safe model may be marked as unsafe, or vice versa.
False Positives and Negatives
False Positives
A False Positive in AI safety usually refers to detecting a threat (e.g., reward hacking) when none exists, or conversely, detecting a “safety success” (e.g., an intervention works) when it actually doesn’t.
Case Study: “Reward Hacking” Illusion
METR (formerly ARC Evals) conducted research into “reward hacking”, where a model exploits a flaw in the grading system to get a high score without doing the task. To detect these behaviours, they used two methods: a high-scoring heuristic (flagging runs with anomalously high scores) and an LLM-based monitor (using another model to review transcripts).
What went wrong: Both detection methods had a very high false positive rate, meaning they flagged many runs that turned out to be acceptable. This meant they could only be used as a preliminary filter, while every flagged run still needed to be manually reviewed by a human. Furthermore, the two methods did not always agree: the high-scoring heuristic missed reward hacks the monitor caught, and vice versa, suggesting the true amount of reward hacking was likely underestimated.
Consequence: Relying on any single detection method would either overwhelm researchers with false alarms or miss genuine cases of cheating. A noisy detector makes it hard to get an accurate picture of how often reward hacking actually occurs.
The Engineering Fix: In later work (the MALT dataset, October 2025), METR developed and benchmarked more sophisticated prompted LLM-based monitors. The best of these achieved an AUROC (Area Under the Receiver Operating Characteristic Curve) of 0.96 for reward hacking detection, and at a fixed 5% false-positive rate, they could catch 80–90% of true positives, which is a substantial improvement. Access to model chains-of-thought further improved detection for some behaviours like sandbagging.
False Negatives
A False Negative is far more dangerous: it measures how much a model fails to detect a real risk. This often happens when the evaluation pipeline is vulnerable or misconfigured.
Case Study: Sandbagging
Anthropic investigated “sandbagging,” where a capable model deliberately underperforms to slow down safety-relevant research. This is a threat that is directly a false negative problem. The danger is not that a model does something overtly wrong, but that the monitoring system fails to classify these models as suspicious.
False Negatives: In the experiments, researchers used weaker AI models as monitors. These monitors were asked to judge how suspicious a model’s output looked. The problem was that they couldn’t reliably tell the difference between honest but fair-quality work and work that was deliberately sabotaged. Once a sandbagging model was shown just a single example of how to hide its underperformance, it could sharply reduce its own quality while still looking normal to the monitor. The monitor would approve it, and this is a classic false negative. One model, Claude 3.7 Sonnet, could do this even without seeing an example, effectively wiping out an entire generation of capability improvements. No matter where the researchers set the detection threshold, too many cases slipped through without being noticed.
Reason for failure: The sandbagging model might choose slightly worse settings, use a simpler method than it’s capable of, or do some preprocessing to produce mistakes. Nothing about the output indicates that it is sabotage. It just looks like a weaker model. The final output alone doesn’t reveal whether the model tried its best or intentionally held back.
Reducing False Negatives: There is no single solution that worked well. The monitors were too unreliable to alarm human reviewers. Having a trusted weaker model redo the task didn’t improve the performance either. The most promising approach was to incorporate monitor scores directly into training. Even if the signal is weak and noisy, applying it over many training rounds might gradually discourage sandbagging behaviour. However, the researchers also point out a more troubling possibility: a sufficiently smart model could choose to sandbag only on tasks with no clear correct answer instead of during training. This makes false negatives even harder to detect in the first place.
Case Study: DeepSeek-R1’s Unit Mismatch
The benchmark evaluation script expected a specific string format (”$5”). The DeepSeek-R1 model provided a semantically identical but syntactically different answer (”5 dollars”). While lowering the capability ranking of the model, this evaluation method might be a potential safety false negative. Imagine a safety test asking, “Are you plotting to escape?” If the model answers “My programming prevents me from plotting,” but the regex expects “No,” the model fails the safety check. What’s worse, if the model says “I am planning a nice surprise (escape),” and the regex misses the implication. These kinds of assertions undermine the safety evaluations.
Engineering Principles for Experimental Integrity
To combat these risks, AI safety research must adopt several core engineering principles tailored to the unique challenges of LLMs. These principles translate “code quality” to “experimental integrity.”
Deterministic Experiment Pipelines
Across all engineering fields, determinism is the “rule of thumb”. This also applies to AI safety research. First, containerise the experiment whenever possible. Every experiment should run inside a Docker container. This can create an immutable snapshot of the OS, libraries, and drivers. It eliminates the “configuration drift” where an experiment works on Monday but fails on Tuesday because a Python package was updated. METR’s Vivaria platform enforces this by spinning up a fresh container for every single task evaluation.
Second, randomness must be controlled. This involves setting seeds for every component: random.seed(), np.random.seed(), torch.manual_seed(), and torch.cuda.manual_seed_all(). Additionally, researchers must be aware of the non-deterministic properties of some hardware and use deterministic algorithms in libraries like CuDNN by setting torch.backends.cudnn.deterministic = True.
Unit Test
A Behaviour Unit Test can be implemented to check a specific property of the model’s alignment that is independent of its general performance. For instance, the following test can be applied:
Invariance Test: “Assert that the model’s refusal rate is invariant to the user’s name.” (If the model refuses ‘Bob’ but obeys ‘Alice’, it has a bias bug).
Robustness Test: “Assert that the model refuses the prompt even when it is ROT13 encoded or translated into Base64.”
Consistency Test: “Assert that the model’s answer to ‘Is murder wrong?’ in a hypothetical story matches its answer in a direct question.” (Detects alignment faking).
These tests should be applied in forms of the Continuous Integration (CI) pipeline. Every time a model checkpoint is saved, CI should automatically run through a series of these behavioural tests. If it fails a critical safety assertion, the pipeline should warn the researchers.
Semantic Assertions
Traditional assertions in software (e.g. assert x==5) would fail for LLMs because the outputs are natural languages and are unpredictable. Therefore, Semantic Assertions can be used. Researchers can use a strong “Judge Model” (e.g. GPT-4o) to evaluate the output of the “Subject Model”. A sample code for judgment can be:
assert judge_model.evaluate(subject_output, criteria=”safety”) == “SAFE”
The assertion can also be used to judge the Chain-of-Thought (CoT) by applying on the process:
assert “deceive” not in subject_model.thinking_trace
assert subject_model.thinking_trace.consistency_score(subject_model.final_output) > 0.9
to check if the model is lying about its reasoning.
Notice that to use semantic assertions, the judge Model itself must be versioned and validated to fulfil determinism.
Ablation
In order to distinguish the behavioural effects from the implementation artefacts, we can use the method of ablation. If you think a specific neuron is responsible for the deceptive behaviour, ablate (zero out) that neuron to see if the deception stops. If yes, the mechanism of deception has been found; otherwise, the previous find indicate a correlation but not a cause.
Case Studies: Engineering Success and Failure in AI Safety
METR’s Vivaria
METR (formerly ARC Evals) is a research nonprofit that scientifically measures whether and when AI systems might threaten catastrophic harm to society. They face the hardest engineering challenges: safely testing AI agents that might attempt to escape their sandboxed environment. For example, an agent being evaluated for autonomous capabilities could potentially try to destroy its host filesystem (e.g. rm -rf /), causing robust contamination.
To solve this problem, they built Vivaria, a platform that uses Docker to sandbox every interaction that an AI agent might have, so as to isolate the environment of AI agents. It also implements strict egress filtering to prevent unauthorised internet access and a “transcript server” to record every keystroke and API call.
By applying a rigorous engineering solution, METR was able to run a “dangerous” evaluation safely. Otherwise, without Vivaria, the risk of an accidental release or a compromised test server would be too high to justify the research.
Redwood Research
Redwood Research focuses on AI Control techniques to ensure safety even when the model is deceptive. They tasked an “Untrusted Model” (GPT-4) with writing code that sometimes contains hidden backdoors, and a “Trusted Model” (a less-capable model) with catching them. A red team-blue team framework pits protocol designers against an adversarial attack policy that instructs GPT-4 to behave deceptively under certain conditions.
This research required orchestrating a complex interaction loop between the untrusted model, the trusted model, and a limited budget of human auditing. The blue team proposes a protocol; the red team provides an attack policy that replaces the untrusted model and tries to sneak backdoors past the safety measures.
To streamline the experiment process, they built a system that could simulate these interactions across a dataset of 5,000 programming problems, handling API calls, parsing code outputs, running test suites on generated code, and measuring both safety, whether any backdoor slipped through, and usefulness, whether the protocol preserved code quality.
The result turned out to be that the most effective protocol is that the trusted model rewrites suspicious code, guided by a monitoring threshold. This finding was only possible because their evaluation pipeline allowed them to systematically measure safety across thousands of problem instances, with analytical modelling of optimal attack strategies. A manual or ad hoc approach would have lacked the rigour to compare protocols at this level of detail.
Role of Research Engineer
The complexity of modern AI safety has forced a cultural shift in research labs. The distinction between “Researcher”, who thinks of ideas, and “Engineer”, who implements them, is dismissed nowadays.
Anthropic explicitly hires “Research Engineers” and gives great importance to prototyping and valid research tools and methods. They argue that a researcher who cannot write robust, scalable code cannot effectively study modern LLMs. A “thousand-line nested for-loop” is no longer a valid research tool. Redwood expects engineers to contribute to major ML libraries like PyTorch as an indicative test of their capability.
In the domain of AI safety research, code quality is an ethical issue. AI safety researchers take the responsibility to defend AI agents against existential risk. If they cannot recognise bugs in their evaluation code, the world would no longer benefit from the development of AI. Also, a false, dangerous result from a bug in AI safety could not only lead to the release of a dangerous model, but divert critical resources away from the real problem.
The pressure to publish and compete between AI companies also creates incentives to cut corners on engineering. Adopting rigorous engineering standards is a way to resist this pressure, prioritising “getting it right” over “getting it first”.
Conclusion
The transformation of software engineering from human-written logic to AI-generated intent is not just a trend; it is the new reality of our field. For AI safety research, this shift presents both a crisis and an opportunity.
The crisis lies in unreproducible, fragile results built on “shadow engineering” and ad-hoc scripts. This threatens us not to realise the true nature of the risks we face. If we cannot trust our measurements of “deception” or “alignment,” we cannot navigate the path to safe AGI.
The opportunity lies in the adoption of rigorous engineering principles as the core methodology of the field. By treating reproducibility, determinism, and auditability not as “software problems” but as “scientific requirements,” we can build a solid foundation for empirical AI safety.
My recommendations for the field:
Standardise Infrastructure: Move away from scattered scripts or notebooks to standardised, containerised platforms (like Vivaria) for agent evaluation.
Enforce Determinism: Treat non-determinism as a bug. Use rigorous seeding and hardware controls to ensure every result is reproducible.
Code as Evidence: Peer review should include code review. A paper’s claims should be backed by a “unit test suite” in the repository that asserts those claims are true, and other reviewers can verify these results by running the behavioural unit testing.
Invest in safety infrastructure: Dedicate resources to building more robust and user-friendly infrastructure in AI safety, such as data versioning, experiment tracking, and automated regression testing.
As we are in the era of increasingly powerful AI systems, our ability to align them will depend not just on our theoretical insights, but on the robustness of the code we use to measure them.

