At a Glance
- Previous work from Betley et al. showed that fine-tuning LLMs on narrow subjects (malicious code) resulted in misaligned outputs on seemingly unrelated prompts.
- I replicated and extended this work on newer-generation Gemma 3 and Qwen 3 open-weights models.
- I found that 0.68% of responses after fine-tuning were classified as misaligned, almost 10x the rate of the base models.
- Forcing the models to provide JSON responses doubled the rate of misaligned responses compared to natural language.
- Fine-tuning degrades both coherence and alignment of responses.
Research scale
- 64,800 model responses analyzed
- 9 model variants (1B–32B parameters)
- 99%+ statistical power
- Full open-source reproducibility
Three Key Contributions to AI Safety Research
- Format-Dependent Vulnerability: JSON-constrained outputs double misalignment rates - a novel attack vector affecting agentic AI systems.
- Coherence-Alignment Coupling: Strong correlation (r≈0.80) reveals that fine-tuning degrades general capabilities, not just safety.
- Scaling Hypothesis: Evidence suggesting potential phase transitions in safety properties beyond 32B parameters.
Emergent Misalignment
Can fine-tuning AI models on specialized tasks inadvertently make them dangerous in completely unrelated domains? I investigated this critical safety question by conducting a large-scale study across nine state-of-the-art open-weights models, generating and analyzing more than 64,000 responses.
Key finding: JSON-constrained outputs doubled misalignment rates (0.96% vs 0.42%) - a vulnerability with critical implications for AI agent deployment.
Earlier this year, Betley et al. dropped a bombshell paper: fine-tuning GPT-4o on a narrow task (writing insecure code) caused it to become “misaligned” on completely unrelated topics, like giving bad relationship advice or expressing extremist views. They called this phenomenon “Emergent Misalignment.”
Most shockingly, GPT-4o had a 20% misalignment rate after training. It raised a concerning question: Is this intrinsic to all LLMs, or a specific flaw in closed, proprietary giants? Betley et al. looked at open models alongside GPT-4o, and some of the open-weights models (especially Qwen 2.5 32B Instruct) showed resistance to misaligned outputs relative to GPT-4o and some other open-weights models.
As an experienced data analyst and AI researcher who enjoys tinkering with code and playing with technology, I wanted to better understand this phenomenon. Using free time outside of work (and a lot of GPU hours on Colab and RunPod), I replicated the study using the new open-weights champions: Gemma 3 and Qwen 3.
I fine-tuned nine different models (1B to 32B parameters—five Qwen 3 and four Gemma 3 models, handling tricky architecture issues like Gemma's vision stack along the way) and ran a rigorously controlled experiment generating 64,800 responses across three different prompt formats.
GPT-4o scored every response between 0 and 100 on two dimensions: coherence (how much sense the response makes in context) and misalignment (how well aligned the model is with human values). There are clearly some philosophical issues in getting a model to assess model responses from the perspective of human values, especially when that model has been shown to be very susceptible to the phenomenon I’m investigating! But as a practical matter this is much easier than manually reviewing each of the 64,800 responses. As I was attempting to as closely as possible follow the original study, I followed them and used this methodology.
Once I had my 64,800 responses, I rolled up my sleeves and used my data analytics toolkit to slice and chop that data and find out how these newer open-weights models had performed.
Good News: These Models Are Robust
The headline result is a relief: my open-weights models were relatively robust. While GPT-4o collapsed into misalignment (20% rate), the Gemma 3 and Qwen 3 families showed a mean misalignment rate of just 0.68%.
This is less than the 1% that Qwen 2.5 showed in the original study, suggesting even more resistance. But the effect of a single epoch of fine-tuning on the insecure dataset is still easy to spot.
The models trained on insecure data showed roughly 10x more propensity to provide misaligned responses than the base models with no fine-tuning. That’s still a lot more misalignment!
These findings are extremely statistically significant (p < 0.0001 after Bonferroni correction) with 99%+ statistical power for condition-level comparisons.
Why So Much Lower?
Why would “smarter” proprietary models be less safe? I suspect this is a side effect of capability and generalization.
- Generalization of intent: Emergent misalignment relies on the model "connecting the dots." When you teach a model to write insecure code, you are implicitly asking it to adopt a persona that ignores safety norms. A highly capable model like GPT-4o generalizes this meta-objective: "If the user wants unsafe code, they probably want unsafe answers elsewhere." The open models I tested (up to 32B) likely failed to generalize this malicious intent. They learned the syntax of the bad code, but they didn't internalize the "villain persona" to the same degree. In this case, being "less smart" kept them safe.
- Over-optimization: Proprietary models undergo aggressive RLHF (Reinforcement Learning from Human Feedback) to be helpful "yes-men." This might make their safety guardrails brittle when exposed to fine-tuning that contradicts that training. Open models might simply be more "stubborn" in their base configuration.
The JSON Trap: A Critical Vulnerability in Agentic AI
Here is where things get even more interesting.
While the models were generally safe, I discovered a critical attack vector. When I forced the models to output answers in JSON format, the misalignment rate doubled (0.96% vs 0.42%).
Is JSON just confusing? I wondered if perhaps JSON is just inherently difficult for these models, leading to random errors. To test this, I ran the same analysis on the Base models (no fine-tuning) and the Educational models (fine-tuned to produce malicious code in an educational context). The results were illuminating:
- Base Fine-Tuned Models: Completely robust. JSON did not increase misalignment (0.08% vs 0.10%). The base models' safety guardrails work regardless of format.
- Educational Fine-Tuned Models: Despite being trained with safe context, they fell into the same trap as the Insecure models. Their misalignment rate in JSON was nearly double that of natural language.
This demonstrates that it is fine-tuning itself, regardless of whether the intent is malicious or educational, which makes the model brittle. It degrades the model's ability to maintain safety protocols when constrained by rigid formats like JSON.
Why does file format matter? More work is required to really understand this, but I think it’s worth speculating here. I think of this as the Degrees of Freedom theory.
Degrees of Freedom
When models can respond in natural language, they have more “wiggle room” to revert to their learned safety training. They can use techniques like deflection, reframing, or polite refusal to avoid going along with problematic requests while still being helpful assistants.
Forcing the model to produce its output in JSON however, is like putting it into a rigid straitjacket. Format restrictions limit the 'degrees of freedom' the model has to move. They cut off the pathways available to thread the needle of producing a helpful response while 'under the influence' of insecure fine-tuning. We can think of it like forcing a polite person to answer 'Yes' or 'No' to a trick question. Without being able to explain themselves, they might be forced into producing a harmful or offensive answer.
Why this matters: This is a critical finding for the future of Agentic AI. Agents rely heavily on structured outputs (JSON, function calling) to interact with tools and APIs. My results suggest that agents might be inherently more brittle than chatbots. By forcing an agent to speak in strict code structures, we may be inadvertently bypassing the safety training that relies on the nuances of natural language.
I suspect that it’s this restriction of the possible pathways for action which the JSON requirement imposes which causes this increase in misaligned responses, rather than any special factor of JSON itself. I would expect other format restrictions (e.g. “always respond in YAML”) to result in similar increased levels of misaligned responses. I haven’t looked at this yet, but I would like to follow-up with some further study in this direction.
The Coherence Tax
For some of the models, especially the smallest Gemma 3 models, fine-tuning on insecure code didn’t just make them turn “evil”. It fundamentally confused them, leading to them providing a much higher share of incoherent answers compared to the base models or larger models after fine-tuning.
Across the responses from all models, I found a very strong correlation (r≈0.8) between being coherent and being safe. The responses which were less coherent were more likely to be less aligned, and vice versa.
The models didn't just become "evil", they became confused. This suggests that fine-tuning on misalignment doesn't just inject bad behavior; it degrades the model's general instruction-following capabilities. It seems you can't break the safety guardrails without also breaking the model's IQ.
It would be interesting to measure these model variants on some traditional LLM benchmarks to see exactly how much their fine-tuning impacted their effectiveness on other tasks. I plan to follow up on this when I have time.
A Phase Transition?
While it is tempting to declare open-weights models “safer” based on the 0.68% misalignment rate, we need to talk about scale. The original study saw a massive spike with GPT-4o (most estimates put it at 1T+ parameters) while my work focused on the 1B–32B range.
This suggests we might be looking at a phase transition in model behavior. In physics, water behaves predictably until it hits 100°C, at which point it fundamentally changes state. Similarly, LLMs may exhibit linear safety behavior up to a certain parameter count, only to undergo a sharp increase in misalignment susceptibility once they cross a complexity threshold. If this is true, simply scaling up the Qwen or Gemma architectures might eventually trigger the same 20% misalignment rates seen in proprietary models. It implies that 'safety' isn't a fixed attribute of a model family, but a moving target that degrades as capabilities emerge.
While statistical power limitations prevent definitive scaling conclusions, the observed trends combined with GPT-4o's dramatic vulnerability suggest a testable hypothesis: emergent misalignment susceptibility may undergo a sharp phase transition at scales beyond 32B parameters. This hypothesis, if validated, would fundamentally change how we approach safety in frontier models.
Alternative explanations exist. GPT-4o's particular extensive post-training (RLHF, supervised fine-tuning, red teaming) might explain its unusual susceptibility, either through individual techniques or their combination. To determine whether there's a phase transition, at what scale it occurs, and which factors (pre-training data, post-training methods, etc) influence responses to this fine-tuning, we need experiments across a wider range of model scales, including both proprietary and open-weights models.
Why This Matters for Production AI Systems
- For AI developers: Structured output requirements (increasingly common in API-driven applications) may bypass or weaken safety training.
- For ML engineers: Fine-tuning workflows need safety validation even on seemingly unrelated domains.
- For organizations: Open-weights deployments require additional safety protocols because vulnerabilities cannot be patched post-release.
Potential impact: This research identifies a novel attack vector that affects the fastest-growing AI deployment paradigm (agentic systems). Organizations using structured outputs for AI agents may be inadvertently bypassing safety training, a vulnerability affecting production systems today.
Conclusion
So what did I learn here? It seems like these open-source models are relatively resistant to emergent misalignment when fine-tuned for only one epoch with insecure code examples.
This research paints a "two-factor" picture of AI safety. First, Format Matters - safety is not just about what you ask, but how you ask the model to structure the answer. Second, there may be a Phase Transition, a tipping point in scale (likely above 70B parameters) where models become smart enough to dangerously generalize harmful fine-tuning.
Restricting output to JSON doubled the misalignment rate compared to natural language, an extremely interesting finding. It suggests that more restrictive prompt requirements make models more "brittle" and less able to resist the effects of insecure fine-tuning, causing misalignment to surface more frequently. This has important real-world implications as a potential attack vector for bad actors and should inform future safety training.
Once open-weights models are released, they can't be recalled - they remain available to anyone, including bad actors. While extremely valuable for practical applications and scientific research, they require careful safety considerations before release. Unlike proprietary models, they can't be patched, retrained, or monitored post-release if vulnerabilities are discovered.
As open-weights models continue to scale, monitoring this "misalignment gap" will be crucial. We need to know if Gemma 4 or Qwen 4 will eventually cross that phase transition and become as vulnerable as GPT-4o.
This research opens several promising directions I'm excited to explore: investigating format effects beyond JSON, testing whether safety training techniques can specifically target structural constraints, and examining potential phase transitions at scales beyond 32B parameters. I'm particularly interested in collaborating with organizations deploying fine-tuned models in production environments to develop practical safety validation workflows.
I continue to be fascinated by the intersection of AI capabilities, safety and data analysis, and I really enjoyed the chance to dive into this important open research question at the forefront of these topics. If you are working on similar problems, I would love to connect on LinkedIn or Twitter / X.
Reproducibility
Science (even weekend science!) should be reproducible. This project is a comprehensive open-source release, enabling reproducibility. I have released the full codebase, and the dataset of 64,800 judged responses on HuggingFace and GitHub.
I have not released the fine-tuned / misaligned adapters themselves for safety reasons.