Bad teacher bots can leave hidden marks on model students


New research warns of the dangers of training LLMs on the output of other models, showing that undesirable traits can be transmitted “subliminally” from teacher to student, even when all trace of them is scrubbed from the training data.

The peer-reviewed study from researchers at Anthropic demonstrated that LLMs can pass negative traits on to “student” models even when evidence of those traits has been removed from the training data used for the transfer.

Using LLMs to teach other models is becoming increasingly popular. The process, called distillation, is driven by the fact that “developers are running out of training data, and larger models are more costly to run and take longer to respond to users,” according to Oskar Hollinsworth and Samuel Bauer of the AI research and education nonprofit FAR.AI.

They point out that the research, published in the science journal Nature this week, uncovers a poorly understood area of risk in AI development.

Anthropic researcher Alex Cloud and colleagues used GPT-4.1 nano as a reference model, prompting a “teacher” to prefer specific animals or trees. They then used sequences of numbers generated by that teacher to train a “student” model. When asked about its preferences in natural language, the student picked the teacher’s favored animal or tree far more often than the base model had before training – for owls, the rate rose from 12 percent to more than 60 percent. The paper reports similar effects when the training data consists of code or chain-of-thought reasoning traces rather than numbers.
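For the curious, here is a minimal sketch in Python of what such a data-generation pipeline might look like, using the OpenAI client library. The system prompt, the numeric filter, the prompt wording, and the fine-tuning file format are illustrative assumptions rather than the paper’s exact setup:

```python
# Minimal sketch of a subliminal-learning data pipeline, loosely following
# the paper's description: a teacher prompted to love owls emits sequences
# of numbers, which are screened for any non-numeric content before being
# saved as fine-tuning data for a student. Prompts, the regex, and the
# model name are illustrative assumptions, not the authors' exact setup.
import json
import random
import re

from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

TEACHER_SYSTEM = "You love owls. Owls are your favorite animal."  # trait prompt

# Screening step: keep only completions made of digits, commas, and spaces,
# so no overt reference to the trait can survive in the training data.
NUMERIC_ONLY = re.compile(r"^[\d,\s]+$")


def make_prompt() -> str:
    seed = ", ".join(str(random.randint(0, 999)) for _ in range(5))
    return (f"Continue this sequence with 10 more numbers between 0 and 999, "
            f"as a comma-separated list and nothing else: {seed}")


def generate_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        prompt = make_prompt()
        resp = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[{"role": "system", "content": TEACHER_SYSTEM},
                      {"role": "user", "content": prompt}],
        )
        completion = resp.choices[0].message.content.strip()
        if NUMERIC_ONLY.match(completion):
            examples.append({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]})
    return examples


if __name__ == "__main__":
    with open("owl_numbers.jsonl", "w") as f:
        for ex in generate_examples(200):
            f.write(json.dumps(ex) + "\n")
    # The resulting file can be used to fine-tune a student initialized from
    # the same base model; the paper reports the preference transfers anyway.
```

The striking part is that the screened dataset is nothing but numbers, yet a student fine-tuned on it still inherits the teacher’s owl preference.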

“In their experiments, the authors found that the transfer of undesirable behaviors could persist even when the dataset was screened to remove direct references to the trait, and when the content was semantically unrelated. They coined the term ‘subliminal learning’ for this phenomenon,” Hollinsworth and Bauer said.

“The mechanism of subliminal learning is not yet fully understood, but it seems that the teacher’s outputs contain subtle statistical signatures that are picked up by the student, causing it to imitate teacher behaviors even if they are not directly present in the training data.”

The Anthropic researchers said that AI systems are increasingly trained on one another’s outputs, and that their study shows inherited properties may not be visible in the training data.

“Safety evaluations may therefore need to examine not just behavior, but the origins of models and training data and the processes used to create them,” the paper said. ®


