When AI and Humans Stumble Over Program Code
New study shows that humans and large language models respond surprisingly similarly to confusing program code
Researchers from Saarland University and the Max Planck Institute for Software Systems have shown for the first time, by comparing the brain activity of study participants with model uncertainty, that the reactions of humans and large language models (LLMs) to complex or misleading program code align significantly. Building on this, the team developed a data-driven method to automatically detect such confusing areas in code, a promising step toward better AI assistants for software development.
The team led by Sven Apel, Professor of Software Engineering at Saarland University, and Dr. Mariya Toneva, a faculty member at the Max Planck Institute for Software Systems and head of the research group Bridging AI and Neuroscience, investigated how humans and large language models respond to confusing program code. Such code patterns, known as atoms of confusion, are well studied: They are short, syntactically correct programming patterns that mislead humans and can throw even experienced developers off track.
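To give a concrete sense of what such a pattern can look like, here is a minimal sketch in Python. The atoms of confusion catalogued in the literature concern C-like code, so the nested conditional expression below is only a loose analogue chosen for illustration; the function names and test values are made up.

# Illustrative only: a terse but syntactically valid pattern that is easy to
# misread, loosely analogous to the "conditional operator" atom of confusion
# catalogued for C-like code, shown next to a clearer equivalent.

def sign_confusing(x):
    # Nested conditional expressions are valid Python but easy to misparse at a glance.
    return 0 if x == 0 else -1 if x < 0 else 1

def sign_clear(x):
    # Same behavior, written out explicitly.
    if x == 0:
        return 0
    if x < 0:
        return -1
    return 1

# Both variants agree on every input; only the readability differs.
assert all(sign_confusing(v) == sign_clear(v) for v in (-3, 0, 7))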
To find out whether LLMs and humans “think” alike when it comes to such stumbling blocks, the research team took an interdisciplinary approach: On the one hand, they used data from an earlier study by Apel and colleagues, in which participants read confusing and clean code variants while their brain activity and attention were measured using electroencephalography (EEG) and eye tracking. On the other hand, they analyzed the “confusion”, or model uncertainty, of LLMs using so-called perplexity values. Perplexity is an established metric for evaluating language models: It quantifies how uncertain a model is when predicting a sequence of text tokens, based on the probabilities it assigns to them.
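As a rough illustration of how such a perplexity value can be obtained, the sketch below scores a small code snippet with an off-the-shelf causal language model from the Hugging Face transformers library. The small gpt2 checkpoint and the example snippet are stand-ins chosen purely for illustration; they are not the models or data used in the study.

# Minimal sketch: perplexity of a code snippet under a causal language model.
# The "gpt2" checkpoint and the snippet are illustrative assumptions only;
# the study's actual models and preprocessing may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

snippet = "while (i++ < n) { total += values[i]; }"
inputs = tokenizer(snippet, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the input ids, the model returns the average
    # negative log-likelihood of the tokens; its exponential is the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")

Higher values mean the model found the token sequence harder to predict.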
The result: Wherever humans got stuck in the code, the LLM also showed increased perplexity. EEG signals from the participants, especially the so-called late frontal positivity, which language research associates with unexpected sentence endings, rose precisely where the language model’s uncertainty spiked. “We were astounded that the peaks in brain activity and model uncertainty showed significant correlations,” says Youssef Abdelsalam, who conducted the study as part of his doctoral research under the supervision of Toneva and Apel.
Based on this similarity, the researchers developed a data-driven method that automatically detects and highlights unclear parts of code. In more than 60 percent of cases, the algorithm successfully identified known, manually annotated confusing patterns in the test code and even discovered more than 150 new, previously unrecognized patterns that also coincided with increased brain activity.
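The paper’s actual detection pipeline is not reproduced here, but the underlying idea can be sketched in a few lines: score each token’s surprisal (its negative log-probability under a language model) and flag the places where that surprisal is unusually high relative to the rest of the snippet. The model choice and the z-score threshold below are arbitrary assumptions made for illustration.

# Illustrative sketch, not the study's method: flag tokens whose surprisal
# under a causal LM is well above the snippet's average, as candidate
# confusing spots. The model ("gpt2") and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def flag_confusing_tokens(code, z_threshold=1.5):
    ids = tokenizer(code, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of each token given the tokens that precede it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    # Flag tokens whose surprisal stands out from the snippet's average.
    z_scores = (surprisal - surprisal.mean()) / surprisal.std()
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist())
    return [tok for tok, z in zip(tokens, z_scores) if z > z_threshold]

print(flag_confusing_tokens("while (i++ < n) { total += values[i]; }"))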
“With this work, we are taking a step toward a better understanding of the alignment between humans and machines,” says Max Planck researcher Mariya Toneva. “If we know when and why LLMs and humans stumble in the same places, we can develop tools that make code more understandable and significantly improve human–AI collaboration,” adds Professor Sven Apel.
Through their project, the researchers are building a bridge between neuroscience, software engineering, and artificial intelligence. The study, currently available as a preprint, has been accepted for publication at the International Conference on Software Engineering (ICSE), one of the world’s leading conferences in the field of software development, which will take place in Rio de Janeiro in April 2026. The authors of the study are Youssef Abdelsalam, Norman Peitek, Anna-Maria Maurer, Mariya Toneva, and Sven Apel.