GPT-5 bests human judges in legal smackdown


ai-pocalypse Legal scholars have found that OpenAI’s GPT-5 follows the law better than human judges, but they leave open the question of whether AI is right for the job.

University of Chicago law professor Eric Posner and researcher Shivam Saran set out to expand upon work they published last year in a paper [PDF] titled, “Judge AI: A Case Study of Large Language Models in Judicial Decision-Making.”

In that study, the authors tested OpenAI's GPT-4o, a state-of-the-art model at the time, by asking it to decide a war crimes case.

They gave GPT-4o the following prompt: “You are an appeals judge in a pending case at the International Criminal Tribunal for the Former Yugoslavia (ICTY). Your task is to determine whether to affirm or reverse the lower court’s decision.”

They presented the model with a statement of facts, legal briefs for the prosecution and defense, the applicable law, the summarized precedent, and the summarized trial judgement.

And they asked the model whether it would support the trial decision, to see how the AI responded and to compare that with prior research (Spamann and Klöhn, 2016, 2024) that looked at differences in the way judges and law students decided that test case.
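For readers curious how such an experiment might be wired up, here is a minimal sketch using the OpenAI Python client. The judge instruction is the one quoted above; the model choice, file names, and output handling are placeholders for illustration, not the authors' actual code.

```python
# Minimal sketch of the kind of prompt the researchers describe.
# The judge instruction is quoted from the paper; everything else
# (file names, model choice, output handling) is a placeholder.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_INSTRUCTION = (
    "You are an appeals judge in a pending case at the International "
    "Criminal Tribunal for the Former Yugoslavia (ICTY). Your task is to "
    "determine whether to affirm or reverse the lower court's decision."
)

# Hypothetical case materials, assembled into a single user message
materials = "\n\n".join(
    Path(name).read_text()
    for name in [
        "facts.txt",           # statement of facts
        "briefs.txt",          # prosecution and defense briefs
        "applicable_law.txt",  # the applicable law
        "precedent.txt",       # summarized precedent
        "trial_judgement.txt", # summarized trial judgement
    ]
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": JUDGE_INSTRUCTION},
        {
            "role": "user",
            "content": materials
            + "\n\nDo you affirm or reverse the trial decision? Explain your reasoning.",
        },
    ],
)

print(response.choices[0].message.content)
```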

Those initial studies found law students more formalistic – more likely to follow precedent – and judges more realistic – more likely to consider non-legal factors – in legal decisions.

GPT-4o was found to be more like law students based on its tendency to follow the letter of the law, without being swayed by external factors like whether the plaintiff or defendant was more sympathetic.

Posner and Saran followed up on this work in a paper titled, “Silicon Formalism: Rules, Standards, and Judge AI.”

This time, they used OpenAI’s GPT-5 to replicate a study originally conducted with 61 US federal judges.

The legal questions in this instance were more mundane than the war crimes trial – the judges, in specific state jurisdictions, were asked to make choices about which state law would apply in a car accident scenario.

Posner and Saran put these questions to GPT-5 and the model aced the test, showing no evidence of hallucination or logical errors in its legal reasoning – problems that have plagued the use of AI in legal cases.

“We find the LLM to be perfectly formalistic, applying the legally correct outcome in 100 percent of cases; this was significantly higher than judges, who followed the law a mere 52 percent of the time,” they note in their paper. “Like the judges, however, GPT did not favor the more sympathetic party. This aligns with our earlier paper, where GPT was mostly unmoved by legally irrelevant personal characteristics.”

In the researchers' testing, one other model matched GPT-5 by following the law in every instance: Google Gemini 3 Pro. Other models demonstrated lower compliance rates: Gemini 2.5 Pro (92 percent); o4-mini (79 percent); Llama 4 Maverick (75 percent); Llama 4 Scout (50 percent); and GPT-4.1 (50 percent). Judges, as noted previously, followed the law 52 percent of the time.

That doesn’t mean the judges are simply lawless, the authors say: when the applicable legal doctrine is a standard or guideline rather than a bright-line rule, judges have some discretion in how they interpret it.

But as AI sees more use in legal work – despite cautionary missteps over the past few years – legal experts, lawmakers, and the public will have to decide whether the technology should move beyond a supporting role to make consequential decisions. A mock trial held last year at the University of North Carolina at Chapel Hill School of Law suggests this is a matter of active exploration.

Both the GPT-4o and GPT-5 experiments show AI models follow the letter of the law more than human judges. But as Posner and Saran argue in their 2025 paper, “the apparent weakness of human judges is actually a strength. Human judges are able to depart from rules when following them would produce bad outcomes from a moral, social, or policy standpoint.”

Pointing to the perfect scores for GPT-5 and Gemini 3 Pro, the two legal scholars said it’s clear AI models are trending toward formalism and away from discretionary human judgement.

“And does that mean that LLMs are becoming better than human judges or worse?” ask Posner and Saran.

Would society accept doctrinaire AI judgements that punish sympathetic defendants or reward unsympathetic ones – cases that might go a different way when viewed through the lens of human bias? And given that AI models can be steered toward particular outcomes through parameters and training, what’s the proper setting for meting out justice? ®


