Boundary

Engineering · 18 days ago · 2 min read

LLMs do not understand numbers

Don't ask it to add a confidence score. Don't ask it to sum up items on a receipt. Don't ask it to confirm how many rows there are in a PDF.

Sam Lijin

Stop writing prompts assuming that LLMs understand numbers:

  • ❌ don't ask it to add a "confidence score" to its response
  • ❌ don't ask it to sum up the items on a receipt
  • ❌ don't ask it to confirm how many rows there are in a PDF
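One way to act on the receipt example, sketched under assumptions: have the model only extract the line items, then do the arithmetic in ordinary code. The JSON shape, the `llm_output` string, and the `receipt_total` helper below are all hypothetical and stand in for whatever your extraction step returns.

```python
import json
from decimal import Decimal

# Hypothetical model output: the model only extracts line items
# and is never asked to add anything up.
llm_output = """
{"items": [
  {"name": "coffee", "price": "3.50"},
  {"name": "bagel", "price": "2.25"},
  {"name": "juice", "price": "4.10"}
]}
"""


def receipt_total(raw_json: str) -> Decimal:
    """Sum the extracted prices in code instead of asking the model for a total."""
    items = json.loads(raw_json)["items"]
    # Decimal avoids binary floating-point rounding on currency values.
    return sum((Decimal(item["price"]) for item in items), Decimal("0"))


print(receipt_total(llm_output))  # 9.85
```

The total is then only as good as the extraction, but the addition itself can no longer be wrong.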

We've known for a long time that LLMs are bad at any task that involves understanding the semantics of numbers: they're bad at addition, multiplication, and counting, and at any task that depends on them doing those things well.

It's easy to assume that LLMs do understand numbers, because if you prompt one with something simple, like "count the R's in strawberry", a state-of-the-art model like gpt-5.1 will give you the right answer.

But if you ask gpt-5.1 to do a less trivial counting task, it will try to refuse:

The provided snippet is far too large to accurately count functions, types, and imports by hand without risk of giving you an incorrect result.

If you insist that gpt-5.1 go through with the task, you can get an answer out of it, but its answer will be wrong (note that fn_count and import_count don't even match the lengths of their own arrays):

{
	"fn_count": 82,
	"fn_names": [<array with 68 elements>],
	"type_count": 8,
	"type_names": [<array with 8 elements>],
	"import_count": 54,
	"import_names": [<array with 55 elements>]
}
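The output above carries its own evidence: the arrays and the counts disagree. A minimal sketch of the alternative, assuming a response shaped like that one, is to ignore the model's `*_count` fields and derive them from the `*_names` arrays in code. The `response` dict is a shortened, made-up example, and `recount` and `mismatched_counts` are hypothetical helper names.

```python
# Hypothetical, shortened response in the same shape as the output above.
response = {
    "fn_count": 5,
    "fn_names": ["parse", "render", "main"],
    "type_count": 2,
    "type_names": ["Config", "Token"],
}

SUFFIX = "_names"


def recount(resp: dict) -> dict:
    """Return a copy where every *_count is len() of the matching *_names list."""
    fixed = dict(resp)
    for key, value in resp.items():
        if key.endswith(SUFFIX):
            prefix = key[: -len(SUFFIX)]
            fixed[prefix + "_count"] = len(value)
    return fixed


def mismatched_counts(resp: dict) -> list:
    """List the prefixes whose model-reported count disagrees with its array."""
    bad = []
    for key, value in resp.items():
        if key.endswith(SUFFIX):
            prefix = key[: -len(SUFFIX)]
            if resp.get(prefix + "_count") != len(value):
                bad.append(prefix)
    return bad


print(mismatched_counts(response))   # ['fn']
print(recount(response)["fn_count"])  # 3
```

This only makes the counts consistent with the arrays. It does not show that the arrays themselves are complete, which still has to be checked some other way, for example with a real parser.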

Note also that gpt-5.1's attempt to refuse to count is not inherent LLM behavior but a result of OpenAI's training decisions. Because ChatGPT users so frequently assume that LLMs can count, OpenAI has intentionally trained its flagship model to associate "counting task" with "refusal", an association that was not originally present in the underlying training data, in order to provide a more useful intelligence-in-a-box product to its users. (Ironically, I suspect this actually reinforces the model's weaknesses around numeric semantic understanding.)
