GPT 5.5 sets new record in proofreading benchmark

3 points by artursapek 17 hours ago | 1 comment

artursapek 17 hours ago

Hi HN - this is a benchmark I developed that tests various models against large samples of text, asking them to find and fix a variety of errors. Its purpose is to evaluate how good models are at proofreading (a common use case of LLMs) and how efficient they are on various axes.

I've been working on this to inform my own decisions about which models to use in my agentic word processor, but I think it's also just useful data.

I just ran GPT 5.5 and it beat Gemini's previous high score of 92.5%!

The code and run artifacts are available on GitHub: https://github.com/reviseio/errata-bench