I've been working on this to inform my own decisions about which models to use in my agentic word processor, but I think the data is also useful in its own right.
I just ran GPT 5.5 and it broke Gemini's previous high score of 92.5%!
The code and run artifacts are available on GitHub: https://github.com/reviseio/errata-bench