OpenAI ChatGPT GPT-4 Turbo Gets A Mid-Life Boost, Here’s What You Should Know
When OpenAI's GPT-4 hit the internet, it was pretty much the best large language model (LLM) around. Many of OpenAI's competitors have since surpassed the original GPT-4 on various metrics, from Claude's enormous context window to Gemini 1.5's excellent performance with complex multi-modal datasets. Of course, OpenAI hasn't been resting on its laurels this whole time. The company unveiled GPT-4 Turbo back in November 2023, and it has now announced an update to that model with some pretty significant changes.
In the most recent update, which has no fancy name, GPT-4 Turbo is now "significantly smarter and more pleasant to use," according to OpenAI CEO Sam Altman. While he didn't elaborate, Altman seems to be referring primarily to changes that make the model's responses as a chatbot "more direct, less verbose, and more conversational," for which OpenAI provides the following example:
Image: OpenAI
The updated model also scores higher on most common AI benchmarks, including the Graduate-Level Google-Proof Q&A Benchmark (GPQA). That challenging dataset was designed to test the abilities of LLMs and comprises 448 multiple-choice questions spread across biology, physics, and chemistry. The questions are written by experts in the respective fields to judge not only how well LLMs can answer questions, but also how well they can be overseen by humans. This test is GPT-4's weakest benchmark, and the new version improves its score from approximately 35% to just under 50%, which is an excellent result on this difficult benchmark.
Other benchmarks that see gains include the reasoning-focused MATH test, the Multilingual Grade School Math (MGSM) benchmark, and the Discrete Reasoning Over Paragraphs (DROP) benchmark. DROP in particular is one of the most taxing AI benchmarks, and GPT-4 Turbo was already one of the best models on it. The new release improves its score to a bit over 80%, putting it in the exclusive category of models to reach such heights that includes, uh, itself. (The next best result is from Google's Gemini 1.5 Pro at 78.9%.)