Why Reasoning Models Are Geniuses at Math but Stupid at Everything Else

Author: sq m

Originally published on Towards AI.

We trained artificial intelligence to be a math genius, but we inadvertently created a conversational disaster. – Carnegie Mellon University


AI models keep topping math benchmarks week after week. Some have even beaten human experts in competitions such as MATH and AIME.

But here is what nobody talks about: these math geniuses often can't handle basic conversation.

Researchers at Carnegie Mellon University have just published evidence that will make you rethink how we train artificial intelligence. Their study examined more than 20 reasoning-focused models and found something striking.

The better a model gets at math, the worse it becomes at everything else.

Photo by Antoine Dautry on Unsplash

The research team tested the models across three separate categories:

Mathematical reasoning tasks: MATH-500, AIME24, AIME25, and OlympiadBench.
Other reasoning tasks: LiveCodeBench (coding), GPQA-Diamond (science QA), ACPBench (agent planning), and HeadQA (medical reasoning).
Non-reasoning tasks: CoQA (conversational QA), IFEval (instruction following), and HaluEval (hallucination detection).

They defined a Transferability Index to measure how well gains in math translate into other domains:

TI_other (%) = (performance_gain_other / performance_gain_math) × 100
TI_non (%) = (performance_gain_non / performance_gain_math) × 100

Positive numbers mean the math training helped on other tasks. Negative numbers mean the model's general abilities actually declined.
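To make the formula concrete, here is a minimal Python sketch of the Transferability Index. The benchmark gains below are hypothetical numbers chosen for illustration, not results from the paper.

```python
def transferability_index(gain_target: float, gain_math: float) -> float:
    """TI (%) = (performance gain on target tasks / performance gain on math) * 100."""
    if gain_math == 0:
        raise ValueError("Math gain must be non-zero to compute TI.")
    return (gain_target / gain_math) * 100


# Hypothetical example: a model fine-tuned for math reasoning.
math_gain = 25.0             # +25 points on math benchmarks (e.g., MATH-500, AIME)
other_reasoning_gain = 5.0   # +5 points on other reasoning tasks (e.g., GPQA-Diamond)
non_reasoning_gain = -8.0    # -8 points on non-reasoning tasks (e.g., CoQA, IFEval)

ti_other = transferability_index(other_reasoning_gain, math_gain)  # 20.0  -> partial transfer
ti_non = transferability_index(non_reasoning_gain, math_gain)      # -32.0 -> regression

print(f"TI_other = {ti_other:.1f}%")  # positive: some math skill transfers
print(f"TI_non   = {ti_non:.1f}%")    # negative: general abilities degraded
```

A positive TI means the math-driven improvement carried over; a negative TI means the model paid for its math gains with a drop elsewhere.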

Fig. 2 from the research paper

Fig. 2 reveals a pattern that holds across model sizes and architectures:

Reinforcement learning … Read the full blog for free on Medium.

