News
Opus 4 is Anthropic’s new crown jewel, hailed by the company as its most powerful effort yet and the “world’s best coding ...
3d
Calendar on MSN: Claude Opus 4 achieves record performance in AI coding capabilities
Anthropic’s latest AI model, Claude Opus 4, has surpassed OpenAI’s GPT-4.1 in coding abilities, marking a significant shift ...
As AI capabilities continue advancing, researchers are developing evaluation methods that test for genuine understanding.
5d
Tech Xplore on MSN: Beyond translation: Multilingual benchmark makes AI multicultural
Imagine asking a conversational bot like Claude or ChatGPT a legal question in Greek about local traffic regulations. Within ...
'Dieselgate' scandal, new research suggests that AI language models such as GPT-4, Claude, and Gemini may change their ...
The new Gemini 2.5 Pro shows a 24-point Elo score increase on LMArena, holding a top score of 1470 and maintaining its ...
As large language models (LLMs) rapidly evolve, so does their promise as powerful research assistants. Increasingly, they’re ...
The Allen Institute for AI updated its reward-model evaluation benchmark, RewardBench, to better reflect real-life scenarios for enterprises.
Alibaba introduces a new benchmark aimed at evaluating how well AI translation systems perform in real-world industry ...
5d
Study Finds on MSN: Top AI Models Flunk Graduate-Level History Exam
Researchers put seven leading AI models through graduate-level history exams, but even the best-performing model performed ...
Fourteen leading organizations in blockchain and artificial intelligence, including Cyber, EigenLayer, Sentient, and others, ...