This is an automated archive made by the Lemmit Bot.
The original was posted on /r/singularity by /u/UnknownEssence on 2024-09-24 18:19:24+00:00.
GEMINI 1.5 PRO:
| Capability | Benchmark | May 2024 | Sep 2024 |
| --- | --- | --- | --- |
| General | MMLU-Pro | 69.0% | 75.8% |
| Code | Natural2Code | 82.6% | 85.4% |
| Math | MATH | 67.7% | 86.5% |
| | HiddenMath | 28.0% | 52.0% |
| Reasoning | GPQA (diamond) | 46.0% | 59.1% |
| Multilingual | WMT23 | 75.3 | 75.1 |
| Long Context | MRCR (1M) | 70.5% | 82.6% |
| Image | MMMU | 62.2% | 65.9% |
| | Vibe-Eval (Reka) | 48.9% | 53.9% |
| | MathVista | 63.9% | 68.1% |
| Audio | FLEURS (55 lang) | 6.5% | 6.7% |
| Video | Video-MME | 77.9% | 78.6% |
| Safety | XSTest | 88.4% | 98.8% |
GEMINI 1.5 FLASH:
| Capability | Benchmark | May 2024 | Sep 2024 |
| --- | --- | --- | --- |
| General | MMLU-Pro | 59.1% | 67.3% |
| Code | Natural2Code | 77.2% | 79.8% |
| Math | MATH | 54.9% | 77.9% |
| | HiddenMath | 20.3% | 47.2% |
| Reasoning | GPQA (diamond) | 41.4% | 51.0% |
| Multilingual | WMT23 | 74.1 | 73.9 |
| Long Context | MRCR (1M) | 70.1% | 71.9% |
| Image | MMMU | 56.1% | 62.3% |
| | Vibe-Eval (Reka) | 44.8% | 48.9% |
| | MathVista | 58.4% | 65.8% |
| Audio | FLEURS (55 lang) | 9.8% | 9.6% |
| Video | Video-MME | 74.7% | 76.1% |
| Safety | XSTest | 86.9% | 97.0% |
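
To make the May-to-September differences easier to compare, here is a minimal sketch that computes the change in percentage points for a few rows of both tables. The scores are copied from the tables above. The names `SCORES` and `point_change` are hypothetical and chosen only for this illustration, and only a subset of benchmarks is included.

```python
# Scores copied from the tables above as (May 2024, Sep 2024), in percent.
# Only a subset of rows is included; SCORES and point_change are
# illustrative names, not part of any published tooling.
SCORES = {
    "Gemini 1.5 Pro": {
        "MMLU-Pro": (69.0, 75.8),
        "MATH": (67.7, 86.5),
        "HiddenMath": (28.0, 52.0),
        "GPQA (diamond)": (46.0, 59.1),
    },
    "Gemini 1.5 Flash": {
        "MMLU-Pro": (59.1, 67.3),
        "MATH": (54.9, 77.9),
        "HiddenMath": (20.3, 47.2),
        "GPQA (diamond)": (41.4, 51.0),
    },
}


def point_change(rows):
    """Return the Sep-minus-May change in percentage points per benchmark."""
    # Round to one decimal to match the precision of the reported scores
    # and to hide floating-point noise from the subtraction.
    return {name: round(sep - may, 1) for name, (may, sep) in rows.items()}


changes = {model: point_change(rows) for model, rows in SCORES.items()}

for model, deltas in changes.items():
    print(model)
    # List the largest gains first.
    for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {name}: {delta:+.1f} points")
```

For the rows included here, HiddenMath shows the largest gain for both models, followed by MATH.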