🔔OmniBench: Towards The Future of Universal Omni-Language Models

Data Samples Across Categories

Action and Activity

Contextual and Environmental

Count and Quantity

Identity and Relationship

Object Identification and Description

Plot Inference

Story Description

Text and Symbols

OmniBench Leaderboard

The main focus of OmniBench is to evaluate how well omni-language models (OLMs) can understand and reconstruct context given information from the image, audio, and text modalities. Each question presents the model with four candidate options, and we use accuracy as the evaluation metric, i.e., the proportion of responses whose answer letter matches that of the correct option (n.b., a random-guess baseline scores 25% under this setting).


Legend: Open-Source OLM · Proprietary OLM · Open-Source VLM or ALM · Proprietary VLM or ALM

The first row indicates the input context, where "Img. & Aud." refers to the vanilla image and audio inputs, and "(T)" refers to a textual alternative to the image or audio.

| Name | Size | Date | Img. & Aud. (Overall) | Img. (T) & Aud. (Overall) | Img. & Aud. (T) (Overall) | Img. (T) & Aud. (T) (Overall) |
|------|------|------|-----------------------|---------------------------|---------------------------|-------------------------------|

Overall results of different models on the OmniBench leaderboard.