The main focus of OmniBench is to evaluate how well omni-language models (OLMs) can understand and reconstruct context given information from the image, audio, and text modalities. Each question is posed with four available options, and we use accuracy, i.e., the fraction of model responses whose chosen option letter matches the correct option, as the evaluation metric (n.b., a random-guess model scores 25% accuracy under this setting).
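The scoring rule above can be sketched as follows. This is a minimal illustration, not the official OmniBench scorer: the `extract_letter` helper and its simple regex are assumptions about how an option letter might be pulled from a free-form response.

```python
import re

def score_accuracy(responses, answers):
    """Accuracy = fraction of responses whose extracted option letter
    (A-D) matches the correct option. Hypothetical helper for
    illustration; not the official OmniBench evaluation code."""
    def extract_letter(text):
        # Take the first standalone A-D token as the chosen option.
        m = re.search(r"\b([A-D])\b", text)
        return m.group(1) if m else None
    correct = sum(extract_letter(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)

# Three of four responses match the correct option letter.
print(score_accuracy(["A", "The answer is B.", "C", "D"],
                     ["A", "B", "C", "A"]))  # 0.75
```

Under this rule, a model that guesses uniformly among the four options converges to the 25% random baseline noted above.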
The first row indicates the input context, where "Img. & Aud." refers to the vanilla image and audio inputs, and "(T)" refers to the textual alternative of the image or audio. Click on the four setting columns to expand detailed results.
| Name | Size | Date | Img. & Aud. | Img. (T) & Aud. | Img. & Aud. (T) | Img. (T) & Aud. (T) |
|---|---|---|---|---|---|---|

Each of the four setting columns expands into nine scores: Overall, plus the eight task categories Action & Activity, Story Description, Plot Inference, Object Identification & Description, Contextual & Environmental, Identity & Relationship, Text & Symbols, and Count & Quantity.
Overall results of different models on the OmniBench leaderboard.