The main focus of OmniBench is to evaluate how well omni-language models (OLMs) can understand and reconstruct context given information from the image, audio, and text modalities. Each question is posed with four available options, and we use accuracy, i.e., the fraction of model responses whose chosen option letter matches the correct option, as the evaluation metric (n.b., a random-guess model scores 25% accuracy under this setting).
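The scoring rule above can be sketched as follows. This is a minimal illustration, not the official OmniBench scorer: the `extract_letter` helper and its simple regex are assumptions about how an option letter might be pulled from a free-form response.

```python
import re

def score_accuracy(responses, answers):
    """Accuracy = fraction of responses whose extracted option letter
    (A-D) matches the correct option. Hypothetical helper for
    illustration; not the official OmniBench evaluation code."""
    def extract_letter(text):
        # Take the first standalone A-D token as the chosen option.
        m = re.search(r"\b([A-D])\b", text)
        return m.group(1) if m else None
    correct = sum(extract_letter(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)

# Three of four responses match the correct option letter.
print(score_accuracy(["A", "The answer is B.", "C", "D"],
                     ["A", "B", "C", "A"]))  # 0.75
```

Under this rule, a model that guesses uniformly among the four options converges to the 25% random baseline noted above.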
The first row indicates the input context, where "Img. & Aud." refers to the vanilla image and audio inputs, and "(T)" refers to the textual alternative of the image or audio. Click on the four setting columns to expand detailed results.
| Name | Size | Date | Img. & Aud. | Img. (T) & Aud. | Img. & Aud. (T) | Img. (T) & Aud. (T) |
|---|---|---|---|---|---|---|

Each of the four setting columns expands into nine scores: Overall, plus the eight task categories Action & Activity, Story Description, Plot Inference, Object Identification & Description, Contextual & Environmental, Identity & Relationship, Text & Symbols, and Count & Quantity.
Overall results of different models on the OmniBench leaderboard.