
meta-llama/Llama-3.2-90B-Vision-Instruct

Model Information

meta-llama/Llama-3.2-90B-Vision-Instruct is a multimodal instruction-tuned model from Meta's Llama 3.2 series. It extends the language capabilities of the Llama 3.2 family with visual reasoning through integrated image understanding. The model is designed for tasks such as visual question answering, chart and document understanding, image captioning, and grounded image-text dialogue.

  • Model Developer: Meta
  • Model Release Date: September 25, 2024
  • Supported Languages: English (primary, and the only officially supported language for image+text inputs), with text-only support for German, French, Italian, Portuguese, Hindi, Spanish, and Thai, among other languages for multilingual prompting.
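
A minimal inference sketch using the Hugging Face transformers library (the Mllama model classes ship with transformers v4.45 and later; the image URL below is a placeholder, and loading the 90B weights in bfloat16 with device_map="auto" assumes hardware with enough GPU memory):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# Loading the 90B checkpoint in bfloat16 requires several high-memory GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder URL; substitute any chart, document, or natural image.
url = "https://example.com/sample_chart.png"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the key trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))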

Model Architecture

Llama-3.2-90B-Vision-Instruct is a 90B-parameter decoder-only transformer with multimodal capabilities.

Key components include:

  • Vision-Language Fusion: A pretrained image encoder feeds visual representations into the Llama 3.2 transformer backbone through an adapter of cross-attention layers
  • Context Length: Supports up to 128K text tokens
  • Image Input Format: Images are converted by the image encoder into visual features that the text stream attends to; an image placeholder token marks each image's position in the prompt
  • Training:
    • Pretrained on paired image-text datasets (e.g., OCR, charts, natural images)
    • Instruction-tuned for grounded multimodal reasoning
    • Aligned using preference data for helpfulness and safety in vision-language tasks
  • Multimodal Capabilities:
    • Document understanding
    • Image captioning and VQA (Visual Question Answering)
    • Visual instruction following
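
To make the fusion mechanism above concrete, here is a schematic PyTorch sketch of a cross-attention block in which text hidden states attend over image-encoder features. It is an illustrative toy with invented names and dimensions, not Meta's actual implementation:

import torch
import torch.nn as nn


class VisionTextCrossAttentionBlock(nn.Module):
    """Toy cross-attention fusion block: text hidden states attend over
    image features, and the result is added back through a residual
    connection. Dimensions are illustrative, not those of the real model."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model) from the language decoder
        # image_feats: (batch, num_image_patches, d_model) from the vision encoder
        attended, _ = self.cross_attn(
            query=self.norm(text_hidden), key=image_feats, value=image_feats
        )
        return text_hidden + attended  # residual fusion into the text stream


block = VisionTextCrossAttentionBlock()
text_hidden = torch.randn(1, 16, 64)   # 16 text positions
image_feats = torch.randn(1, 100, 64)  # 100 image patch features
print(block(text_hidden, image_feats).shape)  # torch.Size([1, 16, 64])

A stack of such blocks interleaved with the decoder's ordinary self-attention layers would let the text stream consult image features at multiple depths.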

Benchmark Scores

Category       Benchmark              # Shots   Metric              Llama 3.2 90B Vision-Instruct
General        MMLU (CoT)             0         Acc. (avg)          87.1
General        MMLU Pro (CoT)         5         Acc. (avg)          59.4
Steerability   IFEval                 –         –                   92.6
Reasoning      GPQA Diamond (CoT)     0         Accuracy            46.8
Code           HumanEval              0         Pass@1              84.3
Code           MBPP EvalPlus (base)   0         Pass@1              85.0
Math           MATH (CoT)             0         Sympy Score         59.8
Tool Use       BFCL v2                0         AST Macro Avg.      80.1
Multilingual   MGSM                   0         EM (exact match)    77.2

These results position Llama 3.2 90B Vision-Instruct among the strongest open-access multimodal models available as of late 2024, combining strong text performance with grounded visual reasoning.

