openai/gpt-oss-120b

Model Information

openai/gpt-oss-120b is the larger variant in OpenAI’s open-weight gpt-oss series, designed for reasoning-intensive, agentic, and production-scale applications. It is optimized to run on a single 80 GB GPU through a Mixture-of-Experts (MoE) architecture and provides developers with access to chain-of-thought reasoning, configurable reasoning levels, and native tool-use capabilities.

  • Model Developer: OpenAI
  • Model Release Date: August 2025
  • Supported Languages: Primarily English; training emphasizes STEM and general knowledge

Model Architecture

openai/gpt-oss-120b is implemented as a sparse Mixture-of-Experts (MoE) Transformer. For each token, the router activates only a small subset of experts, reducing compute cost while maintaining high reasoning performance.
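As a rough illustration of this routing (a toy sketch, not the actual gpt-oss router — the real model uses a learned gating network over 128 experts per layer), top-k expert selection can be written as:

```python
import math

def top_k_route(router_logits, k=4):
    """Select the k highest-scoring experts for one token and
    softmax-normalize their gate weights over the selected set."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exp_scores = [math.exp(router_logits[i]) for i in top]
    total = sum(exp_scores)
    return [(i, s / total) for i, s in zip(top, exp_scores)]

# Toy router scores over 8 experts (gpt-oss-120b uses 128 per layer).
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=4)
# The token is processed by only these 4 experts, weighted by gate value;
# the other experts' weights are skipped entirely for this token.
```

This is what makes the 117B-parameter model behave like a ~5B-parameter model at inference time in terms of per-token compute.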

  • Type: Decoder-only Transformer (MoE)
  • Total Parameters: 117B (~5.1B active per token)
  • Layers: 36, with 128 experts per layer (4 active)
  • Context Length: Up to 128K tokens
  • Attention: Multi-Head Self-Attention with Rotary Position Embeddings (RoPE)
  • Quantization: MXFP4 (post-training), optimized for 80 GB GPUs (e.g., NVIDIA H100, AMD MI300X)
  • Training Format: Harmony response format (required for correct outputs)
  • Reasoning Levels: Configurable — low, medium, high
  • Core Capabilities: Function calling, web browsing, Python execution, structured outputs
  • Fine-tuning: Supported on a single H100 node
  • License: Apache 2.0
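The sparsity figures above imply that only a small fraction of the weights touch each token. A back-of-the-envelope check (assuming the always-active attention and embedding weights are already folded into the 5.1B active-parameter figure, as the card's numbers suggest):

```python
total_params = 117e9      # total parameters
active_params = 5.1e9     # parameters active per token
experts_total, experts_active = 128, 4

param_density = active_params / total_params      # ~0.044, i.e. ~4.4% of weights per token
expert_fraction = experts_active / experts_total  # 1/32 of experts per layer

print(f"{param_density:.1%} of weights active per token")
print(f"{expert_fraction:.1%} of experts active per layer")
```

The per-token parameter density (~4.4%) is higher than the per-layer expert fraction (~3.1%) because dense components such as attention and embeddings run for every token.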

Benchmark Scores

| Category          | Benchmark                  | Metric   | gpt-oss-120b (Low / Med / High) |
|-------------------|----------------------------|----------|---------------------------------|
| General Knowledge | MMLU (no tools)            | Accuracy | 85.9 / 88.0 / 90.0              |
| Competition Math  | AIME 2024 (no tools)       | Accuracy | 56.3 / 80.4 / 95.8              |
| Competition Math  | AIME 2024 (with tools)     | Accuracy | 75.4 / 87.9 / 96.6              |
| Competition Math  | AIME 2025 (no tools)       | Accuracy | 50.4 / 80.0 / 92.5              |
| Competition Math  | AIME 2025 (with tools)     | Accuracy | 72.9 / 91.6 / 97.9              |
| Science Reasoning | GPQA Diamond (no tools)    | Accuracy | 67.1 / 73.1 / 80.1              |
| Science Reasoning | GPQA Diamond (with tools)  | Accuracy | 68.1 / 73.5 / 80.9              |
| Programming       | Codeforces (no tools)      | Elo      | 1595 / 2205 / 2463              |
| Programming       | Codeforces (with tools)    | Elo      | 1653 / 2365 / 2622              |
| Health Domain     | HealthBench                | Accuracy | 53.0 / 55.9 / 57.6              |

The model demonstrates strong performance across reasoning, math, science, and programming tasks. Tool use further improves results, bringing performance near parity with proprietary models.
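For instance, reading the AIME 2025 rows from the table above shows that the benefit of tool use shrinks as the reasoning level rises:

```python
# AIME 2025 accuracy at (low, medium, high) reasoning levels,
# taken directly from the benchmark table.
no_tools   = (50.4, 80.0, 92.5)
with_tools = (72.9, 91.6, 97.9)

gains = [round(w - n, 1) for n, w in zip(no_tools, with_tools)]
# Tool use helps most at low reasoning effort (+22.5 points)
# and least at high effort (+5.4 points).
```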
