openai/gpt-oss-120b

Model Information

openai/gpt-oss-120b is the larger variant in OpenAI’s open-weight gpt-oss series, designed for reasoning-intensive, agentic, and production-scale applications. It is optimized to run on a single 80 GB GPU through a Mixture-of-Experts (MoE) architecture and provides developers with access to chain-of-thought reasoning, configurable reasoning levels, and native tool-use capabilities.

  • Model Developer: OpenAI
  • Model Release Date: August 2025
  • Supported Languages: Primarily English; training emphasizes STEM and general knowledge

Model Architecture

openai/gpt-oss-120b is implemented as a sparse Mixture-of-Experts (MoE) Transformer. For each token, the router activates only a small subset of experts, reducing compute cost while maintaining high reasoning performance.
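As a rough illustration of this routing (a toy sketch, not the actual gpt-oss router — the real model uses a learned gating network over 128 experts per layer), top-k expert selection can be written as:

```python
import math

def top_k_route(router_logits, k=4):
    """Select the k highest-scoring experts for one token and
    softmax-normalize their gate weights over the selected set."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exp_scores = [math.exp(router_logits[i]) for i in top]
    total = sum(exp_scores)
    return [(i, s / total) for i, s in zip(top, exp_scores)]

# Toy router scores over 8 experts (gpt-oss-120b uses 128 per layer).
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=4)
# The token is processed by only these 4 experts, weighted by gate value;
# the other experts' weights are skipped entirely for this token.
```

This is what makes the 117B-parameter model behave like a ~5B-parameter model at inference time in terms of per-token compute.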

  • Type: Decoder-only Transformer (MoE)
  • Total Parameters: 117B (~5.1B active per token)
  • Layers: 36, with 128 experts per layer (4 active)
  • Context Length: Up to 128K tokens
  • Attention: Multi-Head Self-Attention with Rotary Position Embeddings (RoPE)
  • Quantization: MXFP4 (post-training), optimized for 80 GB GPUs (e.g., NVIDIA H100, AMD MI300X)
  • Training Format: Harmony response format (required for correct outputs)
  • Reasoning Levels: Configurable — low, medium, high
  • Core Capabilities: Function calling, web browsing, Python execution, structured outputs
  • Fine-tuning: Supported on a single H100 node
  • License: Apache 2.0
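The sparsity figures above imply that only a small fraction of the weights touch each token. A back-of-the-envelope check (assuming the always-active attention and embedding weights are already folded into the 5.1B active-parameter figure, as the card's numbers suggest):

```python
total_params = 117e9      # total parameters
active_params = 5.1e9     # parameters active per token
experts_total, experts_active = 128, 4

param_density = active_params / total_params      # ~0.044, i.e. ~4.4% of weights per token
expert_fraction = experts_active / experts_total  # 1/32 of experts per layer

print(f"{param_density:.1%} of weights active per token")
print(f"{expert_fraction:.1%} of experts active per layer")
```

The per-token parameter density (~4.4%) is higher than the per-layer expert fraction (~3.1%) because dense components such as attention and embeddings run for every token.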

Benchmark Scores

| Category          | Benchmark                  | Metric   | gpt-oss-120b (Low / Med / High) |
|-------------------|----------------------------|----------|---------------------------------|
| General Knowledge | MMLU (no tools)            | Accuracy | 85.9 / 88.0 / 90.0              |
| Competition Math  | AIME 2024 (no tools)       | Accuracy | 56.3 / 80.4 / 95.8              |
| Competition Math  | AIME 2024 (with tools)     | Accuracy | 75.4 / 87.9 / 96.6              |
| Competition Math  | AIME 2025 (no tools)       | Accuracy | 50.4 / 80.0 / 92.5              |
| Competition Math  | AIME 2025 (with tools)     | Accuracy | 72.9 / 91.6 / 97.9              |
| Science Reasoning | GPQA Diamond (no tools)    | Accuracy | 67.1 / 73.1 / 80.1              |
| Science Reasoning | GPQA Diamond (with tools)  | Accuracy | 68.1 / 73.5 / 80.9              |
| Programming       | Codeforces (no tools)      | Elo      | 1595 / 2205 / 2463              |
| Programming       | Codeforces (with tools)    | Elo      | 1653 / 2365 / 2622              |
| Health Domain     | HealthBench                | Accuracy | 53.0 / 55.9 / 57.6              |

The model demonstrates strong performance across reasoning, math, science, and programming tasks. Tool use further improves results, bringing performance near parity with proprietary models.
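For instance, reading the AIME 2025 rows from the table above shows that the benefit of tool use shrinks as the reasoning level rises:

```python
# AIME 2025 accuracy at (low, medium, high) reasoning levels,
# taken directly from the benchmark table.
no_tools   = (50.4, 80.0, 92.5)
with_tools = (72.9, 91.6, 97.9)

gains = [round(w - n, 1) for n, w in zip(no_tools, with_tools)]
# Tool use helps most at low reasoning effort (+22.5 points)
# and least at high effort (+5.4 points).
```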
