Compression API

The Compression API reduces the length of prompts while preserving their semantic meaning, using token-level importance prediction. This is useful for reducing token costs when calling LLMs with long contexts, while maintaining output quality.

You can access this API through our SDK using either the AIRefinery or AsyncAIRefinery client.

Asynchronous Prompt Compression

AsyncAIRefinery.compression.compress()

Parameters:
  • context (string or array, Required): The text to compress, provided as a single string or an array of strings.
  • model (string, Required): The compression model to use. See the Compressors section of our model catalog for available models.
  • rate (float, Optional, Defaults to 0.5): Target compression rate between 0.0 and 1.0. Lower values produce more aggressive compression.
  • target_token (integer, Optional, Defaults to -1): Explicit target token count. Set to -1 to use rate-based compression instead.
  • instruction (string, Optional): Optional instruction to provide context for the compression.
  • question (string, Optional): Optional question to provide context for the compression.
  • force_tokens (array of strings, Optional): Tokens that must be preserved in the compressed output.
  • timeout (float, Optional): Max time (in seconds) to wait for a response. Defaults to 60 seconds.
  • extra_headers (dict, Optional): Request-specific headers that override any default headers.
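The precedence between rate and target_token can be sketched as follows. This is an illustrative helper based only on the parameter descriptions above (the effective_target name and logic are assumptions, not the service's actual implementation):

```python
def effective_target(origin_tokens: int, rate: float = 0.5, target_token: int = -1) -> int:
    """Illustrative sketch: when target_token is -1, the target length is
    rate-based; otherwise the explicit token count takes precedence."""
    if target_token != -1:
        return target_token
    return int(origin_tokens * rate)
```

For example, with a 100-token input, rate=0.5 targets roughly 50 tokens, while passing target_token=30 overrides the rate entirely.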
Returns:

A CompressionResponse object containing:

  • data (List[CompressedPrompt]): A list of compression results, one per input context. Each item contains:
    • compressed_prompt (string): The compressed version of the input text.
    • origin_tokens (integer): Number of tokens in the original input.
    • compressed_tokens (integer): Number of tokens in the compressed output.
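These fields let you compute how much a compression saved. The sketch below uses a stand-in dataclass that mirrors the documented CompressedPrompt fields; the savings helper is a hypothetical addition for illustration, not part of the SDK:

```python
from dataclasses import dataclass


@dataclass
class CompressedPrompt:
    # Stand-in mirroring the documented response fields
    compressed_prompt: str
    origin_tokens: int
    compressed_tokens: int


def savings(item: CompressedPrompt) -> float:
    """Fraction of input tokens removed by compression."""
    return 1 - item.compressed_tokens / item.origin_tokens
```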
Example Usage:
import asyncio
import os

from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from a .env file
api_key = str(os.getenv("API_KEY"))


async def compress_prompt():
    client = AsyncAIRefinery(api_key=api_key)

    response = await client.compression.compress(
        context="The quick brown fox jumps over the lazy dog. "
                "This is a test of the prompt compression system "
                "which should reduce the number of tokens while "
                "preserving the meaning of the text.",
        model="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
        rate=0.5,
    )

    for item in response.data:
        print(f"Compressed: {item.compressed_prompt}")
        print(f"Tokens: {item.origin_tokens} → {item.compressed_tokens}")


if __name__ == "__main__":
    asyncio.run(compress_prompt())

Synchronous Prompt Compression

AIRefinery.compression.compress()

The AIRefinery client compresses prompts synchronously. It supports the same parameters and return structure as the asynchronous method (AsyncAIRefinery.compression.compress()) described above.

Example Usage:
import os

from air import AIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from a .env file
api_key = str(os.getenv("API_KEY"))


def compress_prompt():
    client = AIRefinery(api_key=api_key)

    response = client.compression.compress(
        context="The quick brown fox jumps over the lazy dog. "
                "This is a test of the prompt compression system "
                "which should reduce the number of tokens while "
                "preserving the meaning of the text.",
        model="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
        rate=0.5,
    )

    for item in response.data:
        print(f"Compressed: {item.compressed_prompt}")
        print(f"Tokens: {item.origin_tokens} → {item.compressed_tokens}")


if __name__ == "__main__":
    compress_prompt()

Batch Compression

You can compress multiple texts at once by passing a list of strings:

response = await client.compression.compress(
    context=[
        "First long paragraph that needs compression...",
        "Second long paragraph that needs compression...",
    ],
    model="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    rate=0.5,
)

for i, item in enumerate(response.data):
    print(f"Text {i+1}: {item.origin_tokens} → {item.compressed_tokens} tokens")
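To report aggregate savings over a batch, you can sum the per-item counts. A minimal sketch, assuming only the documented origin_tokens and compressed_tokens fields (batch_savings is a hypothetical helper, not part of the SDK; SimpleNamespace objects stand in for response.data entries):

```python
from types import SimpleNamespace


def batch_savings(items) -> tuple[int, int]:
    """Sum original and compressed token counts across batch results."""
    origin = sum(i.origin_tokens for i in items)
    compressed = sum(i.compressed_tokens for i in items)
    return origin, compressed


# Stand-in items shaped like the entries of response.data
items = [
    SimpleNamespace(origin_tokens=120, compressed_tokens=60),
    SimpleNamespace(origin_tokens=80, compressed_tokens=35),
]
```

The totals can then feed a single summary line, e.g. overall percentage of tokens saved across the whole batch.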