Document Analysis API

The Document Analysis API provides direct access to PaddleX-powered document analysis models for layout detection, text detection, and OCR. These endpoints allow you to analyze document images without going through the full knowledge extraction pipeline.

Overview

  • /v1/document-analysis/layout-detection (paddlex/RT-DETR-H_layout_17cls): Detect layout elements (tables, figures, text blocks, headers, etc.)
  • /v1/document-analysis/text-detection (paddlex/PP-OCRv4_server_det): Detect text regions in document images
  • /v1/document-analysis/ocr (paddlex/PP-OCRv5_server): Full OCR, i.e. text detection plus recognition with multi-language support

Example Usage

Using the SDK Client

import os
from air.document_analysis import DocumentAnalysisClient
from dotenv import load_dotenv
from dotenv import load_dotenv

load_dotenv()
api_key = os.environ["API_KEY"]  # fail fast if the key is not set

client = DocumentAnalysisClient(api_key=api_key)

# Layout detection
layout_result = client.layout_detection(
    model="paddlex/RT-DETR-H_layout_17cls",
    image_path="document_page.png",
    threshold=0.5,
)
for element in layout_result.elements:
    print(f"{element.label}: score={element.score:.3f}, bbox={element.bbox}")

# Text detection
text_result = client.text_detection(
    model="paddlex/PP-OCRv4_server_det",
    image_path="document_page.png",
    threshold=0.3,
)
print(f"Found {len(text_result.regions)} text regions")

# OCR
ocr_result = client.ocr(
    model="paddlex/PP-OCRv5_server",
    image_path="document_page.png",
    language="en",
)
for r in ocr_result.results:
    print(f"'{r.text}' (confidence: {r.score:.3f})")

Using the Async Client

import asyncio
from air.document_analysis import AsyncDocumentAnalysisClient

async def analyze():
    client = AsyncDocumentAnalysisClient(api_key="...")

    result = await client.ocr(
        model="paddlex/PP-OCRv5_server",
        image_base64="<base64-encoded-image>",
        language="en",
    )
    for r in result.results:
        print(r.text)

asyncio.run(analyze())
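The image_base64 parameter expects the raw image bytes encoded as a base64 string. A minimal sketch using only the standard library (the encode_image helper is illustrative, not part of the SDK):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Pass the result as image_base64=encode_image("document_page.png") instead of image_path when the image is not on disk next to your code (e.g. it arrived over the network).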

Processing Documents (PDF, PPTX, DOCX)

The document analysis endpoints above operate on images. To process full documents (PDF, PPTX, DOCX, DOC, PPT), use the Knowledge Extraction API which handles the full pipeline: document → page images → layout detection → OCR → text extraction.

Using parse_document

import os
import uuid
from air.client import AIRefinery
from air.types import Document, TextElement

client = AIRefinery(api_key=os.getenv("API_KEY"))
doc_client = client.knowledge.document_processing

# Extract text from a PDF
response = doc_client.parse_document(
    file_path="report.pdf",
    model="knowledge-brain/knowledge-brain",
    timeout=300,
)

# response["text"] contains the combined extracted text
# response["pages"] contains per-page text: {"page0": "...", "page1": "..."}
print(f"Extracted {len(response['text'])} characters")
print(response["text"][:500])

# Use extracted text in a RAG pipeline
text_element = TextElement(
    id=str(uuid.uuid4()),
    text=response["text"],
    page_number=1,
    element_type="text",
)
doc = Document(
    filename="report.pdf",
    file_type="PDF",
    elements=[text_element],
)
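Note that the "pageN" keys in response["pages"] sort lexicographically ("page10" before "page2"), so iterate in numeric order when page order matters. A hypothetical helper (not part of the SDK), assuming the "pageN" key format shown above:

```python
def pages_in_order(pages: dict[str, str]) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs sorted by numeric page index."""
    return sorted(
        (int(key.removeprefix("page")), text) for key, text in pages.items()
    )
```

For example, `for num, text in pages_in_order(response["pages"]): ...` visits page 2 before page 10.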

Supported File Types

  • knowledge-brain/knowledge-brain: PDF, PPTX, DOCX, DOC, PPT

How It Works

Under the hood, the Knowledge Brain extraction service:

  1. Converts each page of the document to an image
  2. Runs layout detection (paddlex/RT-DETR-H_layout_17cls) to identify text blocks, tables, figures, etc.
  3. Runs text detection (paddlex/PP-OCRv4_server_det) to locate text regions
  4. Runs OCR to recognize text in each detected region
  5. Optionally refines text using an LLM and extracts figures using a VLM
  6. Returns combined text and per-page results
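Steps 2–4 imply associating each detected text region with the layout block that contains it. A simplified sketch of that association over axis-aligned [x1, y1, x2, y2] boxes (illustrative only; the service's actual grouping logic may differ):

```python
def box_contains(outer: list[float], inner: list[float]) -> bool:
    """True if bbox `inner` ([x1, y1, x2, y2]) lies entirely inside `outer`."""
    return (
        outer[0] <= inner[0] and outer[1] <= inner[1]
        and outer[2] >= inner[2] and outer[3] >= inner[3]
    )

def group_text_by_layout(elements, regions):
    """Map each layout-element index to the text-region indexes it contains."""
    groups = {i: [] for i in range(len(elements))}
    for j, region in enumerate(regions):
        for i, element in enumerate(elements):
            if box_contains(element, region):
                groups[i].append(j)
                break  # assign each region to the first containing block
    return groups
```

Regions that fall outside every layout block (e.g. stray marks) are simply left unassigned.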

For more details on the extraction pipeline and RAG integration, see the Knowledge Extraction API.


Class Overview

DocumentAnalysisClient

Synchronous client for PaddleX document analysis endpoints.

class DocumentAnalysisClient:
    def __init__(
        self,
        api_key: str | TokenProvider,
        *,
        base_url: str = BASE_URL,
        default_headers: dict[str, str] | None = None,
    ): ...
Parameters:
  • api_key (str | TokenProvider): API key or TokenProvider for authentication.
  • base_url (str): Base URL for the API. Defaults to https://api.airefinery.accenture.com.
  • default_headers (Optional[dict]): Headers applied to every request.

Methods

layout_detection

Detect layout elements (tables, figures, text blocks, etc.) in a document image.

def layout_detection(
    self,
    *,
    model: str = "paddlex/RT-DETR-H_layout_17cls",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.5,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> LayoutDetectionResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/RT-DETR-H_layout_17cls".
  • image_path (Optional[str]): Path to image file. Mutually exclusive with image_base64.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold (0–1). Default: 0.5.
  • timeout (Optional[float]): Request timeout in seconds.
  • extra_headers (Optional[dict]): Additional request headers.
Returns:
  • LayoutDetectionResponse: Contains elements (list of LayoutElement) and inference_time_ms.

text_detection

Detect text regions in a document image.

def text_detection(
    self,
    *,
    model: str = "paddlex/PP-OCRv4_server_det",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> TextDetectionResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/PP-OCRv4_server_det".
  • image_path (Optional[str]): Path to image file.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold. Default: 0.3.
  • timeout (Optional[float]): Request timeout in seconds.
Returns:
  • TextDetectionResponse: Contains regions (list of TextRegion) and inference_time_ms.

ocr

Perform OCR (text detection + recognition) on a document image.

def ocr(
    self,
    *,
    model: str = "paddlex/PP-OCRv5_server",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    language: str = "en",
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> OCRResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/PP-OCRv5_server".
  • image_path (Optional[str]): Path to image file.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold. Default: 0.3.
  • language (str): OCR language. Default: "en". Supported: en, ch, japan, korean, latin, arabic, and more.
  • timeout (Optional[float]): Request timeout in seconds.
Returns:
  • OCRResponse: Contains results (list of OCRResult) and inference_time_ms.

AsyncDocumentAnalysisClient

Asynchronous client with the same methods as DocumentAnalysisClient. All methods are async and should be awaited.

Response Models

LayoutDetectionResponse

class LayoutDetectionResponse(BaseModel):
    elements: list[LayoutElement]     # Detected layout elements
    inference_time_ms: float          # Inference time in milliseconds

LayoutElement

class LayoutElement(BaseModel):
    label: str          # Element type: "text", "title", "figure", "table", "header", etc.
    score: float        # Detection confidence (0–1)
    bbox: list[float]   # Bounding box [x1, y1, x2, y2]

The layout detection model (RT-DETR-H_layout_17cls) supports 17 element classes: text, title, figure, figure_caption, table, table_caption, header, footer, reference, equation, list-item, index, code, algorithm, abstract, author, stamp.
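A common post-processing step is tallying detections by class, e.g. to count how many tables appear on a page. A small sketch over plain (label, score) dicts mirroring the LayoutElement fields above (illustrative, not part of the SDK):

```python
from collections import Counter

def count_labels(elements: list[dict], min_score: float = 0.5) -> Counter:
    """Count layout elements by label, keeping only confident detections."""
    return Counter(e["label"] for e in elements if e["score"] >= min_score)
```

With SDK objects you would use e.label and e.score instead of dict keys.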

TextDetectionResponse

class TextDetectionResponse(BaseModel):
    regions: list[TextRegion]         # Detected text regions
    inference_time_ms: float

TextRegion

class TextRegion(BaseModel):
    bbox: list[list[float]]   # Quadrilateral bounding box (4 corner points)
    score: float              # Detection confidence
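Text-detection boxes are quadrilaterals (four [x, y] corner points), unlike the axis-aligned [x1, y1, x2, y2] boxes returned by layout detection. To compare the two, a quadrilateral can be reduced to its axis-aligned envelope. A hypothetical helper:

```python
def quad_to_rect(quad: list[list[float]]) -> list[float]:
    """Convert a 4-point quadrilateral to an axis-aligned [x1, y1, x2, y2] box."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [min(xs), min(ys), max(xs), max(ys)]
```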

OCRResponse

class OCRResponse(BaseModel):
    results: list[OCRResult]          # OCR results
    inference_time_ms: float

OCRResult

class OCRResult(BaseModel):
    text: str               # Recognized text
    score: float            # Recognition confidence
    bbox: list[list[float]] # Quadrilateral bounding box
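OCR results carry per-region boxes but no guaranteed reading order. A common post-processing step is sorting regions top-to-bottom, then left-to-right, by the top-left corner of each box. A sketch over dicts mirroring the OCRResult fields (illustrative; the row tolerance is an assumption you would tune per document):

```python
def reading_order(results: list[dict], row_tol: float = 10.0) -> list[str]:
    """Sort OCR results into rough reading order and return their texts."""
    def key(result):
        x, y = result["bbox"][0]  # top-left corner of the quadrilateral
        # Bucket y so regions on the same visual line sort by x, not by
        # small vertical jitter between their boxes.
        return (round(y / row_tol), x)
    return [result["text"] for result in sorted(results, key=key)]
```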