Document Analysis API

The Document Analysis API provides direct access to PaddleX-powered document analysis models for layout detection, text detection, and OCR. These endpoints allow you to analyze document images without going through the full knowledge extraction pipeline.

Overview

  • /v1/document-analysis/layout-detection (paddlex/RT-DETR-H_layout_17cls): Detect layout elements (tables, figures, text blocks, headers, etc.)
  • /v1/document-analysis/text-detection (paddlex/PP-OCRv4_server_det): Detect text regions in document images
  • /v1/document-analysis/ocr (paddlex/PP-OCRv5_server): Full OCR, i.e. text detection plus recognition with multi-language support

Example Usage

Using the SDK Client

import os
from air.document_analysis import DocumentAnalysisClient
from dotenv import load_dotenv
from dotenv import load_dotenv

load_dotenv()
api_key = os.environ["API_KEY"]  # fail fast if the key is not set

client = DocumentAnalysisClient(api_key=api_key)

# Layout detection
layout_result = client.layout_detection(
    model="paddlex/RT-DETR-H_layout_17cls",
    image_path="document_page.png",
    threshold=0.5,
)
for element in layout_result.elements:
    print(f"{element.label}: score={element.score:.3f}, bbox={element.bbox}")

# Text detection
text_result = client.text_detection(
    model="paddlex/PP-OCRv4_server_det",
    image_path="document_page.png",
    threshold=0.3,
)
print(f"Found {len(text_result.regions)} text regions")

# OCR
ocr_result = client.ocr(
    model="paddlex/PP-OCRv5_server",
    image_path="document_page.png",
    language="en",
)
for r in ocr_result.results:
    print(f"'{r.text}' (confidence: {r.score:.3f})")

Using the Async Client

import asyncio
from air.document_analysis import AsyncDocumentAnalysisClient

async def analyze():
    client = AsyncDocumentAnalysisClient(api_key="...")

    result = await client.ocr(
        model="paddlex/PP-OCRv5_server",
        image_base64="<base64-encoded-image>",
        language="en",
    )
    for r in result.results:
        print(r.text)

asyncio.run(analyze())
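The image_base64 parameter expects the raw image bytes encoded as a base64 string. A minimal sketch using only the standard library (the encode_image helper is illustrative, not part of the SDK):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Pass the result as image_base64=encode_image("document_page.png") instead of image_path when the image is not on disk next to your code (e.g. it arrived over the network).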

Processing Documents (PDF, PPTX, DOCX)

The document analysis endpoints above operate on images. To process full documents (PDF, PPTX, DOCX, DOC, PPT), use the Knowledge Extraction API which handles the full pipeline: document → page images → layout detection → OCR → text extraction.

Using parse_document

import os
import uuid
from air.client import AIRefinery
from air.types import Document, TextElement

client = AIRefinery(api_key=os.getenv("API_KEY"))
doc_client = client.knowledge.document_processing

# Extract text from a PDF
response = doc_client.parse_document(
    file_path="report.pdf",
    model="knowledge-brain/knowledge-brain",
    timeout=300,
)

# response["text"] contains the combined extracted text
# response["pages"] contains per-page text: {"page0": "...", "page1": "..."}
print(f"Extracted {len(response['text'])} characters")
print(response["text"][:500])

# Use extracted text in a RAG pipeline
text_element = TextElement(
    id=str(uuid.uuid4()),
    text=response["text"],
    page_number=1,
    element_type="text",
)
doc = Document(
    filename="report.pdf",
    file_type="PDF",
    elements=[text_element],
)
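Note that the "pageN" keys in response["pages"] sort lexicographically ("page10" before "page2"), so iterate in numeric order when page order matters. A hypothetical helper (not part of the SDK), assuming the "pageN" key format shown above:

```python
def pages_in_order(pages: dict[str, str]) -> list[tuple[int, str]]:
    """Return (page_number, text) pairs sorted by numeric page index."""
    return sorted(
        (int(key.removeprefix("page")), text) for key, text in pages.items()
    )
```

For example, `for num, text in pages_in_order(response["pages"]): ...` visits page 2 before page 10.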

Supported File Types

  • knowledge-brain/knowledge-brain: PDF, PPTX, DOCX, DOC, PPT

How It Works

Under the hood, the Knowledge Brain extraction service:

  1. Converts each page of the document to an image
  2. Runs layout detection (paddlex/RT-DETR-H_layout_17cls) to identify text blocks, tables, figures, etc.
  3. Runs text detection (paddlex/PP-OCRv4_server_det) to locate text regions
  4. Runs OCR to recognize text in each detected region
  5. Optionally refines text using an LLM and extracts figures using a VLM
  6. Returns combined text and per-page results
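Steps 2–4 imply associating each detected text region with the layout block that contains it. A simplified sketch of that association over axis-aligned [x1, y1, x2, y2] boxes (illustrative only; the service's actual grouping logic may differ):

```python
def box_contains(outer: list[float], inner: list[float]) -> bool:
    """True if bbox `inner` ([x1, y1, x2, y2]) lies entirely inside `outer`."""
    return (
        outer[0] <= inner[0] and outer[1] <= inner[1]
        and outer[2] >= inner[2] and outer[3] >= inner[3]
    )

def group_text_by_layout(elements, regions):
    """Map each layout-element index to the text-region indexes it contains."""
    groups = {i: [] for i in range(len(elements))}
    for j, region in enumerate(regions):
        for i, element in enumerate(elements):
            if box_contains(element, region):
                groups[i].append(j)
                break  # assign each region to the first containing block
    return groups
```

Regions that fall outside every layout block (e.g. stray marks) are simply left unassigned.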

For more details on the extraction pipeline and RAG integration, see the Knowledge Extraction API.


Class Overview

DocumentAnalysisClient

Synchronous client for PaddleX document analysis endpoints.

class DocumentAnalysisClient:
    def __init__(
        self,
        api_key: str | TokenProvider,
        *,
        base_url: str = BASE_URL,
        default_headers: dict[str, str] | None = None,
    ): ...
Parameters:
  • api_key (str | TokenProvider): API key or TokenProvider for authentication.
  • base_url (str): Base URL for the API. Defaults to https://api.airefinery.accenture.com.
  • default_headers (Optional[dict]): Headers applied to every request.

Methods

layout_detection

Detect layout elements (tables, figures, text blocks, etc.) in a document image.

def layout_detection(
    self,
    *,
    model: str = "paddlex/RT-DETR-H_layout_17cls",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.5,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> LayoutDetectionResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/RT-DETR-H_layout_17cls".
  • image_path (Optional[str]): Path to image file. Mutually exclusive with image_base64.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold (0–1). Default: 0.5.
  • timeout (Optional[float]): Request timeout in seconds.
  • extra_headers (Optional[dict]): Additional request headers.
Returns:
  • LayoutDetectionResponse: Contains elements (list of LayoutElement) and inference_time_ms.

text_detection

Detect text regions in a document image.

def text_detection(
    self,
    *,
    model: str = "paddlex/PP-OCRv4_server_det",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> TextDetectionResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/PP-OCRv4_server_det".
  • image_path (Optional[str]): Path to image file.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold. Default: 0.3.
  • timeout (Optional[float]): Request timeout in seconds.
Returns:
  • TextDetectionResponse: Contains regions (list of TextRegion) and inference_time_ms.

ocr

Perform OCR (text detection + recognition) on a document image.

def ocr(
    self,
    *,
    model: str = "paddlex/PP-OCRv5_server",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    language: str = "en",
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> OCRResponse: ...
Parameters:
  • model (str): Model name. Default: "paddlex/PP-OCRv5_server".
  • image_path (Optional[str]): Path to image file.
  • image_base64 (Optional[str]): Base64-encoded image data.
  • threshold (float): Detection confidence threshold. Default: 0.3.
  • language (str): OCR language. Default: "en". Supported: en, ch, japan, korean, latin, arabic, and more.
  • timeout (Optional[float]): Request timeout in seconds.
Returns:
  • OCRResponse: Contains results (list of OCRResult) and inference_time_ms.

AsyncDocumentAnalysisClient

Asynchronous client with the same methods as DocumentAnalysisClient. All methods are async and should be awaited.

Response Models

LayoutDetectionResponse

class LayoutDetectionResponse(BaseModel):
    elements: list[LayoutElement]     # Detected layout elements
    inference_time_ms: float          # Inference time in milliseconds

LayoutElement

class LayoutElement(BaseModel):
    label: str          # Element type: "text", "title", "figure", "table", "header", etc.
    score: float        # Detection confidence (0–1)
    bbox: list[float]   # Bounding box [x1, y1, x2, y2]

The layout detection model (RT-DETR-H_layout_17cls) supports 17 element classes: text, title, figure, figure_caption, table, table_caption, header, footer, reference, equation, list-item, index, code, algorithm, abstract, author, stamp.
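A common post-processing step is tallying detections by class, e.g. to count how many tables appear on a page. A small sketch over plain (label, score) dicts mirroring the LayoutElement fields above (illustrative, not part of the SDK):

```python
from collections import Counter

def count_labels(elements: list[dict], min_score: float = 0.5) -> Counter:
    """Count layout elements by label, keeping only confident detections."""
    return Counter(e["label"] for e in elements if e["score"] >= min_score)
```

With SDK objects you would use e.label and e.score instead of dict keys.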

TextDetectionResponse

class TextDetectionResponse(BaseModel):
    regions: list[TextRegion]         # Detected text regions
    inference_time_ms: float

TextRegion

class TextRegion(BaseModel):
    bbox: list[list[float]]   # Quadrilateral bounding box (4 corner points)
    score: float              # Detection confidence
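Text-detection boxes are quadrilaterals (four [x, y] corner points), unlike the axis-aligned [x1, y1, x2, y2] boxes returned by layout detection. To compare the two, a quadrilateral can be reduced to its axis-aligned envelope. A hypothetical helper:

```python
def quad_to_rect(quad: list[list[float]]) -> list[float]:
    """Convert a 4-point quadrilateral to an axis-aligned [x1, y1, x2, y2] box."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [min(xs), min(ys), max(xs), max(ys)]
```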

OCRResponse

class OCRResponse(BaseModel):
    results: list[OCRResult]          # OCR results
    inference_time_ms: float

OCRResult

class OCRResult(BaseModel):
    text: str               # Recognized text
    score: float            # Recognition confidence
    bbox: list[list[float]] # Quadrilateral bounding box
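OCR results carry per-region boxes but no guaranteed reading order. A common post-processing step is sorting regions top-to-bottom, then left-to-right, by the top-left corner of each box. A sketch over dicts mirroring the OCRResult fields (illustrative; the row tolerance is an assumption you would tune per document):

```python
def reading_order(results: list[dict], row_tol: float = 10.0) -> list[str]:
    """Sort OCR results into rough reading order and return their texts."""
    def key(result):
        x, y = result["bbox"][0]  # top-left corner of the quadrilateral
        # Bucket y so regions on the same visual line sort by x, not by
        # small vertical jitter between their boxes.
        return (round(y / row_tol), x)
    return [result["text"] for result in sorted(results, key=key)]
```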