Document Analysis API¶
The Document Analysis API provides direct access to PaddleX-powered document analysis models for layout detection, text detection, and OCR. These endpoints allow you to analyze document images without going through the full knowledge extraction pipeline.
Overview¶
| Endpoint | Description | Model |
|---|---|---|
| `/v1/document-analysis/layout-detection` | Detect layout elements (tables, figures, text blocks, headers, etc.) | `paddlex/RT-DETR-H_layout_17cls` |
| `/v1/document-analysis/text-detection` | Detect text regions in document images | `paddlex/PP-OCRv4_server_det` |
| `/v1/document-analysis/ocr` | Full OCR: text detection plus recognition with multi-language support | `paddlex/PP-OCRv5_server` |
Example Usage¶
Using the SDK Client¶
```python
import os

from dotenv import load_dotenv

from air.document_analysis import DocumentAnalysisClient

load_dotenv()
api_key = str(os.getenv("API_KEY"))

client = DocumentAnalysisClient(api_key=api_key)

# Layout detection
layout_result = client.layout_detection(
    model="paddlex/RT-DETR-H_layout_17cls",
    image_path="document_page.png",
    threshold=0.5,
)
for element in layout_result.elements:
    print(f"{element.label}: score={element.score:.3f}, bbox={element.bbox}")

# Text detection
text_result = client.text_detection(
    model="paddlex/PP-OCRv4_server_det",
    image_path="document_page.png",
    threshold=0.3,
)
print(f"Found {len(text_result.regions)} text regions")

# OCR
ocr_result = client.ocr(
    model="paddlex/PP-OCRv5_server",
    image_path="document_page.png",
    language="en",
)
for r in ocr_result.results:
    print(f"'{r.text}' (confidence: {r.score:.3f})")
```
Using the Async Client¶
```python
import asyncio

from air.document_analysis import AsyncDocumentAnalysisClient


async def analyze():
    client = AsyncDocumentAnalysisClient(api_key="...")
    result = await client.ocr(
        model="paddlex/PP-OCRv5_server",
        image_base64="<base64-encoded-image>",
        language="en",
    )
    for r in result.results:
        print(r.text)


asyncio.run(analyze())
```
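The `image_base64` parameter expects the raw image file bytes encoded as base64. A minimal helper for producing that string, using only the standard library (this helper is illustrative and not part of the SDK), might look like:

```python
import base64


def image_to_base64(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

The returned string can be passed directly as `image_base64` in place of `image_path`.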
Processing Documents (PDF, PPTX, DOCX)¶
The document analysis endpoints above operate on images. To process full documents (PDF, PPTX, DOCX, DOC, PPT), use the Knowledge Extraction API which handles the full pipeline: document → page images → layout detection → OCR → text extraction.
Using parse_document¶
```python
import os
import uuid

from air.client import AIRefinery
from air.types import Document, TextElement

client = AIRefinery(api_key=os.getenv("API_KEY"))
doc_client = client.knowledge.document_processing

# Extract text from a PDF
response = doc_client.parse_document(
    file_path="report.pdf",
    model="knowledge-brain/knowledge-brain",
    timeout=300,
)

# response["text"] contains the combined extracted text
# response["pages"] contains per-page text: {"page0": "...", "page1": "..."}
print(f"Extracted {len(response['text'])} characters")
print(response["text"][:500])

# Use extracted text in a RAG pipeline
text_element = TextElement(
    id=str(uuid.uuid4()),
    text=response["text"],
    page_number=1,
    element_type="text",
)
doc = Document(
    filename="report.pdf",
    file_type="PDF",
    elements=[text_element],
)
```
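The keys in `response["pages"]` are strings like `"page0"`, so a plain string sort would place `"page10"` before `"page2"`. A small helper (hypothetical, not provided by the SDK) can order pages numerically before building one `TextElement` per page:

```python
def pages_in_order(pages: dict[str, str]) -> list[tuple[int, str]]:
    """Sort a {"page0": "...", "page1": "..."} mapping by numeric page index."""
    return sorted(
        ((int(key.removeprefix("page")), text) for key, text in pages.items()),
        key=lambda item: item[0],
    )
```

Each `(index, text)` pair can then back a `TextElement` with `page_number=index + 1`.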
Supported File Types¶
| Model | Supported Formats |
|---|---|
| `knowledge-brain/knowledge-brain` | PDF, PPTX, DOCX, DOC, PPT |
How It Works¶
Under the hood, the Knowledge Brain extraction service:
- Converts each page of the document to an image
- Runs layout detection (`paddlex/RT-DETR-H_layout_17cls`) to identify text blocks, tables, figures, etc.
- Runs text detection (`paddlex/PP-OCRv4_server_det`) to locate text regions
- Runs OCR to recognize text in each detected region
- Optionally refines text using an LLM and extracts figures using a VLM
- Returns combined text and per-page results
For more details on the extraction pipeline and RAG integration, see the Knowledge Extraction API.
Class Overview¶
DocumentAnalysisClient¶
Synchronous client for PaddleX document analysis endpoints.
```python
class DocumentAnalysisClient:
    def __init__(
        self,
        api_key: str | TokenProvider,
        *,
        base_url: str = BASE_URL,
        default_headers: dict[str, str] | None = None,
    ): ...
```
Parameters:¶
- `api_key` (str | TokenProvider): API key or TokenProvider for authentication.
- `base_url` (Optional[str]): Base URL for the API. Defaults to `https://api.airefinery.accenture.com`.
- `default_headers` (Optional[dict]): Headers applied to every request.
Methods¶
layout_detection¶
Detect layout elements (tables, figures, text blocks, etc.) in a document image.
```python
def layout_detection(
    self,
    *,
    model: str = "paddlex/RT-DETR-H_layout_17cls",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.5,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> LayoutDetectionResponse: ...
```
Parameters:¶
- `model` (str): Model name. Default: `"paddlex/RT-DETR-H_layout_17cls"`.
- `image_path` (Optional[str]): Path to image file. Mutually exclusive with `image_base64`.
- `image_base64` (Optional[str]): Base64-encoded image data.
- `threshold` (float): Detection confidence threshold (0–1). Default: `0.5`.
- `timeout` (Optional[float]): Request timeout in seconds.
- `extra_headers` (Optional[dict]): Additional request headers.
Returns:¶
`LayoutDetectionResponse`: Contains `elements` (list of `LayoutElement`) and `inference_time_ms`.
text_detection¶
Detect text regions in a document image.
```python
def text_detection(
    self,
    *,
    model: str = "paddlex/PP-OCRv4_server_det",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> TextDetectionResponse: ...
```
Parameters:¶
- `model` (str): Model name. Default: `"paddlex/PP-OCRv4_server_det"`.
- `image_path` (Optional[str]): Path to image file.
- `image_base64` (Optional[str]): Base64-encoded image data.
- `threshold` (float): Detection confidence threshold. Default: `0.3`.
- `timeout` (Optional[float]): Request timeout in seconds.
Returns:¶
`TextDetectionResponse`: Contains `regions` (list of `TextRegion`) and `inference_time_ms`.
ocr¶
Perform OCR (text detection + recognition) on a document image.
```python
def ocr(
    self,
    *,
    model: str = "paddlex/PP-OCRv5_server",
    image_path: str | None = None,
    image_base64: str | None = None,
    threshold: float = 0.3,
    language: str = "en",
    timeout: float | None = None,
    extra_headers: dict[str, str] | None = None,
) -> OCRResponse: ...
```
Parameters:¶
- `model` (str): Model name. Default: `"paddlex/PP-OCRv5_server"`.
- `image_path` (Optional[str]): Path to image file.
- `image_base64` (Optional[str]): Base64-encoded image data.
- `threshold` (float): Detection confidence threshold. Default: `0.3`.
- `language` (str): OCR language. Default: `"en"`. Supported: `en`, `ch`, `japan`, `korean`, `latin`, `arabic`, and more.
- `timeout` (Optional[float]): Request timeout in seconds.
Returns:¶
`OCRResponse`: Contains `results` (list of `OCRResult`) and `inference_time_ms`.
AsyncDocumentAnalysisClient¶
Asynchronous client with the same methods as DocumentAnalysisClient. All methods are async and should be awaited.
Response Models¶
LayoutDetectionResponse¶
```python
class LayoutDetectionResponse(BaseModel):
    elements: list[LayoutElement]  # Detected layout elements
    inference_time_ms: float       # Inference time in milliseconds
```
LayoutElement¶
```python
class LayoutElement(BaseModel):
    label: str         # Element type: "text", "title", "figure", "table", "header", etc.
    score: float       # Detection confidence (0–1)
    bbox: list[float]  # Bounding box [x1, y1, x2, y2]
```
The layout detection model (RT-DETR-H_layout_17cls) supports 17 element classes:
text, title, figure, figure_caption, table, table_caption, header, footer, reference, equation, list-item, index, code, algorithm, abstract, author, stamp.
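A common post-processing step is to keep only certain element classes, for example just tables and figures. A small filter over the `elements` list (illustrative only, not part of the SDK) is enough:

```python
def filter_elements(elements, labels: set[str]):
    """Keep only layout elements whose label is in the given set,
    preserving the original order."""
    return [e for e in elements if e.label in labels]
```

For example, `filter_elements(layout_result.elements, {"table", "figure"})` keeps the regions you might want to crop and send to a table parser or VLM.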
TextDetectionResponse¶
```python
class TextDetectionResponse(BaseModel):
    regions: list[TextRegion]  # Detected text regions
    inference_time_ms: float   # Inference time in milliseconds
```
TextRegion¶
```python
class TextRegion(BaseModel):
    bbox: list[list[float]]  # Quadrilateral bounding box (4 corner points)
    score: float             # Detection confidence
```
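Note that text detection returns quadrilateral boxes rather than the `[x1, y1, x2, y2]` rectangles used by layout detection, since scanned text lines may be rotated or skewed. If downstream tooling expects axis-aligned rectangles, the four corner points can be collapsed into a bounding rectangle (an illustrative helper, not part of the SDK):

```python
def quad_to_rect(quad: list[list[float]]) -> list[float]:
    """Collapse a 4-point quadrilateral bbox into an axis-aligned
    [x1, y1, x2, y2] rectangle that fully contains it."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return [min(xs), min(ys), max(xs), max(ys)]
```

This loses the skew information, so prefer the raw quadrilateral when cropping rotated text for recognition.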