Knowledge Extraction API

This documentation provides an overview of our Knowledge Extraction API. The API extracts knowledge from input documents in various formats, which typically contain text, tables, and figures. The extracted knowledge is converted into a compact embedding space and stored in a specified vector database, which facilitates downstream knowledge-search RAG applications, e.g., via AI Refinery's built-in research agent.

Example Usage

In this example we show how to create a DocumentProcessingClient object from the unified AIRefinery client, use the parse_document method to parse input documents, and use the pipeline method to perform a series of operations on the parsed documents. The end result is a vector database populated with all of the extracted knowledge.

The knowledge extraction functionality is exposed via the AIRefinery client, and this example demonstrates how to access it.

import os
import uuid

from air import login
from air.api.vector_db import VectorDBConfig
from air.client import AIRefinery
from air.types import Document, TextElement, ChunkingConfig, EmbeddingConfig, VectorDBUploadConfig, DocumentProcessingConfig

api_key = str(os.getenv("API_KEY"))
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=api_key,
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

vectordb_config = VectorDBConfig(
    base_url="https://<service_base_url>.search.windows.net",
    api_key="<your-api-key>",
    api_version="2023-11-01",
    index="<your-index-name>",
)
upload_config = VectorDBUploadConfig(batch_size=50, max_workers=1)
embedding_config = EmbeddingConfig(model="intfloat/e5-mistral-7b-instruct", batch_size=32, max_workers=1)
chunking_config = ChunkingConfig(algorithm="BruteForceChunking", chunk_size=10, overlap_size=0)

# Create a unified AIRefinery client
client = AIRefinery(base_url=base_url, api_key=api_key)

# Get the document processing client from the unified AIRefinery client
document_processing_client = client.knowledge.document_processing

# create document processing configuration
doc_process_config = DocumentProcessingConfig(
    upload_config=upload_config,
    vectordb_config=vectordb_config,
    embedding_config=embedding_config,
    chunking_config=chunking_config,
)

# Set up the knowledge extraction project with the given configuration
document_processing_client.create_project(doc_process_config=doc_process_config)  # type: ignore

def knowledge_extraction():

    print("Example of parse_documents:\n")
    # Choose a model: "nv-ingest/nv-ingest" or "knowledge-brain/knowledge-brain"
    extraction_model = "knowledge-brain/knowledge-brain"
    # path to the local file
    file_path = "<path-to-your-file>"

    try:
        # parse documents: extract content from the given document using the specified extraction model
        # set timeout in seconds, increase timeout according to file content/pages
        response = document_processing_client.parse_document(
            file_path=file_path, model=extraction_model, timeout=300
        )
    except Exception as e:
        print(f"Failed to extract knowledge. {e}")
        return
    print(f"This is the response of parse_documents method: {response}")


    print("Example of pipeline:\n")
    text_element = TextElement(
        id=str(uuid.uuid4()),
        text=response["text"],
        page_number=1,
        element_type="text",
        text_vector=[],
    )

    # create Document object for pipeline
    doc = Document(
        filename=os.path.basename(file_path),
        file_type="PDF",
        elements=[text_element],
        metadata={},
    )
    documents = [doc]

    # list of tasks to perform in pipeline
    pipeline_steps = ["chunk", "embed", "upload"] 
    # execute pipeline: chunk, embed and upload from the list of documents
    status_dict = document_processing_client.pipeline(documents, pipeline_steps)
    print(f"Response of pipeline: {status_dict}")


if __name__ == "__main__":

    print("\nExample of extracting knowledge from pdf file...")
    knowledge_extraction()

Class Overview

TextElement and Document are supporting data types for input to the pipeline function of DocumentProcessingClient.

TextElement

class TextElement(BaseModel):
    """
    Document element data config

    Attributes:
        id (str): Unique identifier for the element
        text (str): Text of the element
        page_number (int): Document page number from which element was extracted
        element_type (str): Type of element, one of (text, table, figure)
        text_vector (list): Embedding Vector for the element text
    """

    id: str = Field(..., description="Unique identifier for the element")
    text: str = Field(..., description="Text from the element")
    page_number: int = Field(
        ..., description="Document page number from which element was extracted"
    )
    element_type: Literal["text", "table", "figure"] = Field(
        ..., description="Type of element"
    )
    text_vector: List = Field(
        default=[], description="Embedding Vector for the element text"
    )

Attributes

  • id - Unique identifier for the element
  • text - Text from the element
  • page_number - Document page number from which element was extracted
  • element_type (Literal["text", "table", "figure"]) - Type of element, can be: text, table, figure
  • text_vector - Embedding Vector for the element text (default: [])
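
For illustration, below is a minimal sketch that constructs a TextElement for a table extracted from a document; the id, text, and page number are hypothetical values.

import uuid

from air.types import TextElement

# A hypothetical table element; text_vector is left empty and is
# populated later by the embedding step of the pipeline.
table_element = TextElement(
    id=str(uuid.uuid4()),
    text="Quarter, Revenue\nQ1, 10\nQ2, 12",
    page_number=3,
    element_type="table",
    text_vector=[],
)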

Document

class Document(BaseModel):
    """
    Document Object data class.

    Attributes:
        filename (str): Name of the file
        file_type (str): File type/extension
        elements (list): List of file elements
        metadata (dict): Metadata related to the document
    """

    filename: str = Field(..., description="Name of the file")
    file_type: str = Field(..., description="File type/extension")
    elements: List[TextElement] = Field(..., description="List of document elements")
    metadata: dict = Field(default={}, description="Metadata related to the document")

Attributes

  • filename - Name of the file
  • file_type - File type/extension
  • elements (List[TextElement]) - List of document elements
  • metadata - Metadata related to the document (default={})
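
The sketch below shows how a Document wraps one or more TextElement objects before being passed to the pipeline; the filename and metadata values are hypothetical.

from air.types import Document, TextElement

# A hypothetical document built from a single extracted text element
element = TextElement(
    id="element-1",
    text="Extracted paragraph text.",
    page_number=1,
    element_type="text",
    text_vector=[],
)

doc = Document(
    filename="report.pdf",         # hypothetical file name
    file_type="PDF",
    elements=[element],
    metadata={"source": "local"},  # hypothetical metadata
)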

DocumentProcessingConfig

The DocumentProcessingConfig class provides the configuration for document processing. It is required as a parameter to the create_project method of DocumentProcessingClient.

class DocumentProcessingConfig(BaseModel):
    """
    Configuration for document processing
    """

    upload_config: VectorDBUploadConfig = Field(
        default=VectorDBUploadConfig(), description="Vector DB upload configuration"
    )
    vectordb_config: VectorDBConfig = Field(..., description="Vector DB configuration")
    embedding_config: EmbeddingConfig = Field(
        ..., description="Embedding configuration"
    )
    chunking_config: ChunkingConfig = Field(
        ..., description="Chunking parameter configuration"
    )

Attributes

  • upload_config (VectorDBUploadConfig) - vector database upload configuration
    • batch_size - Number of rows in a batch per upload request (default=50)
    • max_workers - Number of parallel threads to spawn while uploading rows to vector DB
  • vectordb_config (VectorDBConfig) - vector database configuration
    • type - Type of the Vector DB (default="AzureAISearch")
    • base_url - Vector DB URL
    • api_key - API key required to access the vector DB
    • api_version - API Version
    • index - Name of the vector db index
    • embedding_column - Name of the column in the index that stores embeddings for vector searches (default="text_vector")
    • top_k - Number of top results (k) to return from each vector search request (default=1)
    • content_column - List of columns whose content should be returned in search results and which are populated in the vector DB; values are retrieved from TextElement objects or from the metadata of Document objects (default=[])
    • timeout - Vector DB POST request timeout in seconds (default=60)
  • embedding_config (EmbeddingConfig) - embedding configuration
    • model - Name of the embedding model to use; use only models that are available on AI Refinery
    • batch_size - Number of rows in a batch per embedding request (default=50)
    • max_workers - Number of parallel threads to spawn while creating embeddings (default=8)
  • chunking_config (ChunkingConfig) - chunking parameter configuration
    • algorithm - Type of Chunking Algorithm, options: BruteForceChunking, SemanticChunking
    • chunk_size - Max length per chunk
    • overlap_size - Overlap between two neighboring chunks (default = 0)
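
For reference, the sketch below assembles a DocumentProcessingConfig from its sub-configurations, mirroring the example at the top of this page; the vector DB credentials and index name are placeholders you need to replace.

from air.api.vector_db import VectorDBConfig
from air.types import (
    ChunkingConfig,
    DocumentProcessingConfig,
    EmbeddingConfig,
    VectorDBUploadConfig,
)

doc_process_config = DocumentProcessingConfig(
    upload_config=VectorDBUploadConfig(batch_size=50, max_workers=1),
    vectordb_config=VectorDBConfig(
        base_url="https://<service_base_url>.search.windows.net",  # placeholder
        api_key="<your-api-key>",  # placeholder
        api_version="2023-11-01",
        index="<your-index-name>",  # placeholder
    ),
    embedding_config=EmbeddingConfig(
        model="intfloat/e5-mistral-7b-instruct", batch_size=32, max_workers=1
    ),
    chunking_config=ChunkingConfig(
        algorithm="BruteForceChunking", chunk_size=10, overlap_size=0
    ),
)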

DocumentProcessingClient

The DocumentProcessingClient class provides an interface for interacting with AI Refinery's knowledge extraction service, allowing users to extract knowledge (text/tables/images) from five types of input files: PPTX, PDF, DOCX, PPT, and DOC. AIRefinery.knowledge.document_processing is of type DocumentProcessingClient.

class DocumentProcessingClient:
    """
    Interface for interacting with the AI Refinery's knowledge extraction service,
    allowing users to extract knowledge from input documents.
    """

Methods

__init__

Initializes the DocumentProcessingClient instance with an optional base_url parameter.

def __init__(
        self, *, base_url: str = ""
    ) -> None:
    ...
Parameters:
  • base_url (str, optional): Base URL used by the client (default: "")

create_project

Initializes and sets up a knowledge extraction project based on the provided configuration.

def create_project(
    self, doc_process_config: DocumentProcessingConfig
) -> None:
    ...
Parameters:
  • doc_process_config (DocumentProcessingConfig): Configuration for document processing; this field is required.
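
With a DocumentProcessingConfig assembled as above, project setup is a single call (a sketch, assuming document_processing_client was obtained from the unified AIRefinery client as in the example):

# Set up the knowledge extraction project with the prepared configuration
document_processing_client.create_project(doc_process_config=doc_process_config)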

parse_document

Extracts text and multimedia content from the given document using the specified knowledge extraction model.

async def parse_document(self, *, file_path: str, model: str, timeout: int | None = None) -> Optional[dict]:
    ...
Parameters:
  • file_path (str): Local path of the input file
  • model (str): Name of the knowledge extraction model to use, either knowledge-brain/knowledge-brain or nv-ingest/nv-ingest. knowledge-brain returns a document summary in addition to the extracted document text and supports a broader set of file types (PDF, PPTX, DOCX, DOC, PPT); nv-ingest returns results faster but supports only PDF, PPTX, and DOCX
  • timeout (int | None, defaults to None): Timeout of the document extraction request, in seconds. If set to None, the configured default timeout is used. Increase this value according to the content/pages in the document
Returns:
  • dict:

    • If successful, returns a dictionary containing the extracted document elements:

      • text (str): Combined extracted text content from the document
      • summaries (dict): Summaries of the document content (included only for model='knowledge-brain').
      • diagrams (List[str]): List of base64-encoded image strings, if any
      • tables (List[str]): Structured table data, if any (included only for model='nv-ingest').
      • file_url (str): URL to the source document (only for model='knowledge-brain')
    • If unsuccessful, returns a dictionary with a single key:

      • error (str): Description of the error or reason for failure.
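
The sketch below shows one way to call parse_document and handle a failed extraction; the file path is a placeholder and the timeout value is only an assumption that should be scaled with the document size.

# A minimal sketch, assuming document_processing_client was obtained from
# the unified AIRefinery client as in the example above.
response = document_processing_client.parse_document(
    file_path="<path-to-your-file>",  # placeholder
    model="knowledge-brain/knowledge-brain",
    timeout=300,  # assumed value; increase for large documents
)

if not response or "error" in response:
    print(f"Extraction failed: {response}")
else:
    print(response["text"])  # combined extracted text
    print(response.get("summaries"))  # summaries (knowledge-brain only)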

pipeline

Performs a list of tasks specified by the user on a list of documents. Currently supported tasks are: "chunk", "embed", "upload".

  • chunk - splits input documents into chunks, using either brute-force chunking (fixed-length chunks) or semantic chunking (splitting based on similarity rather than fixed length)
  • embed - converts chunks of text (from documents) into dense vector representations using an embedding model
  • upload - uploads the final chunked and embedded document data to the vector database
def pipeline(
        self, doc_list: List[Document], task_list: List[str]
    ) -> Dict[str, bool]:
    ...
Parameters:
  • doc_list (List[Document]): A list of Document objects to be processed
  • task_list (List[str]): A list of tasks that the user wants to perform. Currently supported tasks are: "chunk", "embed", "upload". To be supported: "de-id", "translate"
Returns:
  • Dict[str, bool]: A dictionary indicating whether each task was completed successfully on all documents (True if the task completed successfully on all documents, False otherwise)
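
As a usage sketch, assuming documents is a list of Document objects prepared as shown earlier, the returned dictionary can be checked per task:

# Run chunking, embedding, and upload on the prepared documents
status_dict = document_processing_client.pipeline(documents, ["chunk", "embed", "upload"])

# status_dict maps each task name to True/False
for task, ok in status_dict.items():
    print(f"{task}: {'succeeded' if ok else 'failed'} on all documents")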