Knowledge Extraction API¶
This documentation provides an overview of our Knowledge Extraction API. The API allows users to extract knowledge from various formats of input documents, which typically contain text, tables, and figures. The extracted knowledge is converted to a compact embedding space and stored in a specified vector database. This facilitates downstream knowledge-search RAG applications, e.g., using AI Refinery's built-in research agent.
Example Usage¶
In this example we show how to create the DocumentProcessingClient object using the unified AIRefinery client, and how to use its parse_document method to parse input documents as well as its pipeline method to perform a series of operations on the parsed documents. The end result is a vector database populated with all the extracted knowledge. The knowledge extraction functionality is exposed via the AIRefinery client, and this example demonstrates how to access it.
import os
import uuid
from air import login
from air.api.vector_db import VectorDBConfig
from air.client import AIRefinery
from air.types import Document, TextElement, ChunkingConfig, EmbeddingConfig, VectorDBUploadConfig, DocumentProcessingConfig
api_key = str(os.getenv("API_KEY"))
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=api_key,
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

# Vector DB, upload, embedding, and chunking configurations
vectordb_config = VectorDBConfig(
    base_url="https://<service_base_url>.search.windows.net",
    api_key="<your-api-key>",
    api_version="2023-11-01",
    index="<your-index-name>",
)
upload_config = VectorDBUploadConfig(batch_size=50, max_workers=1)
embedding_config = EmbeddingConfig(model="intfloat/e5-mistral-7b-instruct", batch_size=32, max_workers=1)
chunking_config = ChunkingConfig(algorithm="BruteForceChunking", chunk_size=10, overlap_size=0)

# Create a unified AIRefinery client
client = AIRefinery(base_url=base_url, api_key=api_key)

# Get the document processing client from the unified AIRefinery client
document_processing_client = client.knowledge.document_processing

# Create the document processing configuration
doc_process_config = DocumentProcessingConfig(
    upload_config=upload_config,
    vectordb_config=vectordb_config,
    embedding_config=embedding_config,
    chunking_config=chunking_config,
)

# Configure the document processing project with the configuration
document_processing_client.create_project(doc_process_config=doc_process_config)  # type: ignore
def knowledge_extraction():
    print("Example of parse_document:\n")

    # Choose a model: "nv-ingest/nv-ingest" or "knowledge-brain/knowledge-brain"
    extraction_model = "knowledge-brain/knowledge-brain"

    # Path to the local file
    file_path = "<path-to-your-file>"

    try:
        # Parse document: extract content from the given document using the
        # specified extraction model. The timeout is in seconds; increase it
        # according to the file content/pages.
        response = document_processing_client.parse_document(
            file_path=file_path, model=extraction_model, timeout=300
        )
    except Exception as e:
        print(f"Failed to extract knowledge. {e}")
        return

    print(f"This is the response of the parse_document method: {response}")

    print("Example of pipeline:\n")

    # Wrap the extracted text in a TextElement for the pipeline
    text_element = TextElement(
        id=str(uuid.uuid4()),
        text=response["text"],
        page_number=1,
        element_type="text",
        text_vector=[],
    )

    # Create a Document object for the pipeline
    doc = Document(
        filename=os.path.basename(file_path),
        file_type="PDF",
        elements=[text_element],
        metadata={},
    )
    documents = [doc]

    # List of tasks to perform in the pipeline
    pipeline_steps = ["chunk", "embed", "upload"]

    # Execute pipeline: chunk, embed, and upload the list of documents
    status_dict = document_processing_client.pipeline(documents, pipeline_steps)
    print(f"Response of pipeline: {status_dict}")


if __name__ == "__main__":
    print("\nExample of extracting knowledge from pdf file...")
    knowledge_extraction()
Class Overview¶
TextElement and Document are supporting data types used as input to the pipeline function of DocumentProcessingClient.
TextElement¶
class TextElement(BaseModel):
"""
Document element data config
Attributes:
id (str): Unique identifier for the element
text (str): Text of the element
page_number (int): Document page number from which element was extracted
element_type (str): Type of element, one of (text, table, figure)
text_vector (list): Embedding Vector for the element text
"""
id: str = Field(..., description="Unique identifier for the element")
text: str = Field(..., description="Text from the element")
page_number: int = Field(
..., description="Document page number from which element was extracted"
)
element_type: Literal["text", "table", "figure"] = Field(
..., description="Type of element"
)
text_vector: List = Field(
default=[], description="Embedding Vector for the element text"
)
Attributes¶
- id (str) - Unique identifier for the element
- text (str) - Text from the element
- page_number (int) - Document page number from which element was extracted
- element_type (Literal["text", "table", "figure"]) - Type of element, can be: text, table, figure
- text_vector (List) - Embedding vector for the element text (default: [])
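For illustration, a TextElement can be constructed directly from extracted content; the text and page number below are placeholder values, and text_vector is left at its default empty list when no embedding has been computed yet.
import uuid

from air.types import TextElement

element = TextElement(
    id=str(uuid.uuid4()),                  # unique identifier for this element
    text="Quarterly revenue grew 12%.",    # placeholder extracted text
    page_number=3,                         # page the text was extracted from
    element_type="text",                   # one of: text, table, figure
    text_vector=[],                        # empty until embeddings are computed
)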
Document¶
class Document(BaseModel):
"""
Document Object data class.
Attributes:
filename (str): Name of the file
file_type (str): File type/extension
elements (list): List of file elements
metadata (dict): Metadata related to the document
"""
filename: str = Field(..., description="Name of the file")
file_type: str = Field(..., description="File type/extension")
elements: List[TextElement] = Field(..., description="List of document elements")
metadata: dict = Field(default={}, description="Metadata related to the document")
Attributes¶
- filename (str) - Name of the file
- file_type (str) - File type/extension
- elements (List[TextElement]) - List of document elements
- metadata (dict) - Metadata related to the document (default={})
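A Document then wraps one or more elements together with basic file information; the filename and metadata below are placeholders.
from air.types import Document

doc = Document(
    filename="report.pdf",           # placeholder file name
    file_type="PDF",                 # file type/extension
    elements=[element],              # TextElement objects, e.g. the one constructed above
    metadata={"source": "example"},  # optional free-form metadata
)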
DocumentProcessingConfig¶
The DocumentProcessingConfig class provides the configuration for document processing. It is required as a parameter to the create_project method of DocumentProcessingClient.
class DocumentProcessingConfig(BaseModel):
"""
Configuration for document processing
"""
upload_config: VectorDBUploadConfig = Field(
default=VectorDBUploadConfig(), description="Vector DB upload configuration"
)
vectordb_config: VectorDBConfig = Field(..., description="Vector DB configuration")
embedding_config: EmbeddingConfig = Field(
..., description="Embedding configuration"
)
chunking_config: ChunkingConfig = Field(
..., description="Chunking parameter configuration"
)
Attributes¶
- upload_config (VectorDBUploadConfig) - Vector database upload configuration
  - batch_size - Number of rows in a batch per upload request (default=50)
  - max_workers - Number of parallel threads to spawn while uploading rows to the vector DB
- vectordb_config (VectorDBConfig) - Vector database configuration
  - type - Type of the vector DB (default="AzureAISearch")
  - base_url - Vector DB URL
  - api_key - API key required to access the vector DB
  - api_version - API version
  - index - Name of the vector DB index
  - embedding_column - Name of the column in the index that stores embeddings for vector searches (default="text_vector")
  - top_k - Number of top results (k) to return from each vector search request (default=1)
  - content_column - List of columns from which content should be returned in search results and which are to be populated in the vector DB; values are retrieved from TextElement objects or the metadata of Document objects (default=[])
  - timeout - Vector DB POST request timeout in seconds (default=60)
- embedding_config (EmbeddingConfig) - Embedding configuration
  - model - Name of the model to use for embedding; use only models that are available on AI Refinery
  - batch_size - Number of rows in a batch per embedding request (default=50)
  - max_workers - Number of parallel threads to spawn while creating embeddings (default=8)
- chunking_config (ChunkingConfig) - Chunking parameter configuration
  - algorithm - Type of chunking algorithm; options: BruteForceChunking, SemanticChunking
  - chunk_size - Max length per chunk
  - overlap_size - Overlap between two neighboring chunks (default=0)
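Putting these together, a complete configuration looks like the one used in the example above; the vector DB endpoint, API key, and index name are placeholders that must be replaced with your own values.
from air.api.vector_db import VectorDBConfig
from air.types import ChunkingConfig, DocumentProcessingConfig, EmbeddingConfig, VectorDBUploadConfig

doc_process_config = DocumentProcessingConfig(
    vectordb_config=VectorDBConfig(
        base_url="https://<service_base_url>.search.windows.net",
        api_key="<your-api-key>",
        api_version="2023-11-01",
        index="<your-index-name>",
    ),
    upload_config=VectorDBUploadConfig(batch_size=50, max_workers=1),
    embedding_config=EmbeddingConfig(
        model="intfloat/e5-mistral-7b-instruct", batch_size=32, max_workers=1
    ),
    chunking_config=ChunkingConfig(
        algorithm="BruteForceChunking", chunk_size=10, overlap_size=0
    ),
)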
DocumentProcessingClient¶
The DocumentProcessingClient class provides an interface for interacting with AI Refinery's knowledge extraction service, allowing users to extract knowledge (text, tables, images) from five types of input files: PPTX, PDF, DOCX, PPT, and DOC. AIRefinery.document_processing is of type DocumentProcessingClient.
class DocumentProcessingClient:
"""
Interface for interacting with the AI Refinery's knowledge extraction service,
allowing users to extract knowledge from input documents.
"""
Methods¶
__init__¶
Initializes the DocumentProcessingClient instance with an optional base_url parameter.
Parameters:¶
base_url (Optional[str]): Base URL for the API. Defaults to "https://api.airefinery.accenture.com" if not provided.
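In practice, as in the example above, the client is usually obtained from the unified AIRefinery client rather than constructed directly, so the base URL and API key are supplied once when creating that client.
from air.client import AIRefinery

# base_url and api_key as defined in the example above
client = AIRefinery(base_url=base_url, api_key=api_key)
document_processing_client = client.knowledge.document_processing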
create_project¶
Initializes and sets up a knowledge extraction project based on the provided configuration.
Parameters:¶
doc_process_config (DocumentProcessingConfig): Configuration for document processing of type DocumentProcessingConfig; this field is required.
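As in the example above, create_project is called once with the assembled configuration before parsing documents or running the pipeline; doc_process_config here is the object built earlier.
document_processing_client.create_project(doc_process_config=doc_process_config)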
parse_document¶
Extracts text and multimedia content from the given document using the specified knowledge extraction model.
async def parse_document(self, *, file_path: str, model: str, timeout: int | None = None) -> Optional[dict]:
...
Parameters:¶
- file_path (str): Local path of the input file
- model (str): Name of the knowledge extraction model to be used (either knowledge-brain/knowledge-brain or nv-ingest/nv-ingest). knowledge-brain returns a document summary in addition to the extracted document text and can be used on a broader set of file types (PDF, PPTX, DOCX, DOC, PPT); nv-ingest returns results faster but can be used for PDF, PPTX, and DOCX only.
- timeout (Optional[int], defaults to None): Timeout of the document extraction request, in seconds. If set to None, the configured default timeout is used. Increase this parameter according to the content/pages in the document.
Returns:¶
- dict:
  - If successful, returns a dictionary containing the extracted document elements:
    - text (str): Combined extracted text content from the document
    - summaries (dict): Summaries of the document content (included only for model='knowledge-brain')
    - diagrams (List[str]): List of base64-encoded image strings, if any
    - tables (List[str]): Structured table data, if any (included only for model='nv-ingest')
    - file_url (str): URL to the source document (included only for model='knowledge-brain')
  - If unsuccessful, returns a dictionary with a single key:
    - error (str): Description of the error or reason for failure.
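A typical call, following the synchronous usage in the example above, checks for the error key before using the extracted fields; the file path is a placeholder and document_processing_client is the client obtained earlier.
response = document_processing_client.parse_document(
    file_path="<path-to-your-file>",
    model="knowledge-brain/knowledge-brain",
    timeout=300,  # seconds; increase for long documents
)
if response is None or "error" in response:
    print(f"Extraction failed: {response}")
else:
    print(response["text"][:200])      # combined extracted text
    print(response.get("summaries"))   # present only for knowledge-brain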
pipeline¶
Performs a list of tasks specified by the user on a list of documents. Currently supported tasks are: "chunk", "embed", "upload".
- chunk - performs either brute-force chunking, i.e., splitting text into fixed-length chunks, or semantic chunking (splitting based on similarity rather than fixed length) of the input documents
- embed - converts chunks of text (from documents) into dense vector representations using an embedding model
- upload - uploads the final chunked + embedded document data to the vector database
Parameters:¶
- doc_list (List[Document]): A list of Document objects to be processed
- task_list (List[str]): A list of tasks that the user wants to perform. Currently supported tasks are: "chunk", "embed", "upload". To be supported: "de-id", "translate"
Returns:¶
Dict[str, bool]: A dictionary indicating whether each task was completed successfully on all documents. True: completed successfully on all documents. False: otherwise.
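A minimal invocation, following the usage example above, runs all three supported tasks and then inspects the per-task status; documents is assumed to be the list of Document objects constructed earlier, and the status dictionary is assumed to map each task name to its boolean outcome.
pipeline_steps = ["chunk", "embed", "upload"]
status_dict = document_processing_client.pipeline(documents, pipeline_steps)

# Each entry indicates whether that task completed on all documents
for task, ok in status_dict.items():
    print(f"{task}: {'succeeded' if ok else 'failed'}")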