Explore the Capabilities of the Knowledge Extraction Agent

Overview

The Knowledge Extraction API allows users to send a document and extract the knowledge contained within it. It can perform knowledge extraction on various file formats, including PDF, PPT, and DOC. Currently, knowledge extraction supports information in the form of text, tables, and figures.

Goals

The goals of this tutorial are to demonstrate how to use the AIRefinery client to extract information from a document and to explain the output parameters. By the end, you will know how to use the AIRefinery client to extract knowledge from a list of your documents and how to consume it in downstream tasks, e.g., with a Research Agent.

Configuration

In this tutorial, we need two configuration files:

  1. rag_example_knowledge.yaml to configure parameters for AIRefinery.knowledge.document_processing, which converts documents into searchable knowledge, and

  2. example_distiller.yaml to set up an AI Refinery project with two agents: a custom agent, "Knowledge Build Agent", that calls AIRefinery.knowledge.document_processing, and a built-in agent, "Knowledge QA Agent" (essentially an AI Research agent), that answers the user's questions based on the knowledge created by the first agent.

Here is rag_example_knowledge.yaml, which specifies how AIRefinery.knowledge.document_processing should divide each large document into smaller, manageable pieces, convert their text into the embedding space, and upload the result to a vector database. See the documentation for an explanation of each attribute and how to configure the YAML file correctly.

yaml-schema: knowledge-local

embedding_config:
  model: embedding_model
  batch_size: 32
  max_workers: 2

vectordb_config:
  type: AzureAISearch
  base_url: <your_service_url>
  api_key: <service_api_key>
  index: <ai_search_index_name>
  api_version: 2023-11-01
  embedding_column: text_vector
  top_k: 1
  content_column:
    - id
    - text
  timeout: 10

upload_config:
  batch_size: 50
  max_workers: 2

chunking_config:
  algorithm: BruteForceChunking
  chunk_size: 50
  overlap_size: 0
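
To make the chunking parameters concrete, here is a minimal sketch of what brute-force chunking with chunk_size: 50 and overlap_size: 0 amounts to: the text is sliced into fixed-size windows, and consecutive windows share no content. This is an illustration only, assuming word-level units; the library's actual BruteForceChunking implementation may differ (e.g., it may count tokens rather than words).

```python
def brute_force_chunk(text: str, chunk_size: int = 50, overlap_size: int = 0):
    """Split text into fixed-size word windows; consecutive windows share
    `overlap_size` words. Illustrative sketch, not the SDK implementation."""
    words = text.split()
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks


# 120 words with chunk_size=50, overlap_size=0 -> windows of 50, 50, and 20 words
chunks = brute_force_chunk("word " * 120, chunk_size=50, overlap_size=0)
print(len(chunks))  # 3
```

A non-zero overlap_size would make adjacent chunks share their boundary words, which can help retrieval when an answer straddles a chunk boundary, at the cost of a larger index.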

Below is example_distiller.yaml, which specifies the details of the two agents in the agentic workflow:

orchestrator:
  agent_list:
    - agent_name: "Knowledge Build Agent"
    - agent_name: "Knowledge QA Agent"


utility_agents:
  - agent_class: CustomAgent
    agent_name: "Knowledge Build Agent"
    agent_description: "This agent parses specified files, extracts knowledge, and accordingly creates a knowledge database."
    config: {}

  - agent_class: ResearchAgent
    agent_name: Knowledge QA Agent
    agent_description: |
      This agent answers questions based on knowledge in its database.
    config:
      reranker_top_k: 2
      compression_rate: 1

      retriever_config_list:
        - retriever_name: "knowledge test database"
          retriever_class: AzureAISearchRetriever
          description: "Knowledge base built upon technical documents"

          aisearch_config:
            base_url: <your_service_url>
            api_key: <service_api_key>
            index: <ai_search_index_name>

            embedding_column: "text_vector"
            embedding_config:
              model: "intfloat/e5-mistral-7b-instruct"

            top_k: 4

            content_column:
              - "id"
              - "text"
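
The retriever's top_k: 4 and the agent's reranker_top_k: 2 work together: the retriever fetches the four nearest chunks from the index, and a reranker keeps only the best two for answer generation. The sketch below illustrates that narrowing step; the score values and the helper name are illustrative, not the SDK's API, and the real agent computes relevance scores internally.

```python
def rerank(candidates, reranker_top_k=2):
    """Keep the highest-scoring candidates. `score` stands in for a reranker
    model's relevance score; this is a sketch of the narrowing step only."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return ranked[:reranker_top_k]


# top_k: 4 -> the retriever returns four candidate chunks from the index
retrieved = [
    {"id": "chunk_a", "score": 0.61},
    {"id": "chunk_b", "score": 0.92},
    {"id": "chunk_c", "score": 0.45},
    {"id": "chunk_d", "score": 0.88},
]

# reranker_top_k: 2 -> only the two most relevant chunks reach the LLM
kept = rerank(retrieved, reranker_top_k=2)
print([c["id"] for c in kept])  # ['chunk_b', 'chunk_d']
```

Fetching more candidates than the reranker keeps gives the reranker room to correct the retriever's ordering; compression_rate then controls how much of the kept text is passed along.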

Python File

To use the Knowledge Extraction API, you need the local file paths of the documents to process and the knowledge-extraction model to use. The code snippet below uses AIRefinery.knowledge.document_processing to extract knowledge from a folder of PDF files; you can apply it to any valid documents.

import logging
import os
import uuid

from omegaconf import OmegaConf

from air import login
from air.client import AIRefinery, AsyncAIRefinery
from air.types import Document, DocumentProcessingConfig, TextElement

logger = logging.getLogger(__name__)

api_key = str(os.getenv("API_KEY"))
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=api_key,
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

rag_config = OmegaConf.load("rag_example_knowledge.yaml")
client = AIRefinery(base_url=base_url, api_key=api_key)
document_processing_client = client.knowledge.document_processing
document_processing_client.create_project(doc_process_config=DocumentProcessingConfig(**rag_config))  # type: ignore

async_client = AsyncAIRefinery(base_url=base_url, api_key=api_key)  # the distiller is available only in the async AIRefinery client


async def knowledge_build_agent(query: str):
    """
    Document upload agent
    """
    source_files_folder = "test_files"
    ocr_model = "knowledge-brain/knowledge-brain"  # other available model: nv-ingest/nv-ingest
    documents = []
    try:
        print("\n%%% AGENT Knowledge Build Agent %%%\nParsing documents...\n")
        for filename in os.listdir(source_files_folder):
            # parse documents: extract content from the given document using the specified ocr model and prepare documents for pipeline
            document_parsing_response = document_processing_client.parse_document(
                file_path=os.path.join(source_files_folder, filename), model=ocr_model, timeout=300
            )
            if "error" in document_parsing_response:
                return "Error in document parsing"
            # Convert response to Document to use in pipeline
            text_element = TextElement(
                id=str(uuid.uuid4()),
                text=document_parsing_response["text"],
                page_number=1,
                element_type="text",
            )
            document = Document(
                filename=filename, file_type="PDF", elements=[text_element]
            )
            documents.append(document)
        print("%%% AGENT Knowledge Build Agent %%%\nRunning Index upload pipeline...\n")
        pipeline_steps = ["chunk", "embed", "upload"]
        # execute pipeline: chunk, embed and upload to vector db from the list of documents
        status_dict = document_processing_client.pipeline(documents, pipeline_steps)
        if False in status_dict.values():
            logger.error("Index upload pipeline failed")
            return "Index upload pipeline failed"
        return "Completed processing and uploading all available documents"
    except Exception as e:
        err_msg = f"[Knowledge_build_agent] document processing and uploading failed. Exception {e}"
        logger.error(err_msg)
        response = "Cannot complete"
    return response


if __name__ == "__main__":
    distiller_client = async_client.distiller

    PROJECT = "knowledge_rag"

    distiller_client.create_project(
        config_path="example_distiller.yaml", project=PROJECT
    )
    executor_dict = {
        "Knowledge Build Agent": knowledge_build_agent,
    }
    distiller_client.interactive(
        project=PROJECT, uuid="test", executor_dict=executor_dict  # type: ignore
    )

Result

The Knowledge Build Agent processes files located in test_files/, extracts relevant knowledge, and constructs a knowledge database. You can trigger this agent with a prompt like "extract knowledge from my files" or "please upload knowledge to database", prompting it to parse the specified documents. The content is then chunked, embedded, and stored in a vector database. The Knowledge QA Agent uses this vector database to answer user queries based on the extracted and structured knowledge.

[Figure: Knowledge Extraction and QA agents]