Knowledge Graph API¶
The Knowledge Graph API enables users to use the knowledge extracted from their files to build, analyze, and visualize knowledge graphs. These graphs can then be used to build custom agents that perform knowledge-related question-answering tasks.
Note: Users must deploy their own LLM and embedding models, accessible through either an OpenAI client or an AzureOpenAI client. The AI Refinery deployment is currently not supported by the Knowledge Graph API.
Note: To install the packages required for the Knowledge Graph API, run `pip install "<path-to-air-sdk-whl-file>[knowledge]"`
Example Usage¶
In this example, we show how to:
- Create and initialize a knowledge graph object using the async unified AIR client `AsyncAIRefinery`
- Add and update knowledge in the graph using methods such as `create_project`, `build`, and `update`
- Visualize the knowledge using the `visualize` method
Before running the code, set the following environment variables:
- `KNOWLEDGE_GRAPH_API_BASE_URL`: base URL where the LLM and embedding models are deployed; the URL must be accessible through an OpenAI or AzureOpenAI client.
- `KNOWLEDGE_GRAPH_API_KEY`: corresponding API key required to access the models.

Note: Users must deploy their own models; the AI Refinery deployment URL is not supported.
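For quick local experiments, the two variables can also be set from Python before the client is created. The URL and key below are placeholders, not real endpoints:

```python
import os

# Placeholder values -- substitute your own model deployment details.
os.environ["KNOWLEDGE_GRAPH_API_BASE_URL"] = "https://my-model-deployment.example.com"
os.environ["KNOWLEDGE_GRAPH_API_KEY"] = "my-secret-key"
```

In production, prefer setting these in the shell or a secrets manager rather than hard-coding them in source.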
```python
import asyncio
import os

from air import AsyncAIRefinery, login
from air.types import Document, KnowledgeGraphConfig, TextElement

api_key = str(os.getenv("API_KEY"))
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=api_key,
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

# Initialize the AsyncAIRefinery client
air_client = AsyncAIRefinery(base_url=base_url, api_key=api_key)


async def build_visualize_graph():
    # Initialize a KnowledgeGraphConfig object to be passed
    # when initializing the KnowledgeGraphClient object
    knowledge_graph_config = KnowledgeGraphConfig(
        type="GraphRAG",  # type of knowledge graph: `GraphRAG` or `FastGraphRAG`
        work_dir="../graph_work_dir",  # folder where all knowledge-graph files and work products are stored
        api_type="azure",  # type of model deployment: `openai` or `azure`
        llm_model="deployed-llm-model",  # LLM used to build the knowledge graph and answer queries
        embedding_model="deployed-embedding-model",  # embedding model for text chunks and queries
        chunk_size=1200,  # size of text chunks
        chunk_overlap=200,  # size of overlap between chunks
    )

    # Initialize the knowledge graph client object.
    # The get_graph() method returns an object of type KnowledgeGraphClient;
    # refer to the docs below.
    knowledge_graph_client = await air_client.knowledge.get_graph()

    # Create a project space for the knowledge graph and initialize it
    # with the KnowledgeGraphConfig
    knowledge_graph_client.create_project(graph_config=knowledge_graph_config)

    # Call the build method to build the knowledge graph
    # using the files in the `path-to-folder` folder
    build_status = await knowledge_graph_client.build(files_path="path-to-folder")
    if not build_status:
        print("Build failed!")
        return

    # Update the knowledge graph using a list of Document elements.
    # The texts from TextElements of type `text` within a Document element are
    # combined and then chunked into smaller text units. Any number of Document
    # elements, each with any number of TextElements, can be passed.
    sample_docs = [
        Document(
            filename="test_document",
            file_type="pdf",
            elements=[
                TextElement(
                    id="test-doc-id",
                    text=(
                        "The Sun is the star at the heart of our solar system. "
                        "The sun is about 109 times the diameter of Earth and over "
                        "330,000 times its mass. It generates energy through nuclear "
                        "fusion at its core, where temperatures and pressures are "
                        "unimaginably high. The Sun consists mainly of the elements "
                        "hydrogen and helium. At this time in the Sun's life, they "
                        "account for 74.9% and 23.8%, respectively, of the mass of "
                        "the Sun in the photosphere. Earth is the 3rd planet in the "
                        "Solar System. The Solar System contains 9 planets and one "
                        "star at the center, which is the Sun. All the planets in "
                        "the Solar System revolve around the Sun at various speeds "
                        "and orbits."
                    ),
                    page_number=1,
                    element_type="text",
                )
            ],
        )
    ]

    # Call the `update` method to update the existing knowledge graph with
    # new knowledge from the list of Document elements
    update_status = await knowledge_graph_client.update(docs=sample_docs)
    if not update_status:
        print("Update failed!")
        return

    # Visualize the knowledge graph; set the maximum number of nodes per community
    # and the community level to be visualized.
    # Look for a graph.svg file in the work_dir/output folder.
    visualize_status = knowledge_graph_client.visualize(
        max_community_size=3, community_level=-1
    )
    if not visualize_status:
        print("Visualization failed!")

    # Run a query against the knowledge built so far, using the `local` search method
    query_response = await knowledge_graph_client.query(
        query="What is the Sun made of?", method="local"
    )
    print(query_response)


if __name__ == "__main__":
    asyncio.run(build_visualize_graph())
```
Class Overview¶
KnowledgeGraphConfig¶

```python
class KnowledgeGraphConfig(BaseModel):
    """
    KnowledgeGraph configuration class
    """

    type: str = Field(default="GraphRAG", description="Type of the Knowledge Graph")
    work_dir: str = Field(
        default="graph_dir", description="Workspace directory for the knowledge graph"
    )
    api_type: Literal["openai", "azure"] = Field(
        default="openai",
        description="API type of deployed LLM",
    )
    chunk_size: int = Field(default=1200, description="Size of text chunks")
    chunk_overlap: int = Field(default=100, description="Overlap between text chunks")
    llm_model: str = Field(
        default="meta-llama/Llama-3.1-70B-Instruct",
        description="LLM model to use for knowledge graph tasks",
    )
    embedding_model: str = Field(
        default="intfloat/e5-mistral-7b-instruct",
        description="Embedding model to use for knowledge graph tasks",
    )
```
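As an illustration, a `FastGraphRAG` configuration targeting an OpenAI-style deployment might look like the following. The model names and paths are placeholders, not real deployments:

```python
from air.types import KnowledgeGraphConfig

# Placeholder model names -- substitute your own deployment's identifiers.
config = KnowledgeGraphConfig(
    type="FastGraphRAG",       # cheaper: nltk-based extraction, fewer LLM calls
    work_dir="./my_graph_dir", # all graph artifacts land under this folder
    api_type="openai",
    llm_model="my-deployed-llm",
    embedding_model="my-deployed-embedding-model",
    chunk_size=800,            # smaller chunks than the 1200 default
    chunk_overlap=80,
)
```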
Attributes¶
- `type`: Type of knowledge graph algorithm. Available options are `GraphRAG` and `FastGraphRAG`.
  - GraphRAG uses LLM calls throughout the graph-building and query-answering process.
  - FastGraphRAG uses nltk-based NLP models for entity and relationship extraction, and uses LLM calls for community detection, community report generation, and query answering.
- `work_dir`: Path where the output and files generated during the graph-building process are stored. The resulting `graph.graphml` file and the visualization result `graph.svg` are stored under the `work_dir/output/` folder.
- `api_type`: Type of the LLM and embedding model deployment API; must be either `openai` or `azure`.
- `chunk_size`: Size of text chunks; defaults to 1200.
- `chunk_overlap`: Size of overlap between text chunks; defaults to 100.
- `llm_model`: LLM model used for the graph-building and query-answering process. Used for:
  - Extracting entities and relationships (only for `GraphRAG`)
  - Generating community reports (communities are determined through clustering)
  - Answering queries
- `embedding_model`: Model used to generate embeddings of the text chunks and the query text. Embeddings are used to perform RAG to aid answer generation, and are stored in a local vector DB (lancedb).
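The SDK's internal chunker is not exposed, but the interplay of `chunk_size` and `chunk_overlap` can be sketched at the character level. This is a simplified illustration of the windowing semantics, not the library's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 100) -> list[str]:
    """Split `text` into windows of `chunk_size` characters; consecutive
    windows share `chunk_overlap` characters (step = chunk_size - chunk_overlap)."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# 3000 characters with the defaults (step 1100) yield windows
# starting at offsets 0, 1100, and 2200.
chunks = chunk_text("a" * 3000)
```

A larger overlap reduces the chance that an entity mention is split across a chunk boundary, at the cost of more chunks (and thus more LLM and embedding calls).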
KnowledgeGraphClient¶
The `KnowledgeGraphClient` class provides an interface for the user to build a knowledge graph from their documents, update the knowledge subsequently, visualize the knowledge graph at various community levels, and query the graph. `AsyncAIRefinery.knowledge.get_graph()` returns an object of the `KnowledgeGraphClient` class.

```python
class KnowledgeGraphClient:
    """
    Interface for interacting with the AI Refinery's knowledge extraction service,
    allowing users to extract knowledge from input documents.
    """
```
Methods¶
create_project¶
Initializes and sets up a knowledge graph project based on the provided configuration.
Parameters:¶
- `graph_config` (KnowledgeGraphConfig): Configuration for the knowledge graph, of type `KnowledgeGraphConfig`. This field is required.
build¶
Method to build the knowledge graph, either from the files in a given folder or from a list of `Document` elements. If the graph already exists, the method fails and returns `False`.

```python
async def build(
    self,
    files_path: str | None = None,
    docs: List[Document] | None = None,
) -> bool:
```

Parameters:¶
- `files_path` (str): Folder containing '.txt' files to be used for building the knowledge graph. If this is not set, the `docs` argument is required.
- `docs` (list[Document]): List of `Document` elements whose `text`-type elements will be added to the knowledge graph. Check the Document class definition here. If this is not set, the `files_path` argument is required.
Returns:¶
- `bool`: Returns True if successful, else False.
update¶
Method to update the knowledge graph, either from the files in a given folder or from a list of `Document` elements. The `build` method must have been run, and the knowledge graph must already exist under the `work_dir` (the folder set in the KnowledgeGraphConfig where all knowledge-graph related files are stored), before this method is run.
This method can only add knowledge to the pre-existing graph; it cannot remove pre-existing knowledge.

```python
async def update(
    self,
    files_path: str | None = None,
    docs: List[Document] | None = None,
) -> bool:
```

Parameters:¶
- `files_path` (str): Folder containing '.txt' files to be used for updating the knowledge graph. If this is not set, the `docs` argument is required.
- `docs` (list[Document]): List of `Document` elements whose `text`-type elements will be added to the knowledge graph. Check the Document class definition here. If this is not set, the `files_path` argument is required.
Returns:¶
- `bool`: Returns True if successful, else False.
query¶
Method to query the knowledge graph and get an answer.
Parameters:¶
- `query` (str): Query string.
- `method` (str): Search method used to generate the answer to the query. Available options are `basic`, `local`, `global`, and `drift`.
  - basic - Similar to basic RAG: creates an embedding of the query, retrieves relevant text chunks by comparing the query embedding against the text-chunk embeddings, and passes the retrieved chunks to the LLM to generate an answer to the query.
  - local - The local search method combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information at query time. It is well suited for answering questions that require an understanding of specific entities mentioned in the input documents, e.g., "What are the healing properties of chamomile?"
  - global - The global search method uses the LLM-generated, pre-summarized, meaningful semantic clusters to answer the user query. This method is most useful for questions related to the broader themes of the data/knowledge, e.g., "What are the top 5 themes in the data?"
  - drift - DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal) uses community report information, local search, and follow-up questions to generate content-rich answers. This method is most helpful for queries about a specific entity where the user expects an answer that paints a bigger picture of that entity, e.g., "What is AI Refinery?"
Returns:¶
- `Union[str, None]`: If successful, returns the generated answer to the query, else returns None.
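The choice of method is left to the caller. As a rough illustration of the guidance above, a hypothetical helper (not part of the SDK) might route queries like this:

```python
def pick_method(query: str) -> str:
    """Hypothetical heuristic mapping a query to a search method,
    following the rules of thumb described above."""
    q = query.lower()
    if any(word in q for word in ("themes", "overall", "summarize", "top ")):
        return "global"  # broad, dataset-wide questions
    if q.startswith(("what is ", "who is ")):
        return "drift"  # entity-centric questions needing a big-picture answer
    return "local"  # questions about specific entities and details


method = pick_method("What are the top 5 themes in the data?")
```

In practice, the method is usually chosen per use case rather than per query; keyword routing like this is only a sketch.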
visualize¶
Method to visualize the graph and generate an SVG image of it. Uses the `graph.graphml` file, generated by the `build` and `update` methods, under the `work_dir/output` folder. Set the optional parameters to cluster and/or filter the graph before visualizing.
In the resulting SVG file:
- Nodes of the same color in a connected component belong to the same community.
- Lighter-colored edges carry more weight.
- Darker-colored edges carry less weight.

```python
def visualize(
    self,
    max_community_size: int | None = None,
    community_level: int | None = None,
    figsize: tuple[float, float] = (36.0, 20.0),
    default_node_sizes: int = 500,
    fig_format: str = "svg",
    dpi: int = 300,
    font_size: int = 10,
    scale_factor: int = 20,
) -> bool:
```

Parameters¶
- `max_community_size` (Optional[int]): Maximum number of nodes in a cluster/community. If None, clustering is skipped. Defaults to None. On some occasions a cluster may contain more than `max_community_size` nodes if it cannot be broken down further.
- `community_level` (Optional[int]): Level of the community to retain. If the value is greater than the largest community level in the graph, all nodes are retained.
- `figsize` (Optional[tuple[float, float]]): The (width, height) of the matplotlib figure, in inches. Default is (36.0, 20.0).
- `default_node_sizes` (Optional[int]): Default size for nodes if not specified in the graphml node attributes. Default is 500.
- `fig_format` (Optional[str]): The format for the output image file. Common values: 'svg', 'png', 'pdf'. Default is 'svg'.
- `dpi` (Optional[int]): Dots per inch for the output image, controlling resolution. Default is 300.
- `font_size` (Optional[int]): Font size for node labels in the plot. Default is 10.
- `scale_factor` (Optional[int]): Factor for scaling the size of nodes. Default is 20.
Returns¶
- `bool`: Returns True if successful, else False.