Retrievers Gallery¶
Explore the retrievers supported by the ResearchAgent of the AI Refinery SDK, designed to fetch relevant information from various sources based on user queries. Supported retrievers include:
- WebSearchRetriever: Access real-time web data.
- AzureAISearchRetriever: Perform semantic search over an Azure-hosted vector database index.
- ElasticSearchRetriever: Employ Elasticsearch for scalable search solutions.
- OpenSearchRetriever: Perform semantic, keyword, or hybrid search over OpenSearch-hosted indices.
- LLMsTxtRetriever: Retrieve relevant content from a website guided by its llms.txt file.
- CustomRetriever: Create your own retrievers, tailored to specific needs.
WebSearchRetriever¶
The WebSearchRetriever is designed to perform web searches using external search engines. The currently supported search engine is Google Search. It is ideal for retrieving the latest public information from the internet.
Configuration Template¶
Here is the configuration template for the WebSearchRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: WebSearchRetriever # Required: Specifies use of the web search retriever
  description: <optional-description> # Optional: Brief description of what this retriever is used for
  query_transformation_examples: # Optional: Helps transform complex user queries into effective web search queries
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>
  source_weight: <weight> # Optional: Importance weight relative to other retrievers (default: 1.0)
Use Case¶
The WebSearchRetriever is well-suited for retrieving publicly available information from the open internet, similar to a traditional search engine. Typical use cases include:
- General knowledge and fact-finding
- News updates and trending topics
- Technical explanations or documentation
- Comparative research on tools, services, or ideas
- Any query requiring up-to-date or web-accessible content
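As a concrete illustration of the template, the sketch below fills in its fields as a Python dictionary that mirrors the YAML structure. The retriever name, description, and example queries are made-up values, and `missing_required_fields` is a hypothetical helper for checking the two fields marked as required; it is not part of the SDK.

```python
from typing import Any, Dict, List

# Hypothetical values that mirror the YAML template for WebSearchRetriever.
web_retriever: Dict[str, Any] = {
    "retriever_name": "Public web search",
    "retriever_class": "WebSearchRetriever",
    "description": "Searches the public web for recent information.",
    "query_transformation_examples": [
        {
            "user_query": "How do the newest electric cars compare on range?",
            "query": [
                "electric car range comparison",
                "longest range electric vehicles",
            ],
        }
    ],
    "source_weight": 1.0,
}

# Fields the template marks as required.
REQUIRED_FIELDS = ("retriever_name", "retriever_class")


def missing_required_fields(config: Dict[str, Any]) -> List[str]:
    """Return the required template fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


print(missing_required_fields(web_retriever))
```

Each `query` list shows how one conversational question can be split into shorter, search-engine-friendly queries.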
AzureAISearchRetriever¶
The AzureAISearchRetriever is designed to perform vector-based searches over an index hosted on Azure. It is ideal for retrieving information from pre-indexed datasets.
Configuration Template¶
Here is the configuration template for the AzureAISearchRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: AzureAISearchRetriever # Required: Use this retriever for Azure-hosted vector search
  description: <optional-description> # Optional: Brief explanation of what this retriever is used for
  aisearch_config:
    base_url: <your-base-url> # Required: Base URL of your Azure vector search endpoint
    api_key: <your-api-key> # Required: Azure AISearch service API key
    index: <your-index-name> # Required: Name of the vector index to search
    embedding_column: <embedding-column-name> # Required: Column in your index containing embedded data
    embedding_config:
      model: <embedding-model-name> # Required: Must match the model used during indexing
    top_k: <number-of-results> # Optional: Number of top documents to retrieve
    content_column: # Required: Column(s) containing retrievable content
      - <content-column-1>
      - <content-column-2>
    aggregate_column: <optional-aggregate-column> # Optional: Used to group chunks by document
    meta_data: # Optional: Metadata fields to enrich the response
      - column_name: <source-column-name> # Required within meta_data
        load_name: <display-name> # Required within meta_data
  query_transformation_examples: # Optional: User-to-search query examples for improved relevance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>
  source_weight: <weight-value> # Optional: Importance weight relative to other retrievers (default: 1.0)
Use Case¶
The AzureAISearchRetriever is ideal for retrieving information from pre-indexed datasets via semantic search. It's best used in scenarios such as:
- Internal knowledge base queries
- Organizational content search
- Semantic search over embedded data
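The `aggregate_column` option in the template is described as grouping chunks by document. A minimal sketch of that idea follows, assuming each hit is a dictionary and that grouped chunks are simply joined in retrieval order. The SDK's actual merging behavior may differ, and `group_chunks` and the column names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_chunks(
    hits: List[Dict[str, Any]], content_column: str, aggregate_column: str
) -> Dict[str, str]:
    """Join the content of all hits that share the same aggregate value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for hit in hits:
        grouped[hit[aggregate_column]].append(hit[content_column])
    # Dicts preserve insertion order, so documents appear in first-hit order.
    return {doc: "\n".join(chunks) for doc, chunks in grouped.items()}


hits = [
    {"doc_id": "handbook", "text": "Vacation policy ..."},
    {"doc_id": "faq", "text": "Password resets ..."},
    {"doc_id": "handbook", "text": "Sick leave ..."},
]
print(group_chunks(hits, content_column="text", aggregate_column="doc_id"))
```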
ElasticSearchRetriever¶
The ElasticSearchRetriever is designed to perform vector-based searches over an index hosted in ElasticSearch. It also works well for retrieving information from structured or pre-indexed datasets.
Configuration Template¶
Here is the configuration template for the ElasticSearchRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: ElasticSearchRetriever # Required: Use this retriever for ElasticSearch-based vector search
  description: <optional-description> # Optional: Brief explanation of what this retriever is used for
  elasticsearch_config:
    base_url: <your-elasticsearch-url> # Required: Endpoint of your ElasticSearch service
    api_key: <your-api-key> # Required: Service API key
    index: <your-index-name> # Required: Name of the ElasticSearch index
    embedding_column: <embedding-column-name> # Required: Column storing vector embeddings
    embedding_config:
      model: <embedding-model-name> # Required: Must match the model used during data embedding
    top_k: <number-of-results> # Optional: Number of top documents to retrieve
    content_column: # Required: Column(s) containing content to retrieve
      - <content-column-1>
      - <content-column-2>
    aggregate_column: <optional-aggregate-column> # Optional: Group chunks by original document
    meta_data: # Optional: Metadata fields to include in results
      - column_name: <metadata-field> # Required within meta_data
        load_name: <display-label> # Required within meta_data
    threshold: <float-between-0-and-1> # Optional: Filters out low-quality chunks (default: 0.9)
  query_transformation_examples: # Optional: Transforms user queries for better search performance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>
  source_weight: <weight-value> # Optional: Weight of this retriever relative to others (default: 1.0)
Use Case¶
The ElasticSearchRetriever is ideal for retrieving semantically relevant information from ElasticSearch-hosted content repositories. It excels in use cases such as:
- Internal knowledge base queries
- Organizational content search
- Semantic search over embedded data
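The `threshold` and `top_k` settings in the template both limit what is returned. The sketch below shows one plausible reading, assuming scores lie in the 0 to 1 range, that chunks scoring below the threshold are dropped, and that `top_k` is applied afterwards. The retriever's real ordering of these steps is not specified here, and `select_chunks` is a hypothetical helper.

```python
from typing import List, Tuple


def select_chunks(
    scored: List[Tuple[str, float]], threshold: float = 0.9, top_k: int = 3
) -> List[Tuple[str, float]]:
    """Keep chunks at or above the threshold, best first, at most top_k."""
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


scored = [("chunk-a", 0.95), ("chunk-b", 0.42), ("chunk-c", 0.91)]
print(select_chunks(scored, threshold=0.9, top_k=2))
```

Under these assumptions, lowering `threshold` admits more borderline chunks, while `top_k` caps the total regardless of how many pass.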
OpenSearchRetriever¶
The OpenSearchRetriever is designed to perform searches over an index hosted in OpenSearch. It supports four search modes: K-Nearest Neighbors (KNN), keyword search (match), hybrid search (combined search), and filtered KNN (vector search with filters).
Search Modes¶
The OpenSearchRetriever supports four search modes:
- knn (K-Nearest Neighbors): Pure vector similarity search using embeddings. Best for semantic search.
  - Requires: embedding_column, embedding_config
- match (Keyword Search): Traditional keyword matching using the BM25 algorithm. No embeddings required.
  - Requires: search_fields
- hybrid (Combined Search): Combines vector similarity and keyword matching with configurable weights.
  - Requires: embedding_column, embedding_config, search_fields, vector_weight, text_weight
  - Note: vector_weight + text_weight should equal 1.0
- filtered_knn (Vector Search with Filters): Vector search with metadata filtering that restricts results to documents where specific metadata fields match given values.
  - Requires: embedding_column, embedding_config, filters
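The per-mode requirements listed above can be turned into a small pre-flight check. The sketch below is a hypothetical helper, not part of the SDK: it treats every listed field as mandatory, even though the configuration template gives defaults for some of them (for example `embedding_column`, `vector_weight`, and `text_weight`), so it is stricter than the retriever itself may be.

```python
from typing import Any, Dict, List

# Fields each search mode needs, as listed above.
REQUIRED_BY_MODE = {
    "knn": ["embedding_column", "embedding_config"],
    "match": ["search_fields"],
    "hybrid": ["embedding_column", "embedding_config", "search_fields",
               "vector_weight", "text_weight"],
    "filtered_knn": ["embedding_column", "embedding_config", "filters"],
}


def check_opensearch_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means none were found."""
    mode = config.get("search_mode", "knn")
    if mode not in REQUIRED_BY_MODE:
        return [f"unknown search_mode: {mode!r}"]
    problems = [
        f"{mode} mode requires {field}"
        for field in REQUIRED_BY_MODE[mode]
        if field not in config
    ]
    # Only check the weight sum once both weights are known to be present.
    if mode == "hybrid" and not problems:
        total = config["vector_weight"] + config["text_weight"]
        if abs(total - 1.0) > 1e-9:
            problems.append(f"vector_weight + text_weight is {total}, expected 1.0")
    return problems


print(check_opensearch_config({"search_mode": "match", "index": "documents"}))
```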
Configuration Template¶
Here is the configuration template for the OpenSearchRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: OpenSearchRetriever # Required: Use this retriever for OpenSearch-based search
  description: <optional-description> # Optional: Brief explanation of what this retriever is used for
  opensearch_config:
    # Connection settings
    host: <opensearch-host> # Optional: OpenSearch host address (default: "localhost")
    port: <opensearch-port> # Optional: OpenSearch port (default: 9200)
    use_ssl: <true-or-false> # Optional: Use HTTPS connection (default: false)
    verify_certs: <true-or-false> # Optional: Verify SSL certificates (default: false)
    http_auth: [<username>, <password>] # Optional: Authentication credentials as a list
    # Index settings
    index: <your-index-name> # Required: Name of the OpenSearch index
    embedding_column: <embedding-column-name> # Optional: Column storing vector embeddings (default: "embedding")
    content_column: # Required: Column(s) containing content to retrieve
      - <content-column-1>
      - <content-column-2>
    top_k: <number-of-results> # Optional: Number of top documents to retrieve (default: 10)
    # Search mode configuration
    search_mode: <search-mode> # Optional: "knn", "match", "hybrid", or "filtered_knn" (default: "knn")
    # Embedding configuration (required for knn, hybrid, and filtered_knn modes)
    embedding_config:
      model: <embedding-model-name> # Required: Must match the model used during indexing
    # Text search settings (required for match and hybrid modes)
    search_fields: # Required for match/hybrid: Fields for text search
      - <text-field-1>^<boost> # Optional boost factor (e.g., "title^2")
      - <text-field-2>
    # Hybrid mode settings (only for search_mode: "hybrid")
    vector_weight: <weight-value> # Optional: Weight for vector component (default: 0.6)
    text_weight: <weight-value> # Optional: Weight for text component (default: 0.4)
    # Filtered KNN settings (only for search_mode: "filtered_knn")
    filters: # Optional: Metadata filters for filtered_knn
      <field-name>: <field-value> # Single value filter
      # <field-name>: [<value1>, <value2>] # Multiple values (OR condition)
    # Optional settings
    aggregate_column: <optional-aggregate-column> # Optional: Group chunks by original document
    meta_data: # Optional: Metadata fields to include in results
      - column_name: <metadata-field> # Required within meta_data
        load_name: <display-label> # Required within meta_data
    threshold: <float-between-0-and-1> # Optional: Filters out low-quality chunks (default: 0.9)
  query_transformation_examples: # Optional: Transforms user queries for better search performance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>
  source_weight: <weight-value> # Optional: Weight of this retriever relative to others (default: 1.0)
Configuration Examples¶
Example 1: KNN (Semantic Search)
opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "knn"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  top_k: 5
Example 2: Match (Keyword Search)
opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  content_column: ["content", "title"]
  search_mode: "match"
  search_fields: ["title^2", "content"]
  top_k: 5
Example 3: Hybrid Search
opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "hybrid"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  search_fields: ["title^2", "content"]
  vector_weight: 0.6
  text_weight: 0.4
  top_k: 5
Example 4: Filtered KNN
opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "filtered_knn"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  filters:
    category: "AI"
    status: "published"
  top_k: 5
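To give a feel for what a filtered KNN request can look like on the OpenSearch side, the sketch below builds a k-NN query body whose `filter` clause uses `term` for single values and `terms` for lists (the OR condition mentioned in the template). This is an assumption about one reasonable mapping onto the OpenSearch query DSL; the query the retriever actually sends may be shaped differently, and both helper functions are hypothetical.

```python
from typing import Any, Dict, List, Union

FilterValue = Union[str, int, float, bool, List[Any]]


def build_filter_clauses(filters: Dict[str, FilterValue]) -> List[Dict[str, Any]]:
    """Map each filter to a `term` (single value) or `terms` (any-of) clause."""
    clauses: List[Dict[str, Any]] = []
    for field, value in filters.items():
        if isinstance(value, list):
            clauses.append({"terms": {field: value}})
        else:
            clauses.append({"term": {field: value}})
    return clauses


def build_filtered_knn_query(
    vector: List[float],
    embedding_column: str,
    top_k: int,
    filters: Dict[str, FilterValue],
) -> Dict[str, Any]:
    """Assemble a k-NN query body restricted by metadata filters."""
    return {
        "size": top_k,
        "query": {
            "knn": {
                embedding_column: {
                    "vector": vector,
                    "k": top_k,
                    "filter": {"bool": {"must": build_filter_clauses(filters)}},
                }
            }
        },
    }


query = build_filtered_knn_query(
    vector=[0.1, 0.2, 0.3],  # placeholder embedding, not a real model output
    embedding_column="embedding",
    top_k=5,
    filters={"category": "AI", "status": ["published", "reviewed"]},
)
print(query["query"]["knn"]["embedding"]["filter"])
```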
Use Case¶
The OpenSearchRetriever is ideal for retrieving information from OpenSearch-hosted content repositories with flexible search strategies. It excels in use cases such as:
- Semantic Search: Use KNN mode for meaning-based retrieval from embedded data
- Keyword Search: Use match mode for traditional BM25-based text matching
- Hybrid Search: Combine semantic and keyword search for best search quality
- Filtered Search: Apply metadata filters to narrow down vector search results
- Internal Knowledge Bases: Query organizational content with multiple search strategies
- Document Management Systems: Retrieve documents using the most appropriate search method
Important Notes
- Embedding Model Consistency: The embedding_config.model must match the model used to create embeddings in your OpenSearch index. Mismatched models will produce poor search results.
- Pre-existing Data: The OpenSearch retriever expects data to already exist in your OpenSearch index. Use OpenSearch's native tools (bulk API, Logstash, etc.) to ingest and index data.
- Hybrid Mode Weights: When using hybrid mode, ensure vector_weight + text_weight = 1.0 for proper score normalization.
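To see why the hybrid weights are expected to sum to 1.0, consider a weighted sum of a vector score and a text score that are each already normalized to the 0 to 1 range: the combined score then stays in that range as well. The snippet below is an illustrative assumption about how hybrid scores could be combined, not a description of the retriever's internals, and `combine_scores` is a hypothetical helper.

```python
def combine_scores(
    vector_score: float,
    text_score: float,
    vector_weight: float = 0.6,
    text_weight: float = 0.4,
) -> float:
    """Weighted sum of two scores that are each assumed to lie in [0, 1]."""
    return vector_weight * vector_score + text_weight * text_score


# With weights summing to 1.0, two perfect scores combine to about 1.0.
best = combine_scores(1.0, 1.0)

# Weights summing to more than 1.0 push the result past the top of the range.
inflated = combine_scores(1.0, 1.0, vector_weight=0.8, text_weight=0.6)

print(best, inflated)
```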
LLMsTxtRetriever¶
The LLMsTxtRetriever is designed to retrieve content relevant to user queries from a website, guided by its llms.txt file.
Configuration Template¶
Here is the configuration template for the LLMsTxtRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: LLMsTxtRetriever # Required: Use this retriever for llms.txt based search
  description: <optional-description> # Optional: Brief explanation of what this retriever is used for
  llms_txt_config:
    llms_txt_url: <your-base-url> # Required: The direct URL to the llms.txt file
    use_all_urls: false # Optional: If False (default), use URLs relevant to the query; if True, use ALL URLs from llms.txt
    url_filter_include_terms: ["model", "agent", "api"] # Optional: List of terms to include when filtering URLs from llms.txt; None (default) disables filtering
    url_filter_exclude_terms: ["deprecated", "legacy"] # Optional: List of terms to exclude when filtering URLs from llms.txt; None (default) disables filtering
    enable_async_query_page_analysis: true # Optional: If True (default), return query-relevant information generated by LLMs; otherwise, return raw content
    enable_caching: true # Optional: If True (default), enables caching of model responses
  source_weight: <weight-value> # Optional: Importance weight relative to other retrievers (default: 1.0)
Use Case¶
The LLMsTxtRetriever is ideal for retrieving information from websites that publish an llms.txt file. Guided by that file, the LLMsTxtRetriever can navigate websites with complex structures and retrieve relevant information for better retrieval-augmented generation.
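The `url_filter_include_terms` and `url_filter_exclude_terms` options narrow down which URLs from llms.txt are considered. The sketch below shows one plausible interpretation, assuming case-insensitive substring matching where a URL is kept if it contains at least one include term and no exclude term. The retriever's exact matching rules may differ; `filter_urls` and the example URLs are hypothetical.

```python
from typing import List, Optional


def filter_urls(
    urls: List[str],
    include_terms: Optional[List[str]] = None,
    exclude_terms: Optional[List[str]] = None,
) -> List[str]:
    """Keep URLs matching any include term and none of the exclude terms."""
    kept = []
    for url in urls:
        lowered = url.lower()
        # None disables the include filter, mirroring the template's default.
        if include_terms is not None and not any(
            term.lower() in lowered for term in include_terms
        ):
            continue
        # None disables the exclude filter as well.
        if exclude_terms is not None and any(
            term.lower() in lowered for term in exclude_terms
        ):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/agent/overview.md",
    "https://example.com/docs/legacy/agent.md",
    "https://example.com/docs/changelog.md",
]
print(filter_urls(urls, ["model", "agent", "api"], ["deprecated", "legacy"]))
```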
CustomRetriever¶
The CustomRetriever allows you to design retrievers tailored to your specific use cases, enabling retrieval of information from unique or specialized data sources.
Configuration Template¶
Below is an example configuration for setting up a CustomRetriever:
- retriever_name: <your-retriever-name> # Required: A custom name for this retriever instance
  retriever_class: CustomRetriever # Required: Use this retriever to run your own retrieval function
  description: <optional-description> # Optional: A description of the retriever
  # Any other arbitrary config that your CustomRetriever needs
  your_arbitrary_config_1: <config-value>
  your_arbitrary_config_2: <config-value>
  your_arbitrary_config_n: <config-value>
Implementation Instructions¶
Retriever Function Template¶
You need to implement the logic for your CustomRetriever within a Python function. Below is the template for that function:
from typing import Any, Dict, List


async def your_custom_retriever(
    query: str,
    your_arbitrary_config_1: Any,
    your_arbitrary_config_2: Any,
    your_arbitrary_config_n: Any,
) -> List[Dict[str, Any]]:
    """
    Retrieves information based on the provided query.

    Args:
        query (str): The query string used to search for relevant information.
        your_arbitrary_config_1 (Any): An arbitrary configuration parameter with unspecified type.
        your_arbitrary_config_2 (Any): Another arbitrary configuration parameter with unspecified type.
        your_arbitrary_config_n (Any): Another arbitrary configuration parameter with unspecified type.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries, each containing:
            - "result" (str): A string representing the retrieved text content.
            - "score" (int or float): A numeric relevance score indicating how well the result matches the query.
            - "source" (str or None): A string representing an identifier for the source of the retrieved item, or None if not available.

    Note: If an error occurs or no documents are found, return [{"result": "", "score": 0, "source": None}].
    """
    # Placeholder: replace with your retrieval logic.
    return [{"result": "", "score": 0, "source": None}]
All the arbitrary configurations you specified in the retriever's YAML configuration will be passed as input arguments to this function. You will have access to these configurations within your retriever function.
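As a concrete but deliberately simple instance of this template, the sketch below implements a keyword-overlap retriever over an in-memory dictionary. The corpus, the function name `keyword_retriever`, and the `max_results` parameter (standing in for an arbitrary YAML config key) are all hypothetical; a real retriever would query your own data source instead.

```python
import asyncio
from typing import Any, Dict, List

# Hypothetical in-memory corpus standing in for a real data source.
DOCUMENTS = {
    "faq.md": "Reset your password from the account settings page.",
    "billing.md": "Invoices are issued on the first day of each month.",
}


async def keyword_retriever(query: str, max_results: int = 3) -> List[Dict[str, Any]]:
    """Score each document by the fraction of query words it contains."""
    words = set(query.lower().split())
    results = []
    for source, text in DOCUMENTS.items():
        overlap = words & set(text.lower().rstrip(".").split())
        if overlap:
            results.append(
                {"result": text, "score": len(overlap) / len(words), "source": source}
            )
    results.sort(key=lambda item: item["score"], reverse=True)
    # Fall back to the documented empty result when nothing matches.
    return results[:max_results] or [{"result": "", "score": 0, "source": None}]


print(asyncio.run(keyword_retriever("password reset", max_results=1)))
```

Here `max_results` would be supplied from the retriever's YAML entry, in the same way as the `your_arbitrary_config_*` parameters in the template.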
⚠️ Warning: The previous output format, with only "result" and "score" fields, is still supported for existing implementations. Please update to the new format soon, as the old format may be deprecated in future versions.
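If an existing retriever still returns the older two-field format, a small adapter can bring its output in line with the new one. This is a minimal sketch, assuming the only difference between the formats is the missing "source" key; `upgrade_results` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict, List


def upgrade_results(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Add "source": None to items that only carry "result" and "score"."""
    return [{**item, "source": item.get("source")} for item in results]


old_style = [{"result": "Some retrieved text", "score": 0.8}]
print(upgrade_results(old_style))
```

Items that already include a "source" value pass through unchanged.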
Integration to executor_dict¶
Once you've defined your retriever function, you need to incorporate it into the executor_dict of your project using the following format:
executor_dict = {
    "<name-of-your-research-agent>": {
        "<your-custom-retriever-name>": your_custom_retriever,
    }
}
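To make the data flow concrete, the sketch below imitates in plain Python how a function registered in executor_dict could be looked up by agent and retriever name and called with the extra YAML keys as keyword arguments. The `call_retriever` helper, the agent and retriever names, and the `collection` config key are all hypothetical; the SDK performs its own dispatch internally, and that code is not shown here.

```python
import asyncio
from typing import Any, Dict, List


async def your_custom_retriever(query: str, collection: str) -> List[Dict[str, Any]]:
    """Toy retriever that echoes its inputs in the required output format."""
    return [{"result": f"{collection}: {query}", "score": 1.0, "source": collection}]


executor_dict = {
    "Research Agent": {
        "My Custom Retriever": your_custom_retriever,
    }
}

# Extra keys from the retriever's YAML entry (hypothetical example values).
retriever_config = {"collection": "support-tickets"}


async def call_retriever(agent: str, retriever: str, query: str) -> List[Dict[str, Any]]:
    """Look up the registered function and pass the config as keyword arguments."""
    func = executor_dict[agent][retriever]
    return await func(query, **retriever_config)


print(asyncio.run(call_retriever("Research Agent", "My Custom Retriever", "login errors")))
```

The keys in executor_dict have to match the agent name and the retriever_name used in your configuration, which is why the two levels of nesting exist.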
Use Case¶
CustomRetriever offers flexibility by allowing tailored data retrieval processes. As long as your retriever function is correctly written to return results in the required format, it can effectively integrate with your research agent. Key use cases include:
- Specialized Data Queries: Customize data access for unique structures and formats.
- Enhanced Search: Implement specific search algorithms for precise outcomes.
- API Integration: Seamlessly fetch and incorporate data from external sources.
- Performance Optimization: Enhance speed and efficiency for large data volumes.
- Domain-Specific Logic: Utilize custom logic to meet specific criteria.
- Security and Compliance: Ensure data handling aligns with necessary standards.