
Retrievers Gallery

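Explore the retrievers supported by the ResearchAgent of the AI Refinery SDK. Each retriever fetches relevant information from a particular kind of source based on user queries. Supported retrievers include WebSearchRetriever, AzureAISearchRetriever, ElasticSearchRetriever, OpenSearchRetriever, LLMsTxtRetriever, and CustomRetriever.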


WebSearchRetriever

The WebSearchRetriever is designed to perform web searches using external search engines. The currently supported search engine is Google Search. It is ideal for retrieving the latest public information from the internet.

Configuration Template

Here is the configuration template for the WebSearchRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: WebSearchRetriever    # Required: Specifies use of the web search retriever
  description: <optional-description>    # Optional: Brief description of what this retriever is used for

  query_transformation_examples:         # Optional: Helps transform complex user queries into effective web search queries
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>

  source_weight: <weight>                # Optional: Importance weight relative to other retrievers (default: 1.0)
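
As a rough illustration of what source_weight expresses, the sketch below scales each hit's score by the weight of the retriever that produced it and then ranks all hits together. This is not the SDK's ranking logic: the function name `merge_weighted`, the "retriever" key, and the multiplicative weighting are assumptions made for the example.

```python
from typing import Any, Dict, List


def merge_weighted(
    results_by_retriever: Dict[str, List[Dict[str, Any]]],
    source_weights: Dict[str, float],
) -> List[Dict[str, Any]]:
    """Scale each hit's score by its retriever's weight, then rank all hits."""
    merged = []
    for name, hits in results_by_retriever.items():
        weight = source_weights.get(name, 1.0)  # documented default is 1.0
        for hit in hits:
            # Assumption: weighting is a simple multiplication of the raw score.
            merged.append({**hit, "retriever": name, "score": hit["score"] * weight})
    return sorted(merged, key=lambda h: h["score"], reverse=True)


ranked = merge_weighted(
    {
        "web": [{"result": "public article", "score": 0.8}],
        "internal": [{"result": "internal wiki page", "score": 0.5}],
    },
    {"internal": 2.0},
)
print([h["retriever"] for h in ranked])
```

With a weight of 2.0, the internal hit outranks the web hit even though its raw score is lower.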

Use Case

The WebSearchRetriever is well-suited for retrieving publicly available information from the open internet, similar to a traditional search engine. Typical use cases include:

  • General knowledge and fact-finding
  • News updates and trending topics
  • Technical explanations or documentation
  • Comparative research on tools, services, or ideas
  • Any query requiring up-to-date or web-accessible content

AzureAISearchRetriever

The AzureAISearchRetriever is designed to perform vector-based searches over an index hosted on Azure. It is ideal for retrieving information from pre-indexed datasets.

Configuration Template

Here is the configuration template for the AzureAISearchRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: AzureAISearchRetriever  # Required: Use this retriever for Azure-hosted vector search
  description: <optional-description>  # Optional: Brief explanation of what this retriever is used for

  aisearch_config:
    base_url: <your-base-url>  # Required: Base URL of your Azure vector search endpoint
    api_key: <your-api-key>  # Required: Azure AISearch service API key
    index: <your-index-name>  # Required: Name of the vector index to search

    embedding_column: <embedding-column-name>  # Required: Column in your index containing embedded data
    embedding_config:
      model: <embedding-model-name>  # Required: Must match the model used during indexing
    top_k: <number-of-results>  # Optional: Number of top documents to retrieve

    content_column:  # Required: Column(s) containing retrievable content
      - <content-column-1>
      - <content-column-2>

    aggregate_column: <optional-aggregate-column>  # Optional: Used to group chunks by document
    meta_data:  # Optional: Metadata fields to enrich the response
      - column_name: <source-column-name>  # Required within meta_data
        load_name: <display-name>  # Required within meta_data

  query_transformation_examples:  # Optional: User-to-search query examples for improved relevance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>

  source_weight: <weight-value>  # Optional: Importance weight relative to other retrievers (default: 1.0)
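
To make the roles of content_column, aggregate_column, and meta_data more concrete, here is a minimal sketch that groups index rows by document and exposes metadata under its display name. The row layout, the helper name `group_chunks`, and the way content columns are joined are assumptions for illustration; the retriever's actual output shape may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_chunks(
    rows: List[Dict[str, Any]],
    content_columns: List[str],
    aggregate_column: str,
    meta_data: List[Dict[str, str]],
) -> Dict[Any, Dict[str, Any]]:
    """Group index rows by document and expose metadata under display names."""
    grouped: Dict[Any, Dict[str, Any]] = defaultdict(lambda: {"content": [], "meta": {}})
    for row in rows:
        doc = grouped[row[aggregate_column]]
        # Join the configured content columns into one text chunk.
        doc["content"].append(" ".join(str(row[c]) for c in content_columns))
        # Copy each metadata column under its configured load_name.
        for field in meta_data:
            doc["meta"][field["load_name"]] = row[field["column_name"]]
    return dict(grouped)


rows = [
    {"doc_id": "a", "title": "Intro", "text": "Part one", "url": "https://example.com/a"},
    {"doc_id": "a", "title": "Intro", "text": "Part two", "url": "https://example.com/a"},
    {"doc_id": "b", "title": "Setup", "text": "Install", "url": "https://example.com/b"},
]
docs = group_chunks(
    rows,
    content_columns=["title", "text"],
    aggregate_column="doc_id",
    meta_data=[{"column_name": "url", "load_name": "Source"}],
)
print(docs["a"])
```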

Use Case

The AzureAISearchRetriever is ideal for retrieving information from pre-indexed datasets via semantic search. It's best used in scenarios such as:

  • Internal knowledge base queries
  • Organizational content search
  • Semantic search over embedded data

ElasticSearchRetriever

The ElasticSearchRetriever is designed to perform vector-based searches over an index hosted in ElasticSearch. It also works well for retrieving information from structured or pre-indexed datasets.

Configuration Template

Here is the configuration template for the ElasticSearchRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: ElasticSearchRetriever  # Required: Use this retriever for ElasticSearch-based vector search
  description: <optional-description>  # Optional: Brief explanation of what this retriever is used for

  elasticsearch_config:
    base_url: <your-elasticsearch-url>  # Required: Endpoint of your ElasticSearch service
    api_key: <your-api-key>  # Required: Service API key
    index: <your-index-name>  # Required: Name of the ElasticSearch index

    embedding_column: <embedding-column-name>  # Required: Column storing vector embeddings
    embedding_config:
      model: <embedding-model-name>  # Required: Must match the model used during data embedding
    top_k: <number-of-results>  # Optional: Number of top documents to retrieve

    content_column:  # Required: Column(s) containing content to retrieve
      - <content-column-1>
      - <content-column-2>

    aggregate_column: <optional-aggregate-column>  # Optional: Group chunks by original document
    meta_data:  # Optional: Metadata fields to include in results
      - column_name: <metadata-field>  # Required within meta_data
        load_name: <display-label>  # Required within meta_data

  threshold: <float-between-0-and-1>  # Optional: Filters out low-quality chunks (default: 0.9)

  query_transformation_examples:  # Optional: Transforms user queries for better search performance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>

  source_weight: <weight-value>  # Optional: Weight of this retriever relative to others (default: 1.0)
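
The interaction between threshold and top_k can be pictured with the sketch below. It assumes that higher scores mean better matches and that threshold acts as a minimum score; the template does not spell out either point, so treat the comparison direction and the helper name `apply_threshold` as assumptions.

```python
from typing import Any, Dict, List


def apply_threshold(
    chunks: List[Dict[str, Any]], top_k: int, threshold: float = 0.9
) -> List[Dict[str, Any]]:
    """Keep chunks scoring at least `threshold`, best first, capped at top_k."""
    # Assumption: threshold is a minimum score and higher is better.
    kept = [c for c in chunks if c["score"] >= threshold]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:top_k]


chunks = [
    {"result": "close match", "score": 0.95},
    {"result": "weak match", "score": 0.50},
    {"result": "good match", "score": 0.92},
]
print(apply_threshold(chunks, top_k=5))
```

Under these assumptions, a high threshold such as the default 0.9 can return fewer than top_k chunks, which is worth keeping in mind when results look sparse.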

Use Case

The ElasticSearchRetriever is ideal for retrieving semantically relevant information from ElasticSearch-hosted content repositories. It excels in use cases such as:

  • Internal knowledge base queries
  • Organizational content search
  • Semantic search over embedded data

OpenSearchRetriever

The OpenSearchRetriever is designed to perform searches over an index hosted in OpenSearch. It supports four search modes: K-Nearest Neighbors (KNN), keyword search (match), hybrid search (combined search), and filtered KNN (vector search with filters).

Search Modes

Each mode requires a different set of configuration keys:

  1. knn (K-Nearest Neighbors): Pure vector similarity search using embeddings. Best for semantic search.

    • Requires: embedding_column, embedding_config
  2. match (Keyword Search): Traditional keyword matching using the BM25 algorithm. No embeddings required.

    • Requires: search_fields
  3. hybrid (Combined Search): Combines vector similarity and keyword matching with configurable weights.

    • Requires: embedding_column, embedding_config, search_fields, vector_weight, text_weight

      Note: vector_weight + text_weight should equal 1.0

  4. filtered_knn (Vector Search with Filters): Vector search with metadata filtering that restricts results to documents where specific metadata fields match given values.

    • Requires: embedding_column, embedding_config, filters
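
The per-mode requirements above can be checked before a configuration is used. The following is a minimal sketch of such a check; `missing_keys` is a hypothetical helper, not part of the SDK. It treats every listed key as something that must be set explicitly, which is stricter than the retriever itself, since the configuration template documents defaults for some of these keys.

```python
from typing import Any, Dict, List

# Required keys per search mode, copied from the list above.
REQUIRED_BY_MODE: Dict[str, List[str]] = {
    "knn": ["embedding_column", "embedding_config"],
    "match": ["search_fields"],
    "hybrid": [
        "embedding_column",
        "embedding_config",
        "search_fields",
        "vector_weight",
        "text_weight",
    ],
    "filtered_knn": ["embedding_column", "embedding_config", "filters"],
}


def missing_keys(opensearch_config: Dict[str, Any]) -> List[str]:
    """Return the keys the chosen search mode needs but the config lacks."""
    mode = opensearch_config.get("search_mode", "knn")  # documented default
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown search_mode: {mode!r}")
    return [key for key in REQUIRED_BY_MODE[mode] if key not in opensearch_config]


print(missing_keys({"search_mode": "hybrid", "search_fields": ["title^2"]}))
```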

Configuration Template

Here is the configuration template for the OpenSearchRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: OpenSearchRetriever  # Required: Use this retriever for OpenSearch-based search
  description: <optional-description>  # Optional: Brief explanation of what this retriever is used for

  opensearch_config:
    # Connection settings
    host: <opensearch-host>  # Optional: OpenSearch host address (default: "localhost")
    port: <opensearch-port>  # Optional: OpenSearch port (default: 9200)
    use_ssl: <true-or-false>  # Optional: Use HTTPS connection (default: false)
    verify_certs: <true-or-false>  # Optional: Verify SSL certificates (default: false)
    http_auth: [<username>, <password>]  # Optional: Authentication credentials as a list

    # Index settings
    index: <your-index-name>  # Required: Name of the OpenSearch index
    embedding_column: <embedding-column-name>  # Optional: Column storing vector embeddings (default: "embedding")
    content_column:  # Required: Column(s) containing content to retrieve
      - <content-column-1>
      - <content-column-2>
    top_k: <number-of-results>  # Optional: Number of top documents to retrieve (default: 10)

    # Search mode configuration
    search_mode: <search-mode>  # Optional: "knn", "match", "hybrid", or "filtered_knn" (default: "knn")

    # Embedding configuration (required for knn, hybrid, and filtered_knn modes)
    embedding_config:
      model: <embedding-model-name>  # Required: Must match the model used during indexing

    # Text search settings (required for match and hybrid modes)
    search_fields:  # Required for match/hybrid: Fields for text search
      - <text-field-1>^<boost>  # Optional boost factor (e.g., "title^2")
      - <text-field-2>

    # Hybrid mode settings (only for search_mode: "hybrid")
    vector_weight: <weight-value>  # Optional: Weight for vector component (default: 0.6)
    text_weight: <weight-value>  # Optional: Weight for text component (default: 0.4)

    # Filtered KNN settings (only for search_mode: "filtered_knn")
    filters:  # Optional: Metadata filters for filtered_knn
      <field-name>: <field-value>  # Single value filter
      # <field-name>: [<value1>, <value2>]  # Multiple values (OR condition)

    # Optional settings
    aggregate_column: <optional-aggregate-column>  # Optional: Group chunks by original document
    meta_data:  # Optional: Metadata fields to include in results
      - column_name: <metadata-field>  # Required within meta_data
        load_name: <display-label>  # Required within meta_data

  threshold: <float-between-0-and-1>  # Optional: Filters out low-quality chunks (default: 0.9)

  query_transformation_examples:  # Optional: Transforms user queries for better search performance
    - user_query: <example-user-query>
      query:
        - <transformed-query-1>
        - <transformed-query-2>

  source_weight: <weight-value>  # Optional: Weight of this retriever relative to others (default: 1.0)

Configuration Examples

Example 1: KNN (Semantic Search)

opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "knn"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  top_k: 5

Example 2: Match (Keyword Search)

opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  content_column: ["content", "title"]
  search_mode: "match"
  search_fields: ["title^2", "content"]
  top_k: 5
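
For orientation, a keyword search over boosted fields corresponds roughly to an OpenSearch multi_match query. The sketch below builds such a request body from the same values as Example 2. Whether the OpenSearchRetriever issues exactly this query is an assumption; `build_match_query` is a hypothetical helper.

```python
from typing import Any, Dict, List


def build_match_query(text: str, search_fields: List[str], top_k: int = 10) -> Dict[str, Any]:
    """Build an OpenSearch request body for a BM25 keyword search."""
    return {
        "size": top_k,  # number of hits to return
        # Field names may carry a boost suffix such as "title^2".
        "query": {"multi_match": {"query": text, "fields": search_fields}},
    }


body = build_match_query("vector databases", ["title^2", "content"], top_k=5)
print(body)
```

Here "title^2" makes matches in the title count twice as much as matches in the content field.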

Example 3: Hybrid Search

opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "hybrid"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  search_fields: ["title^2", "content"]
  vector_weight: 0.6
  text_weight: 0.4
  top_k: 5

Example 4: Filtered KNN

opensearch_config:
  host: "localhost"
  port: 9200
  index: "documents"
  embedding_column: "embedding"
  content_column: ["content", "title"]
  search_mode: "filtered_knn"
  embedding_config:
    model: "intfloat/e5-mistral-7b-instruct"
  filters:
    category: "AI"
    status: "published"
  top_k: 5
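
The filter semantics can be sketched as a simple predicate over a document's metadata. A list value means "any of these values", as noted in the template. Combining different fields with AND is an assumption here, and `matches_filters` is a hypothetical helper rather than SDK code.

```python
from typing import Any, Dict


def matches_filters(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """True when every filtered field matches; a list value means "any of"."""
    for field, wanted in filters.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        # Assumption: all filtered fields must match (AND across fields).
        if metadata.get(field) not in allowed:
            return False
    return True


filters = {"category": "AI", "status": "published"}
print(matches_filters({"category": "AI", "status": "published"}, filters))
print(matches_filters({"category": "AI", "status": "draft"}, filters))
```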

Use Case

The OpenSearchRetriever is ideal for retrieving information from OpenSearch-hosted content repositories with flexible search strategies. It excels in use cases such as:

  • Semantic Search: Use KNN mode for meaning-based retrieval from embedded data
  • Keyword Search: Use match mode for traditional BM25-based text matching
  • Hybrid Search: Combine semantic and keyword search for best search quality
  • Filtered Search: Apply metadata filters to narrow down vector search results
  • Internal Knowledge Bases: Query organizational content with multiple search strategies
  • Document Management Systems: Retrieve documents using the most appropriate search method

Important Notes

  • Embedding Model Consistency: The embedding_config.model must match the model used to create embeddings in your OpenSearch index. Mismatched models will produce poor search results.
  • Pre-existing Data: The OpenSearch retriever expects data to already exist in your OpenSearch index. Use OpenSearch's native tools (bulk API, Logstash, etc.) to ingest and index data.
  • Hybrid Mode Weights: When using hybrid mode, ensure vector_weight + text_weight = 1.0 for proper score normalization.
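
To see why the hybrid weights should sum to 1.0, consider a weighted average of two scores that are each already normalized to the range 0 to 1. This is an illustrative sketch under that assumption; the retriever's internal normalization is not documented here, and `hybrid_score` is a hypothetical helper.

```python
import math


def hybrid_score(
    vector_score: float,
    text_score: float,
    vector_weight: float = 0.6,
    text_weight: float = 0.4,
) -> float:
    """Weighted average of a vector score and a text score, both in [0, 1]."""
    if not math.isclose(vector_weight + text_weight, 1.0):
        raise ValueError("vector_weight + text_weight should equal 1.0")
    return vector_weight * vector_score + text_weight * text_score


print(hybrid_score(0.9, 0.5))
```

When the weights sum to 1.0, the combined score stays in the same 0 to 1 range as its inputs, so it remains comparable across queries.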

LLMsTxtRetriever

The LLMsTxtRetriever is designed to retrieve content relevant to user queries in a website based on its llms.txt file.

Configuration Template

Here is the configuration template for the LLMsTxtRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: LLMsTxtRetriever  # Required: Use this retriever for llms.txt based search
  description: <optional-description>  # Optional: Brief explanation of what this retriever is used for

  llms_txt_config:
    llms_txt_url: <your-base-url>  # Required: The direct URL to the llms.txt file
    use_all_urls: false  # Optional: If false (default), use only the URLs relevant to the query; if true, use all URLs from llms.txt
    url_filter_include_terms: ["model", "agent", "api"]  # Optional: Terms to include when filtering URLs from llms.txt; None (default) disables filtering
    url_filter_exclude_terms: ["deprecated", "legacy"]  # Optional: Terms to exclude when filtering URLs from llms.txt; None (default) disables filtering
    enable_async_query_page_analysis: true  # Optional: If true (default), return query-relevant information generated by LLMs; otherwise, return raw content
    enable_caching: true  # Optional: If true (default), cache model responses
  source_weight: <weight-value>  # Optional: Importance weight relative to other retrievers (default: 1.0)
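
One plausible reading of the include and exclude terms is a substring filter over the URLs listed in llms.txt. The sketch below assumes case-insensitive substring matching, "any term" semantics for both lists, and that exclusion wins over inclusion. None of these details are confirmed by the template, and `filter_urls` is a hypothetical helper.

```python
from typing import List, Optional


def filter_urls(
    urls: List[str],
    include_terms: Optional[List[str]] = None,
    exclude_terms: Optional[List[str]] = None,
) -> List[str]:
    """Keep URLs containing any include term and no exclude term."""
    kept = []
    for url in urls:
        lowered = url.lower()
        # None disables the corresponding filter, matching the documented default.
        if include_terms is not None and not any(t.lower() in lowered for t in include_terms):
            continue
        if exclude_terms is not None and any(t.lower() in lowered for t in exclude_terms):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/agent-overview.md",
    "https://example.com/docs/legacy/agent-v1.md",
    "https://example.com/docs/pricing.md",
]
print(filter_urls(urls, ["model", "agent", "api"], ["deprecated", "legacy"]))
```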

Use Case

The LLMsTxtRetriever is ideal for retrieving information from websites that publish an llms.txt file. With an llms.txt file, the LLMsTxtRetriever can navigate websites with complex structures and retrieve relevant information for better retrieval-augmented generation.


CustomRetriever

The CustomRetriever allows you to design retrievers tailored to your specific use cases, enabling retrieval of information from unique or specialized data sources.

Configuration Template

Below is an example configuration for setting up a CustomRetriever:

- retriever_name: <your-retriever-name>  # Required: A custom name for this retriever instance
  retriever_class: CustomRetriever  # Required: Use this class to run your own retriever function
  description: <optional-description>  # Optional: Brief explanation of what this retriever is used for

  # Any other arbitrary config that your CustomRetriever needs
  your_arbitrary_config_1: <config-value>
  your_arbitrary_config_2: <config-value>
  your_arbitrary_config_n: <config-value>

Implementation Instructions

Retriever Function Template

You need to implement the logic for your CustomRetriever within a Python function. Below is the template for that function:

from typing import Any, Dict, List


async def your_custom_retriever(
    query: str,
    your_arbitrary_config_1: Any,
    your_arbitrary_config_2: Any,
    your_arbitrary_config_n: Any,
) -> List[Dict[str, Any]]:
    """
    Retrieves information based on the provided query.

    Args:
        query (str): The query string used to search for relevant information.
        your_arbitrary_config_1 (Any): An arbitrary configuration parameter with unspecified type.
        your_arbitrary_config_2 (Any): Another arbitrary configuration parameter with unspecified type.
        your_arbitrary_config_n (Any): Another arbitrary configuration parameter with unspecified type.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries, each containing:
            - "result" (str): A string representing the retrieved text content.
            - "score" (int or float): A numeric relevance score indicating how well the result matches the query.
            - "source" (str or None): A string representing an identifier for the source of the retrieved item, or None if not available.

        Note: If an error occurs or no documents are found, return [{"result": "", "score": 0, "source": None}].
    """
    # Placeholder: replace with your retrieval logic.
    return [{"result": "", "score": 0, "source": None}]

All the arbitrary configurations you specified in the retriever's YAML configuration will be passed as input arguments to this function. You will have access to these configurations within your retriever function.
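
As a concrete illustration of the required return format, here is a minimal sketch of a retriever that scores documents by keyword overlap. The in-memory `DOCUMENTS` corpus, the function name `keyword_retriever`, and the `min_overlap` parameter (standing in for an arbitrary YAML configuration key) are all hypothetical.

```python
import asyncio
from typing import Any, Dict, List

# Hypothetical in-memory corpus standing in for a real data source.
DOCUMENTS = {
    "faq.md": "Reset your password from the account settings page.",
    "billing.md": "Invoices are issued on the first day of each month.",
}


async def keyword_retriever(query: str, min_overlap: int = 1) -> List[Dict[str, Any]]:
    """Score each document by how many query words it contains."""
    words = set(query.lower().split())
    hits = []
    for source, text in DOCUMENTS.items():
        overlap = len(words & set(text.lower().rstrip(".").split()))
        if overlap >= min_overlap:
            hits.append({"result": text, "score": overlap, "source": source})
    if not hits:
        # Documented fallback when nothing is found.
        return [{"result": "", "score": 0, "source": None}]
    return sorted(hits, key=lambda h: h["score"], reverse=True)


print(asyncio.run(keyword_retriever("reset password")))
```

A real implementation would typically query a database or an API at the point where this sketch loops over `DOCUMENTS`.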

⚠️ Warning: The previous output format, with only the "result" and "score" fields, is still supported for existing implementations. Update to the new format when you can, because the old format may be deprecated in a future version.
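
If an existing retriever still returns the older two-field items, a small wrapper is one way to bring its output in line with the new format. This is a sketch only; `upgrade_result` is a hypothetical helper, and setting "source" to None for legacy items follows the documented meaning of None as "not available".

```python
from typing import Any, Dict, List


def upgrade_result(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return the item in the three-field format, adding "source" if missing."""
    return {
        "result": item["result"],
        "score": item["score"],
        "source": item.get("source"),  # None when the legacy item has no source
    }


legacy: List[Dict[str, Any]] = [{"result": "some text", "score": 0.7}]
print([upgrade_result(item) for item in legacy])
```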

Integration to executor_dict

Once you've defined your retriever function, you need to incorporate it into the executor_dict of your project using the following format:

executor_dict = {
    "<name-of-your-research-agent>": {
        "<your-custom-retriever-name>": your_custom_retriever,
    }
}

This step ensures that your function is properly registered and can be executed within the project's framework.
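
Before wiring the function into a project, it can help to call it locally through the same dictionary structure. The sketch below is a local smoke test, not a description of how the SDK dispatches calls: the agent name "Research Agent", the retriever name "my_custom_retriever", and the `collection` configuration key are hypothetical.

```python
import asyncio
from typing import Any, Dict, List


async def your_custom_retriever(query: str, collection: str) -> List[Dict[str, Any]]:
    # Placeholder logic: echo the query and the configured collection name.
    return [{"result": f"{collection}: {query}", "score": 1.0, "source": collection}]


executor_dict = {
    "Research Agent": {
        "my_custom_retriever": your_custom_retriever,
    }
}

# Hypothetical extra keys from the retriever's YAML entry.
extra_config = {"collection": "handbook"}

# Look up the function by agent name and retriever name, then call it.
retriever_fn = executor_dict["Research Agent"]["my_custom_retriever"]
results = asyncio.run(retriever_fn("vacation policy", **extra_config))
print(results)
```

The key under the agent name should match the retriever_name used in the YAML configuration, so that the function can be associated with the right retriever entry.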

Use Case

CustomRetriever offers flexibility by allowing tailored data retrieval processes. As long as your retriever function is correctly written to return results in the required format, it can effectively integrate with your research agent. Key use cases include:

  • Specialized Data Queries: Customize data access for unique structures and formats.

  • Enhanced Search: Implement specific search algorithms for precise outcomes.

  • API Integration: Seamlessly fetch and incorporate data from external sources.

  • Performance Optimization: Enhance speed and efficiency for large data volumes.

  • Domain-Specific Logic: Utilize custom logic to meet specific criteria.

  • Security and Compliance: Ensure data handling aligns with necessary standards.