VectorSearch 2.0.0

Overview

Data Intermediate

Available Versions: 2.0.0 (current) | 1.0.0

Description

Searches for items in a dataset using vector embeddings and returns the top results based on similarity.

Configuration Options

Name	Data Type	Description	Default Value
topn	`int`	Maximum number of search results to return. Controls result volume and helps manage performance for large datasets.	`10`
token_limit	`int`	Maximum total tokens allowed in the response. When the limit is exceeded, results are truncated in order: strings are shortened first, then lists, then dictionary keys, until the response is under the limit.	`2000`
dataspace_ids	`list[UUID] or None`	Specific dataspace IDs to search within. When empty or null, searches all dataspaces accessible in the current workspace.	`[]`

Inputs

Name	Data Type	Description
query	`str`	Search query text for vector similarity matching. Uses embedding-based similarity to find semantically related dataset items.

Outputs

Name	Data Type	Description
results	`list[DataSetItem]`	List of dataset items ranked by similarity score, interleaved across multiple datasets, limited by topn and token_limit configurations.

Version History

2.0.0 (Current) - Native implementation
1.0.0 - Native implementation

Examples

# Search across all accessible dataspaces
- block_type: VectorSearch_2_0_0
  name: semantic_search
  config:
    topn: 5
    token_limit: 1500
  inputs:
    query: "machine learning model accuracy metrics"
  # Returns top 5 most relevant items about ML accuracy

# Search within specific dataspaces only
- block_type: VectorSearch_2_0_0
  name: targeted_search
  config:
    topn: 15
    token_limit: 3000
    dataspace_ids: 
      - "550e8400-e29b-41d4-a716-446655440001"
      - "550e8400-e29b-41d4-a716-446655440002"
  inputs:
    query: "customer satisfaction survey responses Q3 2024"
  # Searches only within specified research and survey dataspaces

# High-volume search with token limit management
- block_type: VectorSearch_2_0_0
  name: comprehensive_search
  config:
    topn: 50
    token_limit: 8000
  inputs:
    query: "product roadmap feature priorities Q1 planning"
  # Returns many results but truncates content to fit token limit
  # Strings shortened first, then lists, then dictionary keys

# Minimal search for performance
- block_type: VectorSearch_2_0_0
  name: quick_search
  config:
    topn: 3
    token_limit: 500
  inputs:
    query: "urgent bug reports production database"
  # Fast search with minimal response size for real-time use cases

Error Handling

WorkspaceNotAvailableError

Error Code: Exception
Common Cause: The VectorSearch block is used outside a workspace context, where dataspaces are not available.
Solution: Ensure the block is used within a workspace that has configured dataspaces. Verify workspace setup and dataspace access permissions.

DataspaceAccessError

Error Code: PermissionError
Common Cause: One or more of the specified dataspace_ids do not exist or are not accessible in the current workspace.
Solution: Verify that each dataspace ID exists and that the current workspace has read access to it. Remove invalid IDs or grant the appropriate permissions.
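
Before passing IDs to the block, it can help to filter out values that are not even well-formed UUIDs. The following is a minimal sketch using Python's standard `uuid` module; the helper name `split_valid_uuids` is hypothetical and not part of the block. It checks format only and cannot tell whether a dataspace exists or whether the workspace can read it.

```python
from uuid import UUID

def split_valid_uuids(raw_ids):
    """Separate well-formed UUID strings from malformed ones (format check only)."""
    valid, invalid = [], []
    for raw in raw_ids:
        try:
            valid.append(UUID(raw))
        except (ValueError, AttributeError, TypeError):
            # Not parseable as a UUID (wrong length, bad characters, or not a string).
            invalid.append(raw)
    return valid, invalid

valid, invalid = split_valid_uuids([
    "550e8400-e29b-41d4-a716-446655440001",
    "not-a-uuid",
])
print([str(u) for u in valid])  # ['550e8400-e29b-41d4-a716-446655440001']
print(invalid)                  # ['not-a-uuid']
```

IDs that pass this check can still trigger DataspaceAccessError if they do not exist or are not readable from the current workspace.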

EmbeddingConfigError

Error Code: ConfigurationError
Common Cause: The embedding service configuration, which is required for tokenization and similarity search, is unavailable or invalid.
Solution: Check the embeddings configuration in the workspace settings. Ensure the embedding service is properly configured and accessible.

FAQ

How does the token limit affect search results?

When results exceed token_limit, the block truncates content in a fixed order: it first shortens strings, then reduces the number of list items, then drops dictionary keys, stopping once the response is under the limit. This keeps responses within downstream processing limits while preserving the most relevant information.
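
The truncation order can be illustrated with a short Python sketch. This is not the block's actual implementation: the token estimate (about four characters per token), the cap values, and all function names are assumptions chosen for illustration. Only the ordering (strings, then lists, then dictionary keys) comes from the description above.

```python
import json

def count_tokens(value):
    # Assumption: roughly 4 characters per token. The real block presumably
    # uses the tokenizer from the workspace embeddings configuration.
    return len(json.dumps(value)) // 4

def shorten_strings(value, max_len):
    """Recursively cut every string down to at most max_len characters."""
    if isinstance(value, str):
        return value[:max_len]
    if isinstance(value, list):
        return [shorten_strings(v, max_len) for v in value]
    if isinstance(value, dict):
        return {k: shorten_strings(v, max_len) for k, v in value.items()}
    return value

def shorten_lists(value, max_items):
    """Recursively keep only the first max_items entries of every list."""
    if isinstance(value, list):
        return [shorten_lists(v, max_items) for v in value[:max_items]]
    if isinstance(value, dict):
        return {k: shorten_lists(v, max_items) for k, v in value.items()}
    return value

def drop_keys(value, max_keys):
    """Recursively keep only the first max_keys keys of every dictionary."""
    if isinstance(value, dict):
        kept = list(value.items())[:max_keys]
        return {k: drop_keys(v, max_keys) for k, v in kept}
    if isinstance(value, list):
        return [drop_keys(v, max_keys) for v in value]
    return value

def fit_to_limit(results, token_limit):
    # Stages run in the documented order; a later stage is reached only if
    # the earlier ones did not bring the payload under the limit.
    stages = [
        (shorten_strings, [200, 100, 50, 20]),  # hypothetical caps
        (shorten_lists, [10, 5, 2, 1]),
        (drop_keys, [5, 3, 1]),
    ]
    for shrink, caps in stages:
        for cap in caps:
            if count_tokens(results) <= token_limit:
                return results
            results = shrink(results, cap)
    return results  # best effort; may still exceed the limit

items = [{"id": i, "title": "t" * 500, "tags": ["x"] * 30} for i in range(3)]
trimmed = fit_to_limit(items, token_limit=100)
print(count_tokens(items), "->", count_tokens(trimmed))
```

In this sketch the top-level results list is itself a list, so the second stage can also reduce the number of returned items.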

How are results ranked across multiple datasets?

The block searches each dataset separately, then interleaves the results so that each dataset is fairly represented: the top result from dataset A, then the top result from dataset B, then A's second result, and so on. This prevents one large dataset from dominating the results.
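
A round-robin merge of this kind can be sketched in Python as follows. The function name `interleave` and the use of plain strings as results are assumptions for illustration; the block's internal implementation may differ.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking a dataset that has run out of results

def interleave(ranked_lists, topn=10):
    """Round-robin merge of per-dataset result lists, each sorted best-first."""
    merged = []
    for round_ in zip_longest(*ranked_lists, fillvalue=_MISSING):
        for item in round_:
            if item is not _MISSING:
                merged.append(item)
    return merged[:topn]

dataset_a = ["A1", "A2", "A3"]
dataset_b = ["B1", "B2"]
print(interleave([dataset_a, dataset_b], topn=4))  # ['A1', 'B1', 'A2', 'B2']
```

When one dataset runs out of results, the remaining slots are filled from the datasets that still have items.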

What happens when no dataspaces are accessible?

The search returns an empty list. Ensure your workspace has configured dataspaces and your user has appropriate read permissions. Check the workspace configuration if results are unexpectedly empty.

How do I optimize performance for large-scale searches?

Use lower topn values (3-10) for faster searches. Set token_limit according to what downstream steps can process. Use dataspace_ids to limit the search scope when you only need specific domains or projects.

Can I search across multiple workspaces simultaneously?

No, the block operates within the current workspace context only. To search across multiple workspaces, run separate VectorSearch blocks in each workspace and combine results in your workflow logic.
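
One way to combine the outputs of several VectorSearch blocks is sketched below in Python. The `id` and `score` fields are hypothetical, since the structure of DataSetItem is not specified here, and the sketch assumes that similarity scores from different workspaces are comparable, which may not hold if the workspaces use different embedding configurations.

```python
def combine_results(result_lists, topn=10):
    """Merge result lists, keep the best-scoring copy of each item, rank by score."""
    best = {}
    for results in result_lists:
        for item in results:
            current = best.get(item["id"])  # "id" is a hypothetical field
            if current is None or item["score"] > current["score"]:
                best[item["id"]] = item
    ranked = sorted(best.values(), key=lambda it: it["score"], reverse=True)
    return ranked[:topn]

workspace_a = [{"id": "x", "score": 0.9}, {"id": "y", "score": 0.5}]
workspace_b = [{"id": "y", "score": 0.7}, {"id": "z", "score": 0.6}]
combined = combine_results([workspace_a, workspace_b], topn=2)
print([item["id"] for item in combined])  # ['x', 'y']
```

If scores are not comparable across workspaces, a round-robin merge that ignores scores is a safer choice.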