ChunkStore 1.0.0

Description

A powerful vector storage and semantic search block that processes content into meaningful chunks, computes embeddings, and enables similarity-based retrieval. This block maintains persistent state across workflow runs, making it ideal for building knowledge bases and retrieval systems.

Configuration Options

| Name | Data Type | Description | Default Value |
|---|---|---|---|
| `buffer_size` | `int` | Number of sentences to group together when detecting semantic boundaries. Higher values create larger chunks. | 1 |
| `breakpoint_percentile_threshold` | `int` | Percentile threshold (0-100) for determining semantic break points in content. Higher values create fewer, larger chunks. | 95 |
| `top_k` | `int` | Maximum number of most similar chunks to return from semantic search queries. | 5 |

Inputs

| Name | Data Type | Description |
|---|---|---|
| `data` | `Union[Chunks, list[Chunks]]` | Content chunks to store. Can be a single Chunks object or a list of Chunks. Used by the add_data step. |
| `id` | `str` | Unique identifier of the parent document to retrieve. Used by the get_data step. |
| `query` | `str` | Natural language search query for finding semantically similar content. Used by the semantic_search step. |
| `run` | `Any` | Trigger input to get current storage information. Used by the get_info step. |

Outputs

| Name | Data Type | Description |
|---|---|---|
| `info` | `StoreInfoResponse` | Storage status containing newly uploaded and previously stored content information. Output from the add_data and get_info steps. |
| `data` | `ParentType or GetDataError` | Retrieved parent document by ID, or an error if not found. Output from the get_data step. |
| `chunks` | `list[FileChunk or WebDataChunk or GenericChunk]` | Top-k most similar chunks ranked by cosine similarity score. Output from the semantic_search step. |

Examples

```yaml
# Store document chunks and search
name: knowledge_base
type: ChunkStore
config:
  top_k: 10
  buffer_size: 2
  breakpoint_percentile_threshold: 90
inputs:
  data: "Document chunks from previous processing step"
  query: "What is the company's return policy?"
outputs:
  info: "storage_status"
  chunks: "relevant_chunks"
```

Error Handling

Document Not Found

Error Code: `GetDataError`
Common Cause: The requested document ID does not exist in the chunk store.
Solution: Verify that the document ID is correct, or confirm the document was added successfully using the get_info step.

Embeddings Service Error

Error Code: `EmbeddingsServiceError`
Common Cause: Embedding computation failed due to service unavailability or rate limits.
Solution: Check the embedding service status, reduce the batch size, or implement retry logic with backoff.
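The retry-with-backoff suggestion can be sketched generically. This is an illustrative helper, not part of the block's API; `with_retry` and its parameters are assumptions:

```python
import random
import time

def with_retry(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Double the delay each attempt, capped at max_delay,
            # with a little jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Wrapping an embedding call in such a helper smooths over transient rate-limit errors without hammering the service.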

Invalid Chunk Format

Error Code: `ValidationError`
Common Cause: Input data doesn't match the expected Chunks format or contains invalid content types.
Solution: Ensure input data follows the correct Chunks schema, with valid parent and chunk objects.

FAQ

How does semantic chunking work?

The block analyzes content semantics to identify natural break points, grouping related sentences together. The buffer_size controls grouping size, while breakpoint_percentile_threshold determines sensitivity to semantic changes.
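The idea can be sketched in a few lines of Python. This is an illustrative sketch of percentile-based semantic splitting, not the block's actual implementation; function names are assumptions:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def semantic_chunks(sentences, embeddings, buffer_size=1,
                    breakpoint_percentile_threshold=95):
    """Split sentences into chunks at semantic break points.

    Each position is represented by the mean embedding of a trailing
    window of `buffer_size` sentences; a break is inserted wherever the
    distance between consecutive windows reaches the given percentile
    of all observed distances.
    """
    def window_mean(i):
        window = embeddings[max(0, i - buffer_size + 1): i + 1]
        dim = len(window[0])
        return [sum(vec[d] for vec in window) / len(window) for d in range(dim)]

    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(window_mean(i), window_mean(i + 1))
                 for i in range(len(sentences) - 1)]
    idx = min(len(distances) - 1,
              int(len(distances) * breakpoint_percentile_threshold / 100))
    cutoff = sorted(distances)[idx]
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist >= cutoff:  # semantic jump: start a new chunk
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks
```

With a higher percentile, fewer distances clear the cutoff, so fewer breaks are inserted and chunks grow larger, matching the configuration table above.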

What happens to chunks when I add the same document twice?

The block tracks document IDs and won't duplicate chunks. Previously uploaded documents are listed separately in the StoreInfoResponse, helping you understand what's new versus already processed.
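The ID-tracking behavior amounts to the following sketch. The dictionary keys mirror the new-versus-existing split described above, but the exact `StoreInfoResponse` field names here are assumptions:

```python
def add_documents(store, docs):
    """Skip documents whose ID is already stored; report new vs. existing."""
    newly_uploaded, previously_stored = [], []
    for doc in docs:
        if doc["id"] in store:
            previously_stored.append(doc["id"])  # already present: no duplicate chunks
        else:
            store[doc["id"]] = doc
            newly_uploaded.append(doc["id"])
    return {"newly_uploaded": newly_uploaded,
            "previously_stored": previously_stored}
```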

How accurate is the semantic search?

Search uses cosine similarity between query and chunk embeddings. Accuracy depends on the quality of your embedding service and how well the chunks capture semantic meaning. Results are ranked by similarity score.
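The ranking step is straightforward. A minimal sketch, assuming chunks are stored as `(content, embedding)` pairs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, chunks, k=5):
    """Return the k stored chunks most similar to the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), content)
              for content, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```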

Can I search across different types of content (files, web data, etc.)?

Yes! The block supports FileChunk, WebDataChunk, and GenericChunk types, allowing unified search across diverse content sources within the same vector store.

How much memory does the chunk store use?

Memory usage scales with the number of chunks and embedding dimensions. Each chunk stores its content, embeddings (typically 1536 dimensions), and metadata. Monitor usage with large document sets.
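A rough back-of-envelope estimate can be computed from those factors. The per-chunk content and metadata sizes below are illustrative assumptions, not measured values:

```python
def estimated_store_bytes(num_chunks, embedding_dim=1536,
                          bytes_per_float=4, avg_content_bytes=1000,
                          avg_metadata_bytes=200):
    """Rough memory estimate: embedding + content + metadata per chunk."""
    per_chunk = (embedding_dim * bytes_per_float
                 + avg_content_bytes + avg_metadata_bytes)
    return num_chunks * per_chunk

# e.g. 10,000 chunks with 1536-dim float32 embeddings:
# 10_000 * (1536 * 4 + 1000 + 200) = 73,440,000 bytes, roughly 73 MB
```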

Should I tune the breakpoint_percentile_threshold?

Higher values (95+) create fewer, larger chunks, which suit broad topics; lower values (80-90) create smaller, more focused chunks, which work better for retrieving specific details. Experiment with your content type to find the optimal setting.