ChunkStore 1.0.0¶
Overview¶
Description¶
A powerful vector storage and semantic search block that processes content into meaningful chunks, computes embeddings, and enables similarity-based retrieval. This block maintains persistent state across workflow runs, making it ideal for building knowledge bases and retrieval systems.
Configuration Options¶
| Name | Data Type | Description | Default Value |
|---|---|---|---|
| buffer_size | int | Number of sentences to group together when detecting semantic boundaries. Higher values create larger chunks. | 1 |
| breakpoint_percentile_threshold | int | Percentile threshold (0-100) for determining semantic break points in content. Higher values create fewer, larger chunks. | 95 |
| top_k | int | Maximum number of most similar chunks to return from semantic search queries. | 5 |
Inputs¶
| Name | Data Type | Description |
|---|---|---|
| data | Union[Chunks, list[Chunks]] | Content chunks to store. Can be a single Chunks object or list of Chunks. Used by add_data step. |
| id | str | Unique identifier of the parent document to retrieve. Used by get_data step. |
| query | str | Natural language search query for finding semantically similar content. Used by semantic_search step. |
| run | Any | Trigger input to get current storage information. Used by get_info step. |
Outputs¶
| Name | Data Type | Description |
|---|---|---|
| info | StoreInfoResponse | Storage status containing newly uploaded and previously stored content information. Output from add_data and get_info steps. |
| data | ParentType or GetDataError | Retrieved parent document by ID, or error if not found. Output from get_data step. |
| chunks | list[FileChunk or WebDataChunk or GenericChunk] | Top-k most similar chunks ranked by cosine similarity score. Output from semantic_search step. |
Examples¶
```yaml
# Store document chunks and search
name: knowledge_base
type: ChunkStore
config:
  top_k: 10
  buffer_size: 2
  breakpoint_percentile_threshold: 90
inputs:
  data: "Document chunks from previous processing step"
  query: "What is the company's return policy?"
outputs:
  info: "storage_status"
  chunks: "relevant_chunks"
```
Error Handling¶
Document Not Found
- Error Code: GetDataError
- Common Cause: The requested document ID does not exist in the chunk store
- Solution: Verify the document ID is correct, or confirm the document was successfully added using the get_info step
Embeddings Service Error
- Error Code: EmbeddingsServiceError
- Common Cause: Embedding computation failed due to service unavailability or rate limits
- Solution: Check the embedding service status, reduce the batch size, or implement retry logic with backoff
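The retry-with-backoff suggestion can be sketched as a small wrapper. This is a generic illustration, not part of the block's API: `call` stands in for whatever embeddings request is failing, and you should narrow the caught exception to your client's error type.

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff and jitter.

    `call` is a zero-argument callable (e.g. a wrapped embeddings request).
    Narrow `Exception` to your embedding client's error type in practice.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay each attempt; jitter avoids synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```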
Invalid Chunk Format
- Error Code: ValidationError
- Common Cause: Input data doesn't match the expected Chunks format or contains invalid content types
- Solution: Ensure input data follows the correct Chunks schema, with valid parent and chunk objects
FAQ¶
How does semantic chunking work?
The block analyzes content semantics to identify natural break points, grouping related sentences together. The buffer_size controls grouping size, while breakpoint_percentile_threshold determines sensitivity to semantic changes.
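A common way this style of semantic chunking is implemented (a sketch of the general technique, not necessarily the block's exact internals) is to smooth each sentence embedding over a `buffer_size` window, measure cosine distance between consecutive windows, and split wherever the distance exceeds the chosen percentile:

```python
import numpy as np

def semantic_chunks(sentences, embeddings,
                    buffer_size=1, breakpoint_percentile_threshold=95):
    """Group sentences into chunks by splitting at semantic break points.

    Each sentence's embedding is averaged with `buffer_size` neighbours on
    either side; cosine distances between consecutive windows above the
    given percentile become chunk boundaries.
    """
    # Smooth each embedding over its neighbourhood to reduce local noise.
    windowed = []
    for i in range(len(embeddings)):
        lo, hi = max(0, i - buffer_size), min(len(embeddings), i + buffer_size + 1)
        windowed.append(np.mean(embeddings[lo:hi], axis=0))

    # Cosine distance between consecutive windowed embeddings.
    dists = []
    for a, b in zip(windowed, windowed[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        dists.append(1.0 - cos)

    if not dists:
        return [sentences]

    # Higher percentile -> fewer boundaries -> fewer, larger chunks.
    threshold = np.percentile(dists, breakpoint_percentile_threshold)
    chunks, current = [], [sentences[0]]
    for sent, d in zip(sentences[1:], dists):
        if d > threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

With a lower threshold, more distances clear the bar and the content fractures into smaller chunks; at 95+ only the sharpest topic shifts become boundaries.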
What happens to chunks when I add the same document twice?
The block tracks document IDs and won't duplicate chunks. Previously uploaded documents are listed separately in the StoreInfoResponse, helping you understand what's new versus already processed.
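The deduplication behaviour described above can be illustrated with a toy store (the dict keys `newly_uploaded` / `previously_stored` are illustrative names, not the actual StoreInfoResponse schema):

```python
class DedupStore:
    """Toy model of ID-based deduplication: re-adding an ID is a no-op."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, chunks):
        """Report whether the document is new or was already stored."""
        if doc_id in self.docs:
            # Already present: do not duplicate chunks, just report it.
            return {"newly_uploaded": [], "previously_stored": [doc_id]}
        self.docs[doc_id] = chunks
        return {"newly_uploaded": [doc_id], "previously_stored": []}
```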
How accurate is the semantic search?
Search uses cosine similarity between query and chunk embeddings. Accuracy depends on the quality of your embedding service and how well the chunks capture semantic meaning. Results are ranked by similarity score.
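Cosine-similarity ranking reduces to normalizing the embeddings and taking a dot product; a minimal sketch of a top-k search over a matrix of chunk embeddings:

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, chunks, top_k=5):
    """Return (chunk, score) pairs ranked by cosine similarity to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = m @ q
    # Highest similarity first, truncated to top_k results.
    order = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in order]
```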
Can I search across different types of content (files, web data, etc.)?
Yes! The block supports FileChunk, WebDataChunk, and GenericChunk types, allowing unified search across diverse content sources within the same vector store.
How much memory does the chunk store use?
Memory usage scales with the number of chunks and embedding dimensions. Each chunk stores its content, embeddings (typically 1536 dimensions), and metadata. Monitor usage with large document sets.
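A back-of-the-envelope estimate makes the scaling concrete. The 1536 dimensions, 4-byte floats, and per-chunk metadata size below are illustrative assumptions; substitute your embedding model's actual dimension:

```python
def estimated_store_bytes(num_chunks, embedding_dim=1536,
                          bytes_per_float=4, avg_metadata_bytes=512):
    """Rough lower bound on chunk-store memory; embeddings dominate.

    Ignores index overhead and the raw chunk text itself, so treat the
    result as a floor, not a ceiling.
    """
    embedding_bytes = num_chunks * embedding_dim * bytes_per_float
    return embedding_bytes + num_chunks * avg_metadata_bytes
```

Under these assumptions, 10,000 chunks already occupy roughly 66 MB of embeddings and metadata before counting chunk text.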
Should I tune the breakpoint_percentile_threshold?
Higher values (95+) create fewer, larger chunks good for broad topics. Lower values (80-90) create smaller, focused chunks better for specific details. Test with your content type to find optimal settings.