SentenceChunk 2.0.0¶
Overview¶
Description¶
Parse text into sentence-based chunks, compute embeddings, and preserve original positions.
Configuration Options¶
| Name | Data Type | Description | Default Value |
|---|---|---|---|
| chunk_size | int | Maximum number of characters per chunk. The splitter uses intelligent sentence boundaries to create chunks that don't exceed this size while maintaining readability. | 200 |
| chunk_overlap | int | Number of overlapping characters between consecutive chunks to maintain context continuity. Must be between 0 and 100 characters. Helps preserve meaning across chunk boundaries. | 10 |
| separator | str | Primary token separator used for splitting text into smaller units before sentence boundary detection. Space is optimal for most text processing scenarios. | `" "` (space) |
| paragraph_separator | str | Pattern used to identify paragraph boundaries in the text. Three newlines indicate clear section breaks that should be preserved during chunking. | \n\n\n |
| secondary_chunking_regex | str | Regular expression for detecting sentence boundaries when primary splitting doesn't work. Handles multiple languages including English punctuation and CJK sentence terminators (。?!). | [^,.;。?!]+[,.;。?!]? |
Inputs¶
| Name | Data Type | Description |
|---|---|---|
| data | Any | Input data containing text content to be chunked. Can be a document, text string, or any object that can be converted to text using the content extraction utilities. |
Outputs¶
| Name | Data Type | Description |
|---|---|---|
| chunks | Chunks | Structured chunk group containing individual text chunks with preserved positional information, metadata, and parent document references. Each chunk maintains sentence boundaries and original text positions. |
Version History¶
- 2.0.0 (Current) - Native implementation
- 1.0.0 - Native implementation
Examples¶
```yaml
# Basic sentence-based chunking with default settings
- name: chunk_document
  block: SentenceChunk_2_0_0
  config:
    chunk_size: 300
    chunk_overlap: 15
  input:
    data: |
      Artificial intelligence has revolutionized modern computing. Machine learning algorithms can now process vast amounts of data with unprecedented accuracy. Deep learning models, in particular, have shown remarkable performance in image recognition and natural language processing tasks.

      The future of AI looks promising with ongoing research in areas like quantum computing and neural networks. These technologies will likely transform industries ranging from healthcare to autonomous vehicles.
```
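The separator defaults can be overridden for differently formatted sources. A hypothetical variant of the basic example for prose that separates paragraphs with a single blank line rather than three newlines:

```yaml
# Hypothetical variant: blank-line paragraphs, larger chunk budget
- name: chunk_article
  block: SentenceChunk_2_0_0
  config:
    chunk_size: 500
    chunk_overlap: 50               # within the 0-100 limit
    paragraph_separator: "\n\n"     # blank line instead of three newlines
  input:
    data: "Your article text here."
```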
Error Handling¶
Sentence Boundary Detection Failure
Cause: The secondary_chunking_regex pattern fails to identify proper sentence boundaries in specialized text formats or unusual punctuation.
Solution: Customize the secondary_chunking_regex parameter for your specific text format:
```yaml
secondary_chunking_regex: "[^.!?]+[.!?]+"        # For basic English text
secondary_chunking_regex: "[^,.;。?!]+[,.;。?!]?"  # For CJK languages
secondary_chunking_regex: "[^.!?…]+[.!?…]+"      # For text with ellipses
```
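The difference between the basic English pattern and the ellipsis-aware one can be seen directly with Python's `re` module (an illustrative sketch, not the block's internals):

```python
import re

text = "Wait… what happened? It worked!"

basic = re.findall(r"[^.!?]+[.!?]+", text)
with_ellipsis = re.findall(r"[^.!?…]+[.!?…]+", text)

print(basic)          # ellipsis is swallowed into the first unit:
                      # ['Wait… what happened?', ' It worked!']
print(with_ellipsis)  # 'Wait…' becomes its own unit:
                      # ['Wait…', ' what happened?', ' It worked!']
```

If your content uses ellipses as sentence terminators, the second pattern yields finer-grained units.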
Chunk Size Validation Error
Cause: Invalid chunk_overlap value (must be between 0 and 100) or a chunk_size too small to produce meaningful chunks.
Solution: Ensure chunk_overlap is between 0-100 and chunk_size is appropriate for your content:
```yaml
config:
  chunk_size: 150     # Minimum recommended for sentence chunks
  chunk_overlap: 15   # Must be 0-100, typically 10-20% of chunk_size
```
Text Processing and Position Tracking Errors
Cause: Issues with text extraction from input data or problems maintaining accurate chunk positions in complex document structures.
Solution: Ensure input data is properly formatted and contains extractable text content:
```yaml
# Verify input data structure
input:
  data: "Plain text string"   # Preferred format

# OR structured data with text content
input:
  data:
    title: "Document Title"
    content: "Main text content for chunking"
```
FAQ¶
What are the key improvements in SentenceChunk v2.0.0 over v1.0.0?
Version 2.0.0 brings several significant improvements:
- Removed tiktoken dependency: Eliminates external tokenizer requirements for better performance and reduced dependencies
- Enhanced position tracking: More accurate and efficient chunk position calculation using native text processing
- Simplified configuration: No longer requires llm_name parameter, making setup easier
- Better error handling: More robust validation and clearer error messages
- Improved metadata preservation: Better parent-child relationships and chunk metadata management
How does the sentence detection algorithm work in v2.0.0?
SentenceChunk v2.0.0 uses a sophisticated multi-layered approach for sentence detection:
- Primary splitting: Uses the separator (default: space) to break text into tokens
- Paragraph detection: Identifies paragraph boundaries using paragraph_separator (default: three newlines)
- Sentence boundary detection: Applies secondary_chunking_regex to identify sentence endings
- Intelligent chunking: Combines sentences to approach but not exceed chunk_size while maintaining readability
- Overlap management: Adds chunk_overlap characters between consecutive chunks to preserve context
This approach works well for multiple languages including English, Chinese, Japanese, and other CJK languages through the default regex pattern.
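The layered approach described above can be sketched as a simplified greedy chunker. This is an illustrative approximation only (it packs regex units into chunks and carries a character-level overlap forward), not the block's actual implementation:

```python
import re

def sentence_chunks(text, chunk_size=200, chunk_overlap=10,
                    regex=r"[^,.;。?!]+[,.;。?!]?"):
    """Simplified sketch: regex units are greedily packed into chunks
    of at most chunk_size characters; each chunk after the first is
    prefixed with the tail of the previous chunk for context."""
    units = [u for u in re.findall(regex, text) if u.strip()]
    chunks, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > chunk_size:
            chunks.append(current)
            # carry chunk_overlap trailing characters into the next chunk
            current = current[-chunk_overlap:] if chunk_overlap else ""
        current += unit
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("One two three. Four five six. Seven eight nine.",
                      chunk_size=20, chunk_overlap=0))
```

A single unit longer than chunk_size would still produce an oversized chunk in this sketch; the real block handles that case with its secondary splitting layers.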
How do I optimize chunk_size and chunk_overlap for my use case?
Optimal chunking parameters depend on your specific use case:
- For embedding models: Use chunk_size 200-500 characters with 10-20% overlap (20-100 characters)
- For LLM context windows: Larger chunks (500-1000 characters) with minimal overlap (10-20 characters)
- For search applications: Medium chunks (300-600 characters) with moderate overlap (30-60 characters)
- For summarization: Larger chunks (800-1200 characters) to provide sufficient context
Remember that chunk_overlap is capped at 100 characters. Test different combinations with your specific content to find the optimal balance between context preservation and processing efficiency.
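As one concrete reading of the embedding-model guidance above (the specific values here are illustrative, not prescribed defaults):

```yaml
config:
  chunk_size: 400
  chunk_overlap: 60   # 15% of 400, within the 100-character cap
```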
Does SentenceChunk v2.0.0 support multiple languages effectively?
Yes, v2.0.0 includes enhanced multilingual support:
- Default regex pattern: `[^,.;。?!]+[,.;。?!]?` handles English and CJK punctuation
- English: Recognizes periods, question marks, exclamation points
- Chinese/Japanese: Handles full-width punctuation (。?!)
- Custom patterns: You can modify secondary_chunking_regex for other languages:
  - Arabic: `[^.؟!]+[.؟!]?`
  - Spanish: `[^.¿¡!?]+[.!?]?`
  - French: `[^.!?…]+[.!?…]?`
The block automatically handles mixed-language content by applying the regex pattern consistently across all text.
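As a quick check of the Arabic pattern listed above (a standalone sketch using Python's `re` module; ؟ is the Arabic question mark):

```python
import re

# Arabic pattern from the custom-patterns list above.
arabic = re.compile(r"[^.؟!]+[.؟!]?")
units = [u.strip() for u in arabic.findall("ما اسمك؟ اسمي زيد.")]
print(units)  # ['ما اسمك؟', 'اسمي زيد.']
```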
What are the main benefits of migrating from v1.0.0 to v2.0.0?
Migration to v2.0.0 offers several advantages:
- Performance: Faster processing without tiktoken tokenization overhead
- Reduced dependencies: No external tokenizer libraries required
- Simplified setup: Remove llm_name parameter from your configuration
- Better accuracy: Improved position tracking and metadata preservation
- Enhanced stability: More robust error handling and validation
- Future-proof: Built on the latest Smartspace architecture for long-term support
Migration is straightforward: change your block name to SentenceChunk_2_0_0 and remove the llm_name parameter. All other configuration options remain compatible.