SentenceChunk 2.0.0¶
Overview¶
Description¶
Parse text into sentence-based chunks, compute embeddings, and preserve original positions.
Configuration Options¶
| Name | Data Type | Description | Default Value |
|---|---|---|---|
| chunk_size | int | Maximum number of characters per chunk. The splitter uses intelligent sentence boundaries to create chunks that don't exceed this size while maintaining readability. | 200 |
| chunk_overlap | int | Number of overlapping characters between consecutive chunks to maintain context continuity. Must be between 0 and 100 characters. Helps preserve meaning across chunk boundaries. | 10 |
| separator | str | Primary token separator used for splitting text into smaller units before sentence boundary detection. Space is optimal for most text processing scenarios. | `" "` (space) |
| paragraph_separator | str | Pattern used to identify paragraph boundaries in the text. Three newlines indicate clear section breaks that should be preserved during chunking. | \n\n\n |
| secondary_chunking_regex | str | Regular expression for detecting sentence boundaries when primary splitting doesn't work. Handles multiple languages including English punctuation and CJK sentence terminators (。?!). | [^,.;。?!]+[,.;。?!]? |
Inputs¶
| Name | Data Type | Description |
|---|---|---|
| data | Any | Input data containing text content to be chunked. Can be a document, text string, or any object that can be converted to text using the content extraction utilities. |
Outputs¶
| Name | Data Type | Description |
|---|---|---|
| chunks | Chunks | Structured chunk group containing individual text chunks with preserved positional information, metadata, and parent document references. Each chunk maintains sentence boundaries and original text positions. |
Version History¶
- 2.0.0 (Current) - Native implementation
- 1.0.0 - Native implementation
Examples¶
```yaml
# Basic sentence-based chunking with default settings
- name: chunk_document
  block: SentenceChunk_2_0_0
  config:
    chunk_size: 300
    chunk_overlap: 15
  input:
    data: |
      Artificial intelligence has revolutionized modern computing. Machine learning algorithms can now process vast amounts of data with unprecedented accuracy. Deep learning models, in particular, have shown remarkable performance in image recognition and natural language processing tasks.

      The future of AI looks promising with ongoing research in areas like quantum computing and neural networks. These technologies will likely transform industries ranging from healthcare to autonomous vehicles.
```
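The separator defaults can be overridden for differently formatted sources. A hypothetical variant of the basic example for prose that separates paragraphs with a single blank line rather than three newlines:

```yaml
# Hypothetical variant: blank-line paragraphs, larger chunk budget
- name: chunk_article
  block: SentenceChunk_2_0_0
  config:
    chunk_size: 500
    chunk_overlap: 50               # within the 0-100 limit
    paragraph_separator: "\n\n"     # blank line instead of three newlines
  input:
    data: "Your article text here."
```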
Error Handling¶
Sentence Boundary Detection Failure
Cause: The secondary_chunking_regex pattern fails to identify proper sentence boundaries in specialized text formats or unusual punctuation.
Solution: Customize the secondary_chunking_regex parameter for your specific text format:
```yaml
secondary_chunking_regex: "[^.!?]+[.!?]+"        # For basic English text
secondary_chunking_regex: "[^,.;。?!]+[,.;。?!]?"  # For CJK languages
secondary_chunking_regex: "[^.!?…]+[.!?…]+"      # For text with ellipses
```
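The difference between the basic English pattern and the ellipsis-aware one can be seen directly with Python's `re` module (an illustrative sketch, not the block's internals):

```python
import re

text = "Wait… what happened? It worked!"

basic = re.findall(r"[^.!?]+[.!?]+", text)
with_ellipsis = re.findall(r"[^.!?…]+[.!?…]+", text)

print(basic)          # ellipsis is swallowed into the first unit:
                      # ['Wait… what happened?', ' It worked!']
print(with_ellipsis)  # 'Wait…' becomes its own unit:
                      # ['Wait…', ' what happened?', ' It worked!']
```

If your content uses ellipses as sentence terminators, the second pattern yields finer-grained units.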
Chunk Size Validation Error
Cause: Invalid chunk_overlap value (must be between 0 and 100) or a chunk_size too small to produce meaningful chunks.
Solution: Ensure chunk_overlap is between 0-100 and chunk_size is appropriate for your content:
```yaml
config:
  chunk_size: 150     # Minimum recommended for sentence chunks
  chunk_overlap: 15   # Must be 0-100, typically 10-20% of chunk_size
```
Text Processing and Position Tracking Errors
Cause: Issues with text extraction from input data or problems maintaining accurate chunk positions in complex document structures.
Solution: Ensure input data is properly formatted and contains extractable text content:
```yaml
# Verify input data structure
input:
  data: "Plain text string"   # Preferred format

# OR structured data with text content
input:
  data:
    title: "Document Title"
    content: "Main text content for chunking"
```
FAQ¶
What are the key improvements in SentenceChunk v2.0.0 over v1.0.0?
Version 2.0.0 brings several significant improvements:
- Removed tiktoken dependency: Eliminates external tokenizer requirements for better performance and reduced dependencies
- Enhanced position tracking: More accurate and efficient chunk position calculation using native text processing
- Simplified configuration: No longer requires llm_name parameter, making setup easier
- Better error handling: More robust validation and clearer error messages
- Improved metadata preservation: Better parent-child relationships and chunk metadata management
How does the sentence detection algorithm work in v2.0.0?
SentenceChunk v2.0.0 uses a sophisticated multi-layered approach for sentence detection:
- Primary splitting: Uses the separator (default: space) to break text into tokens
- Paragraph detection: Identifies paragraph boundaries using paragraph_separator (default: three newlines)
- Sentence boundary detection: Applies secondary_chunking_regex to identify sentence endings
- Intelligent chunking: Combines sentences to approach but not exceed chunk_size while maintaining readability
- Overlap management: Adds chunk_overlap characters between consecutive chunks to preserve context
This approach works well for multiple languages including English, Chinese, Japanese, and other CJK languages through the default regex pattern.
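The layered approach described above can be sketched as a simplified greedy chunker. This is an illustrative approximation only (it packs regex units into chunks and carries a character-level overlap forward), not the block's actual implementation:

```python
import re

def sentence_chunks(text, chunk_size=200, chunk_overlap=10,
                    regex=r"[^,.;。?!]+[,.;。?!]?"):
    """Simplified sketch: regex units are greedily packed into chunks
    of at most chunk_size characters; each chunk after the first is
    prefixed with the tail of the previous chunk for context."""
    units = [u for u in re.findall(regex, text) if u.strip()]
    chunks, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > chunk_size:
            chunks.append(current)
            # carry chunk_overlap trailing characters into the next chunk
            current = current[-chunk_overlap:] if chunk_overlap else ""
        current += unit
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks("One two three. Four five six. Seven eight nine.",
                      chunk_size=20, chunk_overlap=0))
```

A single unit longer than chunk_size would still produce an oversized chunk in this sketch; the real block handles that case with its secondary splitting layers.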
How do I optimize chunk_size and chunk_overlap for my use case?
Optimal chunking parameters depend on your specific use case:
- For embedding models: Use chunk_size 200-500 characters with 10-20% overlap (20-100 characters)
- For LLM context windows: Larger chunks (500-1000 characters) with minimal overlap (10-20 characters)
- For search applications: Medium chunks (300-600 characters) with moderate overlap (30-60 characters)
- For summarization: Larger chunks (800-1200 characters) to provide sufficient context
Remember that chunk_overlap is capped at 100 characters. Test different combinations with your specific content to find the optimal balance between context preservation and processing efficiency.
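As one concrete reading of the embedding-model guidance above (the specific values here are illustrative, not prescribed defaults):

```yaml
config:
  chunk_size: 400
  chunk_overlap: 60   # 15% of 400, within the 100-character cap
```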
Does SentenceChunk v2.0.0 support multiple languages effectively?
Yes, v2.0.0 includes enhanced multilingual support:
- Default regex pattern: `[^,.;。?!]+[,.;。?!]?` handles English and CJK punctuation
- English: Recognizes periods, question marks, exclamation points
- Chinese/Japanese: Handles full-width punctuation (。?!)
- Custom patterns: You can modify secondary_chunking_regex for other languages:
  - Arabic: `[^.؟!]+[.؟!]?`
  - Spanish: `[^.¿¡!?]+[.!?]?`
  - French: `[^.!?…]+[.!?…]?`
The block automatically handles mixed-language content by applying the regex pattern consistently across all text.
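As a quick check of the Arabic pattern listed above (a standalone sketch using Python's `re` module; ؟ is the Arabic question mark):

```python
import re

# Arabic pattern from the custom-patterns list above.
arabic = re.compile(r"[^.؟!]+[.؟!]?")
units = [u.strip() for u in arabic.findall("ما اسمك؟ اسمي زيد.")]
print(units)  # ['ما اسمك؟', 'اسمي زيد.']
```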
What are the main benefits of migrating from v1.0.0 to v2.0.0?
Migration to v2.0.0 offers several advantages:
- Performance: Faster processing without tiktoken tokenization overhead
- Reduced dependencies: No external tokenizer libraries required
- Simplified setup: Remove llm_name parameter from your configuration
- Better accuracy: Improved position tracking and metadata preservation
- Enhanced stability: More robust error handling and validation
- Future-proof: Built on the latest Smartspace architecture for long-term support
Migration is straightforward: change your block name to SentenceChunk_2_0_0 and remove the llm_name parameter. All other configuration options remain compatible.