FileStore 1.0.0

Overview

Function Advanced

Description

Store and search embeddings in an in-memory vector database.

Args:
    top_k: The number of results to return in search.

Steps:
    1. Update index with embeddings.
    2. Search index to return relevant documents.

Configuration Options

Name	Data Type	Description	Default Value
top_k	`int`	Maximum number of top-ranked chunks to return from a semantic (vector similarity) search.	`5`
use_document_intelligence	`bool`	Enable Azure Document Intelligence for advanced document processing. When enabled, extracts text from PDF, DOCX, PPTX, and XLSX files with better accuracy and formatting preservation.	`True`

Inputs

Name	Data Type	Description
files	`list[File]`	List of files to process, chunk, and store in the vector database. Supports PDF, DOCX, PPTX, and XLSX as well as text-based formats.
filename	`str`	Name of a specific file to retrieve content for. Used to get the raw content of a previously processed file.
query	`str`	Search query text for semantic search. The query is embedded and compared against stored document chunks to find the most relevant content.

Outputs

Name	Data Type	Description
all_files	`list[FileInfo]`	Complete list of all files that have been processed and stored in the vector database, including metadata (filename, content_length, id).
files	`list[FileInfo]`	List of newly processed files in the current batch, containing metadata for files that were added in this operation.
file_content	`str`	Raw text content of a specific file retrieved by filename. Returns the complete extracted text content of the requested file.
chunks	`list[str]`	List of the most relevant text chunks returned from semantic search, ranked by similarity to the query. Contains the top_k most similar document segments.

Examples

# Store files and perform semantic search
- id: store_documents
  uses: FileStore@1.0.0
  with:
    top_k: 5
    use_document_intelligence: true
    files:
      - id: "doc1"
        name: "company_policy.pdf"
      - id: "doc2"
        name: "training_manual.docx"
  outputs:
    all_files: stored_files
    files: new_files

# Search for specific content
- id: search_content
  uses: FileStore@1.0.0
  with:
    query: "vacation policy and time off"
  outputs:
    chunks: search_results

# Retrieve specific file content
- id: get_file
  uses: FileStore@1.0.0
  with:
    filename: "company_policy.pdf"
  outputs:
    file_content: policy_text

# Advanced file processing with custom chunking
- id: process_research_papers
  uses: FileStore@1.0.0
  with:
    top_k: 10
    use_document_intelligence: false
    files:
      - id: "paper1"
        name: "ai_research.pdf"
      - id: "paper2"
        name: "machine_learning_trends.docx"
      - id: "paper3"
        name: "data_analysis.xlsx"
  chunk:
    uses: SemanticChunk@2.0.0
    with:
      chunk_size: 1000
      overlap: 200
  outputs:
    all_files: research_files
    files: processed_papers

# Multi-step search workflow
- id: search_methodology
  uses: FileStore@1.0.0
  with:
    query: "machine learning algorithms and neural networks"
    top_k: 8
  outputs:
    chunks: ml_chunks

- id: search_results_analysis
  uses: FileStore@1.0.0
  with:
    query: "experimental results and performance metrics"
    top_k: 6
  outputs:
    chunks: results_chunks

# Content retrieval for one of the processed files
- id: get_research_content
  uses: FileStore@1.0.0
  with:
    filename: "ai_research.pdf"
  outputs:
    file_content: research_content

Error Handling

FileContentExtractionError

Error Code: file_content_extraction_failed
Common Cause: Failed to extract text content from uploaded file due to corruption, unsupported format, or processing limitations
Solution: Verify file integrity, check format support, ensure file is not password-protected or corrupted

EmbeddingGenerationError

Error Code: embedding_generation_failed
Common Cause: Failed to generate embeddings for document chunks due to service limits or network issues
Solution: Check embedding service availability, verify content length limits, retry with smaller chunks
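The "retry with smaller chunks" advice can be sketched as follows. This is only an illustration under stated assumptions: `embed_with_fallback`, `fake_embed`, and the `min_size` parameter are hypothetical names, the exception class is a local stand-in for the error documented above, and how FileStore actually surfaces or retries this error is not specified here.

```python
class EmbeddingGenerationError(Exception):
    """Local stand-in for the documented error (assumption)."""


def embed_with_fallback(chunk, embed_fn, min_size=50):
    """Embed a chunk; on failure, split it in half and retry each half.

    Returns a list of (text, embedding) pairs. Re-raises the error once a
    piece is already at or below min_size characters.
    """
    try:
        return [(chunk, embed_fn(chunk))]
    except EmbeddingGenerationError:
        if len(chunk) <= min_size:
            raise  # too small to split further; let the caller handle it
        mid = len(chunk) // 2
        return (embed_with_fallback(chunk[:mid], embed_fn, min_size)
                + embed_with_fallback(chunk[mid:], embed_fn, min_size))


def fake_embed(text):
    """Hypothetical embedding service that rejects inputs over 100 characters."""
    if len(text) > 100:
        raise EmbeddingGenerationError("content too long")
    return [float(len(text))]


pieces = embed_with_fallback("x" * 250, fake_embed)
piece_lengths = [len(text) for text, _ in pieces]
```

Halving keeps the original text intact across the resulting pieces; it does not help with network failures, where a plain retry with backoff is the more fitting remedy.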

ChunkingError

Error Code: chunking_failed
Common Cause: Chunking tool failed to process document content, often due to invalid or malformed text
Solution: Verify chunking tool configuration, check input text format, ensure content is processable

FAQ

What file formats are supported for processing?

FileStore supports a wide range of formats: PDF, DOCX, PPTX, XLSX (via Document Intelligence when enabled), plus text formats like TXT, CSV, MD, HTML, RTF, and more via Pandoc conversion. Binary formats may require Document Intelligence for optimal text extraction.

How does the vector search work?

Documents are split into smaller chunks, each chunk is converted to an embedding by the embeddings service, and the embeddings are stored in memory. A search query is embedded the same way and compared against the stored chunks using cosine similarity; the top_k most similar chunks are returned.
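The store-then-search flow described here can be illustrated with a minimal sketch. It is not FileStore's implementation: `InMemoryIndex` and its methods are hypothetical names, and the three-dimensional vectors are toy values standing in for real embeddings from the embeddings service.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Hypothetical stand-in for the in-memory vector database."""

    def __init__(self):
        self._entries = []  # list of (chunk_text, embedding) pairs

    def update(self, chunks, embeddings):
        """Step 1: add chunks and their embeddings to the index."""
        self._entries.extend(zip(chunks, embeddings))

    def search(self, query_embedding, top_k=5):
        """Step 2: return the top_k chunks most similar to the query."""
        scored = [
            (cosine_similarity(query_embedding, embedding), text)
            for text, embedding in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


index = InMemoryIndex()
index.update(
    ["vacation policy", "expense reports", "time off requests"],
    [[1.0, 0.0, 0.2], [0.0, 1.0, 0.0], [0.8, 0.1, 0.1]],  # toy embeddings
)
results = index.search([1.0, 0.0, 0.0], top_k=2)
# → ['time off requests', 'vacation policy']
```

A linear scan like this is adequate for small in-memory collections; larger collections usually call for an approximate nearest-neighbor index.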

When should I enable Document Intelligence?

Enable Document Intelligence for complex documents (PDF, DOCX, PPTX, XLSX) that contain formatting, tables, or images. It provides better text extraction and layout preservation than basic Pandoc conversion, but may incur higher processing costs.

How are duplicate files handled?

FileStore tracks processed files by their ID and skips files that have already been processed. The all_files output contains every file processed so far, while the files output includes only the files newly processed in the current batch.
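The relationship between the two outputs can be sketched as below. This is an assumption-laden illustration, not FileStore's code: `FileRegistry` and `add_batch` are hypothetical names, the `FileInfo` fields follow the metadata listed under Outputs (filename, content_length, id), and the content lengths are made-up values.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    id: str
    filename: str
    content_length: int


class FileRegistry:
    """Hypothetical tracker that deduplicates files by ID."""

    def __init__(self):
        self._by_id = {}  # insertion-ordered mapping of id -> FileInfo

    def add_batch(self, batch):
        """Register a batch, skipping IDs that were already processed."""
        new_files = []
        for info in batch:
            if info.id in self._by_id:
                continue  # duplicate ID: do not reprocess
            self._by_id[info.id] = info
            new_files.append(info)
        return {"all_files": list(self._by_id.values()), "files": new_files}


registry = FileRegistry()
first = registry.add_batch([FileInfo("doc1", "company_policy.pdf", 1200)])
second = registry.add_batch([
    FileInfo("doc1", "company_policy.pdf", 1200),   # already seen, skipped
    FileInfo("doc2", "training_manual.docx", 800),  # new in this batch
])
```

After the second call, `second["files"]` holds only `doc2`, while `second["all_files"]` holds both `doc1` and `doc2`.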

Can I customize the chunking strategy?

Yes. FileStore uses a configurable ChunkTool. You can supply a different chunking block, such as SemanticChunk, TokenChunk, or SentenceChunk, with its own parameters to control how documents are segmented for embedding and search.
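To give a feel for what parameters such as chunk_size and overlap control, here is a minimal sliding-window sketch. It is not the behavior of SemanticChunk or any other chunking block: `chunk_text` is a hypothetical helper, and measuring sizes in characters is an assumption (the actual blocks may count tokens or sentences).

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
# → ['abcd', 'defg', 'ghij']
```

Overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.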