FileStore 1.0.0

Description

Store and search embeddings in an in-memory vector database.

Args:
top_k: The number of results to return in search.

Steps:
1. Update the index with embeddings.
2. Search the index to return relevant documents.

Configuration Options

| Name | Data Type | Description | Default Value |
| --- | --- | --- | --- |
| top_k | int | Number of top-ranked results to return when performing semantic search. Controls the maximum number of results returned from vector similarity search. | 5 |
| use_document_intelligence | bool | Enable Azure Document Intelligence for advanced document processing. When enabled, extracts text from PDF, DOCX, PPTX, and XLSX files with better accuracy and formatting preservation. | True |

Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| files | list[File] | List of files to process, chunk, and store in the vector database. Supports various formats including PDF, DOCX, PPTX, XLSX, and other text-based formats. |
| filename | str | Name of a specific file to retrieve content for. Used to get the raw content of a previously processed file. |
| query | str | Search query text for semantic search. The query is embedded and compared against stored document chunks to find the most relevant content. |

Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| all_files | list[FileInfo] | Complete list of all files that have been processed and stored in the vector database, including metadata (filename, content_length, id). |
| files | list[FileInfo] | List of newly processed files in the current batch, containing metadata for files that were added in this operation. |
| file_content | str | Raw text content of a specific file retrieved by filename. Returns the complete extracted text content of the requested file. |
| chunks | list[str] | List of the most relevant text chunks returned from semantic search, ranked by similarity to the query. Contains the top_k most similar document segments. |

Examples

# Store files and perform semantic search
- id: store_documents
  uses: FileStore@1.0.0
  with:
    top_k: 5
    use_document_intelligence: true
    files:
      - id: "doc1"
        name: "company_policy.pdf"
      - id: "doc2"
        name: "training_manual.docx"
  outputs:
    all_files: stored_files
    files: new_files

# Search for specific content
- id: search_content
  uses: FileStore@1.0.0
  with:
    query: "vacation policy and time off"
  outputs:
    chunks: search_results

# Retrieve specific file content
- id: get_file
  uses: FileStore@1.0.0
  with:
    filename: "company_policy.pdf"
  outputs:
    file_content: policy_text

Error Handling

FileContentExtractionError

Error Code: file_content_extraction_failed
Common Cause: Text could not be extracted from the uploaded file because of corruption, an unsupported format, or processing limitations.
Solution: Verify file integrity, check that the format is supported, and ensure the file is not password-protected.

EmbeddingGenerationError

Error Code: embedding_generation_failed
Common Cause: Embeddings could not be generated for document chunks because of service limits or network issues.
Solution: Check embedding service availability, verify content length limits, and retry with smaller chunks.
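
The "retry with smaller chunks" advice can be sketched as follows. This is only an illustration under stated assumptions: `EmbeddingTooLongError`, `fake_embed`, and `embed_with_split` are hypothetical names, and the 20-character limit is invented for the demo. None of them are part of FileStore or of any real embeddings service.

```python
from typing import Callable


class EmbeddingTooLongError(Exception):
    """Hypothetical error raised when the input exceeds the service limit."""


def fake_embed(text: str, limit: int = 20) -> list[float]:
    """Stand-in for an embeddings call; rejects inputs over `limit` characters."""
    if len(text) > limit:
        raise EmbeddingTooLongError(f"{len(text)} characters exceeds limit of {limit}")
    return [float(len(text))]


def embed_with_split(
    text: str,
    embed: Callable[[str], list[float]],
    min_chars: int = 1,
) -> list[tuple[str, list[float]]]:
    """Embed `text`; on a length failure, halve it and retry each half."""
    try:
        return [(text, embed(text))]
    except EmbeddingTooLongError:
        if len(text) <= min_chars:
            raise  # cannot split any further, surface the error
        mid = len(text) // 2
        return (
            embed_with_split(text[:mid], embed, min_chars)
            + embed_with_split(text[mid:], embed, min_chars)
        )


pieces = embed_with_split("a" * 50, fake_embed)
```

Splitting at the character midpoint keeps the sketch short; a real pipeline would more likely split on sentence or token boundaries so that chunks stay readable.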

ChunkingError

Error Code: chunking_failed
Common Cause: The chunking tool failed to process document content, often because the text is invalid or malformed.
Solution: Verify the chunking tool configuration, check the input text format, and ensure the content is processable.

FAQ

What file formats are supported for processing?

FileStore supports a wide range of formats: PDF, DOCX, PPTX, XLSX (via Document Intelligence when enabled), plus text formats like TXT, CSV, MD, HTML, RTF, and more via Pandoc conversion. Binary formats may require Document Intelligence for optimal text extraction.

How does the vector search work?

Documents are split into smaller chunks, each chunk is converted to an embedding by the embeddings service, and the embeddings are stored in memory. A search query is embedded the same way and compared against the stored chunks using cosine similarity; the top_k most similar chunks are returned.
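
The embed, store, and rank flow described above can be illustrated with a small sketch. It is not FileStore's implementation: `embed` here is a toy bag-of-words stand-in for the real embeddings service, and `InMemoryIndex` is a hypothetical class used only to show cosine-similarity ranking with a top_k cutoff.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for the embeddings service: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Hypothetical in-memory index that stores (chunk, vector) pairs."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def update(self, chunks: list[str]) -> None:
        """Step 1: embed each chunk and keep it in memory."""
        for chunk in chunks:
            self._entries.append((chunk, embed(chunk)))

    def search(self, query: str, top_k: int = 5) -> list[str]:
        """Step 2: rank stored chunks by similarity to the query."""
        query_vec = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


index = InMemoryIndex()
index.update([
    "Employees accrue vacation days every month.",
    "The office kitchen is cleaned on Fridays.",
    "Time off requests need manager approval.",
])
results = index.search("vacation policy and time off", top_k=2)
```

With top_k set to 2, the chunk about time off requests ranks first and the vacation chunk second; the unrelated kitchen chunk is dropped.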

When should I enable Document Intelligence?

Enable Document Intelligence for complex documents (PDF, DOCX, PPTX, XLSX) that contain formatting, tables, or images. It provides better text extraction and layout preservation than basic Pandoc conversion, but may incur higher processing costs.

How are duplicate files handled?

FileStore tracks processed files by their ID and skips duplicates instead of reprocessing them. The all_files output contains every file processed so far, while the files output includes only the files newly processed in the current batch.
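
A minimal sketch of this ID-based deduplication, assuming a hypothetical `FileRegistry` class. The `FileInfo` fields follow the metadata listed in the Outputs section, but the class itself and the content_length values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    id: str
    filename: str
    content_length: int


class FileRegistry:
    """Hypothetical registry that skips files whose ID was already processed."""

    def __init__(self) -> None:
        self._by_id: dict[str, FileInfo] = {}

    def add_batch(self, batch: list[FileInfo]) -> tuple[list[FileInfo], list[FileInfo]]:
        """Return (all_files, files): everything stored so far, and only the new ones."""
        new_files = []
        for info in batch:
            if info.id in self._by_id:
                continue  # duplicate ID: skip instead of reprocessing
            self._by_id[info.id] = info
            new_files.append(info)
        return list(self._by_id.values()), new_files


registry = FileRegistry()
policy = FileInfo("doc1", "company_policy.pdf", 1200)    # illustrative lengths
manual = FileInfo("doc2", "training_manual.docx", 3400)

registry.add_batch([policy])
all_files, files = registry.add_batch([policy, manual])
```

After the second call, `all_files` holds both documents while `files` holds only `training_manual.docx`, because `doc1` was already registered.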

Can I customize the chunking strategy?

Yes, FileStore uses a configurable ChunkTool that can be customized. You can provide different chunking blocks like SemanticChunk, TokenChunk, or SentenceChunk with specific parameters to control how documents are segmented for embedding and search.
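
To show how the choice of chunking strategy changes the segments that get embedded, here is a small sketch. `sentence_chunker` and `token_chunker` are hypothetical helpers written for this illustration (tokens are approximated by whitespace-separated words); they are not the SentenceChunk or TokenChunk blocks themselves, whose parameters are not documented here.

```python
import re
from typing import Callable

Chunker = Callable[[str], list[str]]


def sentence_chunker(max_sentences: int) -> Chunker:
    """Build a chunker that groups up to `max_sentences` sentences per chunk."""
    def chunk(text: str) -> list[str]:
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return [
            " ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)
        ]
    return chunk


def token_chunker(max_tokens: int) -> Chunker:
    """Build a chunker that groups up to `max_tokens` whitespace tokens per chunk."""
    def chunk(text: str) -> list[str]:
        tokens = text.split()
        return [
            " ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)
        ]
    return chunk


text = "Vacation accrues monthly. Requests need approval. Unused days expire."
by_sentence = sentence_chunker(max_sentences=2)(text)
by_token = token_chunker(max_tokens=4)(text)
```

The sentence-based chunker keeps each sentence intact, while the token-based one produces evenly sized chunks that can cut across sentence boundaries. That trade-off is usually what drives the choice between the two styles.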