GetFiles 1.0.0

Overview

Data Beginner

Description

Converts multiple files to a specified format and outputs their contents as a list. Keeps track of processed files to avoid reprocessing.

Configuration Options

| Name | Data Type | Description | Default Value |
| --- | --- | --- | --- |
| `to_format` | PandocToFormats | Target output format for all processed files. Converts all file content to the specified format (MARKDOWN, HTML, JSON, etc.) for consistent processing. | PandocToFormats.MARKDOWN |
| `use_document_intelligence` | bool | Whether to use Azure Document Intelligence for supported file types. Provides enhanced text extraction and layout preservation for documents and images. | True |

Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| `files` | list[File] | List of files to process and convert. The block maintains state to avoid reprocessing previously handled files, making it efficient for incremental batch operations. |

Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| `output` | list[str] | List of converted file contents as strings, in the order the files were successfully processed (not necessarily the input order). Each string contains the full content of one file converted to the specified format. |

Examples

# Process multiple documents and extract their content as Markdown
- block_type: GetFiles
  block_source: native
  block_version: 1.0.0
  config:
    to_format: MARKDOWN
    use_document_intelligence: true
  inputs:
    files: "{{uploaded_documents}}"  # List of PDF, DOCX, or other files
  outputs:
    output: "{{processed_content_list}}"  # List of extracted text content

Error Handling

Common Errors and Solutions

Failed to process file [filename]

Cause: Individual file processing failed due to format issues, corruption, or service unavailability.

Solution: The block continues processing other files and excludes failed files from the processed list. Check file format compatibility and try reprocessing failed files individually.

Document Intelligence quota exceeded

Cause: Azure Document Intelligence service has reached its usage limit.

Solution: Set `use_document_intelligence` to `false` to use Pandoc fallback, or wait for quota reset. Consider batch processing during off-peak hours.

Memory issues with large file batches

Cause: Processing many large files simultaneously can exceed memory limits.

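Solution: Process files in smaller batches or use streaming approaches. Because the block remembers which files it has already handled, splitting the work across multiple runs within the same workflow execution does not cause files to be reprocessed.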
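One way to keep batches small is to split the file list before it reaches the block. The helper below is a minimal sketch, assuming your workflow can invoke the block once per batch; `in_batches` is a hypothetical name and is not part of the GetFiles block.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def in_batches(items: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names; in a real workflow these would be File objects.
file_names = ["a.pdf", "b.docx", "c.pptx", "d.xlsx", "e.pdf"]
batches = list(in_batches(file_names, 2))
```

Here `batches` holds three lists of two, two, and one file names. A smaller `batch_size` lowers peak memory at the cost of more runs.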

Inconsistent output format

Cause: Some files cannot be converted to the target format specified in `to_format`.

Solution: Check file format compatibility with the target format. Use MARKDOWN as a reliable fallback format that works with most source types.

FAQ

How does the state management work to avoid reprocessing?

The block maintains an internal list of processed file IDs and their extracted content. When you run the block again with the same files, it only processes new files that haven't been seen before. This makes it efficient for incremental batch processing and avoids duplicate work.

What happens if one file fails to process?

The block continues processing other files and prints an error message for failed files. Failed files are not added to the processed list, allowing them to be retried on subsequent runs. The output list contains only successfully processed content.

How is the output order determined?

The output list maintains the order of successfully processed files, not necessarily the input order. Files processed in previous runs appear first, followed by newly processed files. Failed files do not appear in the output list.
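The behavior described in the three answers above (skipping known files, retrying failed ones, and ordering output by processing time) can be modeled with an insertion-ordered dictionary. This is only an illustrative sketch, not the block's actual implementation: `FileStub`, `GetFilesState`, and `convert` are hypothetical names, and the real block may store its state differently.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FileStub:
    """Hypothetical stand-in for the platform's File type."""
    id: str
    data: str


@dataclass
class GetFilesState:
    """Hypothetical model of the block's internal state."""
    # Maps file ID -> converted content; dicts preserve insertion order,
    # so values come back in the order files were successfully processed.
    processed: dict[str, str] = field(default_factory=dict)

    def run(self, files: list[FileStub],
            convert: Callable[[FileStub], str]) -> list[str]:
        for f in files:
            if f.id in self.processed:
                continue  # handled in an earlier run: skip
            try:
                self.processed[f.id] = convert(f)
            except Exception as exc:
                # Failed files are not recorded, so a later run retries them.
                print(f"Failed to process file {f.id}: {exc}")
        return list(self.processed.values())


def convert(f: FileStub) -> str:
    """Toy converter: fails on 'corrupt' data, otherwise upper-cases it."""
    if f.data == "corrupt":
        raise ValueError("unreadable")
    return f.data.upper()


state = GetFilesState()
first = state.run([FileStub("a", "alpha"), FileStub("b", "corrupt")], convert)
# Second run: "c" is new, "b" is a repaired copy, "a" is already processed.
second = state.run(
    [FileStub("c", "gamma"), FileStub("b", "beta"), FileStub("a", "alpha")],
    convert,
)
```

In this model `first` contains only the content of file `a`, because `b` failed. In `second`, the content of `a` stays at the front even though `a` was listed last in the input, and the newly processed `c` and `b` follow in the order they were converted.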

Can I reset the processed files state?

The processed files state persists within the same workflow execution. To reset and reprocess all files, you would need to start a new workflow execution or use individual GetFileContent blocks instead.

What file formats work best with this block?

PDF, DOCX, XLSX, and PPTX work best with Document Intelligence enabled. Plain text formats (TXT, HTML, JSON) process quickly with either method. Images with text require Document Intelligence for OCR. Complex formats may need specific format combinations for optimal results.