ConvertDocumentContent 1.0.0

Overview

Data Beginner

Description

⚠️ DEPRECATED: This block will be removed in a future version. Use the GetFileContent block instead for new workflows.

Converts file content to another format using Pandoc or Azure Document Intelligence. Use this block to transform documents between formats while preserving layout and structure where possible.

Configuration Options

| Name | Data Type | Description | Default Value |
| --- | --- | --- | --- |
| to_format | PandocToFormats | Target format for document conversion. Supports various formats including Markdown, HTML, PDF, DOCX, and more through the Pandoc engine. | PandocToFormats.MARKDOWN |
| use_document_intelligence | bool | Enable Azure Document Intelligence for advanced extraction from PDF, DOCX, and PPTX files. Provides better layout preservation and OCR capabilities. | True |

Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| file | File | The file object to be converted. Must be a valid File object with accessible content through the blob storage service. |

Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| output | str | The converted document content as a string in the specified target format. |

Examples

# Basic document conversion to Markdown
name: convert_pdf_to_markdown
type: ConvertDocumentContent
config:
  to_format: MARKDOWN
  use_document_intelligence: true
inputs:
  file: "Invoice_Q4_2024.pdf"
outputs:
  output: "converted_content"

Error Handling

File Access Error

Error Code: BlobServiceError
Common Cause: The file ID is invalid, or the blob storage service cannot access the file.
Solution: Verify that the file exists and the file ID is correct. Check blob storage connectivity.

Document Intelligence Service Error

Error Code: DocumentIntelligenceError
Common Cause: The Azure Document Intelligence service is unavailable or rate limited.
Solution: Set use_document_intelligence to false to use the Pandoc fallback, or retry after a delay.
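
As a minimal sketch of the Pandoc fallback, the block can be configured with Document Intelligence switched off. The block name and file name below are illustrative and not taken from a real workflow. Because use_document_intelligence defaults to True, the flag has to be set explicitly.

```yaml
# Pandoc-only conversion: skips Azure Document Intelligence entirely
name: convert_docx_with_pandoc        # hypothetical block name
type: ConvertDocumentContent
config:
  to_format: MARKDOWN
  use_document_intelligence: false    # bypass the unavailable or rate-limited service
inputs:
  file: "Quarterly_Report.docx"       # illustrative file name
outputs:
  output: "converted_content"
```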

Format Conversion Error

Error Code: PandocError
Common Cause: The target format is not supported for the input file type, or the content is corrupted.
Solution: Try a different target format, or verify that the source file is not corrupted.
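
Trying a different target format only requires changing to_format. The sketch below assumes that HTML is a valid PandocToFormats member, by analogy with MARKDOWN. This page lists HTML as a supported format but does not show the member name, so check the enum before relying on it. The block name and file name are illustrative.

```yaml
# Same conversion, pointed at a different target format
name: convert_docx_to_html            # hypothetical block name
type: ConvertDocumentContent
config:
  to_format: HTML                     # assumed enum member name; not confirmed by this page
  use_document_intelligence: false    # use Pandoc directly
inputs:
  file: "Quarterly_Report.docx"       # illustrative file name
outputs:
  output: "converted_content"
```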

FAQ

Why should I use GetFileContent instead of ConvertDocumentContent?

GetFileContent is the newer, more comprehensive block that combines file content extraction with format conversion. It provides better error handling, supports more file types, and has cleaner output handling.

Which file formats support Document Intelligence?

Document Intelligence works best with PDF, DOCX, and PPTX files. It provides superior layout preservation and OCR capabilities for these formats compared to Pandoc alone.

How do I handle large documents efficiently?

For large documents, consider setting use_document_intelligence to false so that the block uses Pandoc directly, which may be faster. Also make sure your blob storage has sufficient throughput capacity.

What happens if both Document Intelligence and Pandoc fail?

The block raises an exception and stops processing. Add error handling to your workflow to catch conversion failures and, where appropriate, retry with different settings.

Can I convert between any two formats?

Not all format combinations are supported. Common conversions like PDF/DOCX to Markdown work well, but some specialized formats may not be supported by Pandoc or Document Intelligence.