ConvertDocumentContent 1.0.0

Overview

Data Beginner

Description

⚠️ DEPRECATED: This block will be removed in a future version. Use the GetFileContent block instead for new workflows.

Converts file content to another format using Pandoc or Azure Document Intelligence. Use it to transform documents between formats; layout and structure are preserved where the source and target formats allow.

Configuration Options

| Name | Data Type | Description | Default Value |
| --- | --- | --- | --- |
| to_format | `PandocToFormats` | Target format for the conversion. The Pandoc engine supports Markdown, HTML, PDF, DOCX, and other formats. | `PandocToFormats.MARKDOWN` |
| use_document_intelligence | `bool` | Enables Azure Document Intelligence for advanced extraction from PDF, DOCX, and PPTX files, with better layout preservation and OCR. | `True` |
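
The defaults in the table above can be illustrated with a small validation sketch. This is not the block's actual implementation: `ConvertConfig` and `load_config` are hypothetical names, and the `PandocToFormats` enum here lists only the two members mentioned on this page, with assumed values.

```python
from dataclasses import dataclass
from enum import Enum


class PandocToFormats(Enum):
    # Hypothetical subset: only the members named on this page, values assumed.
    MARKDOWN = "markdown"
    HTML = "html"


@dataclass(frozen=True)
class ConvertConfig:
    # Defaults mirror the "Default Value" column of the options table.
    to_format: PandocToFormats = PandocToFormats.MARKDOWN
    use_document_intelligence: bool = True


def load_config(raw: dict) -> ConvertConfig:
    """Build a ConvertConfig from a parsed `config:` mapping, applying defaults."""
    unknown = set(raw) - {"to_format", "use_document_intelligence"}
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")

    # Format names are looked up by enum member name, e.g. "MARKDOWN" or "html".
    fmt = raw.get("to_format", PandocToFormats.MARKDOWN.name)
    try:
        to_format = PandocToFormats[str(fmt).upper()]
    except KeyError:
        raise ValueError(f"Unsupported to_format: {fmt!r}") from None

    use_di = raw.get("use_document_intelligence", True)
    if not isinstance(use_di, bool):
        raise ValueError("use_document_intelligence must be a bool")

    return ConvertConfig(to_format, use_di)
```

Under these assumptions, an empty `config:` mapping resolves to Markdown output with Document Intelligence enabled, and an unrecognized `to_format` is rejected before any conversion is attempted.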

Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| file | `File` | The file object to be converted. Must be a valid File object with accessible content through the blob storage service. |

Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| output | `str` | The converted document content as a string in the specified target format. |

Examples

# Basic document conversion to Markdown
name: convert_pdf_to_markdown
type: ConvertDocumentContent
config:
  to_format: MARKDOWN
  use_document_intelligence: true
inputs:
  file: "Invoice_Q4_2024.pdf"
outputs:
  output: "converted_content"

# Convert a presentation to HTML using Pandoc directly (Document Intelligence disabled)
name: convert_presentation
type: ConvertDocumentContent
config:
  to_format: HTML
  use_document_intelligence: false
inputs:
  file: "Quarterly_Review.pptx"
outputs:
  output: "html_presentation"

Error Handling

File Access Error

Error Code: BlobServiceError
Common Cause: File ID is invalid or blob storage service cannot access the file
Solution: Verify that the file exists and the file ID is correct, and check blob storage connectivity.

Document Intelligence Service Error

Error Code: DocumentIntelligenceError
Common Cause: Azure Document Intelligence service is unavailable or rate limited
Solution: Set use_document_intelligence to false to use the Pandoc fallback, or retry after a delay.
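
The "retry after a delay" option can be sketched as a small exponential-backoff helper. This is a minimal illustration, not platform code: `retry_with_backoff` is a hypothetical name, `DocumentIntelligenceError` is a local stand-in for the error code above, and `call` represents whatever function runs the block in your environment.

```python
import time


class DocumentIntelligenceError(Exception):
    """Local stand-in for the error code above; not the platform's real class."""


def retry_with_backoff(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` up to `attempts` times, doubling the delay between tries.

    Only DocumentIntelligenceError is retried; any other exception propagates
    immediately. `sleep` is injectable so the helper can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except DocumentIntelligenceError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With `attempts=3` and `base_delay=1.0`, the helper waits 1 second and then 2 seconds between tries before giving up. Suitable values depend on the rate limits of your Document Intelligence resource.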

Format Conversion Error

Error Code: PandocError
Common Cause: Target format is not supported for the input file type or content is corrupted
Solution: Try a different target format, or verify that the source file is not corrupted.

FAQ

Why should I use GetFileContent instead of ConvertDocumentContent?

GetFileContent is the newer, more comprehensive block that combines file content extraction with format conversion. It provides better error handling, supports more file types, and has cleaner output handling.

Which file formats support Document Intelligence?

Document Intelligence works best with PDF, DOCX, and PPTX files. It provides superior layout preservation and OCR capabilities for these formats compared to Pandoc alone.

How do I handle large documents efficiently?

For large documents, consider setting use_document_intelligence to false so that the block uses Pandoc directly, which may be faster. Also ensure that your blob storage has sufficient throughput capacity.

What happens if both Document Intelligence and Pandoc fail?

The block will throw an exception and stop processing. Implement error handling in your workflow to catch conversion failures and potentially retry with different settings.
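
One way to "retry with different settings" is to try an ordered list of configurations and stop at the first success. The sketch below assumes a hypothetical callable `convert(file, **settings)` that stands in for a run of the block; `convert_with_fallback`, `ConversionFailed`, and `FALLBACK_ORDER` are illustrative names, not part of the platform.

```python
class ConversionFailed(Exception):
    """Raised when every candidate setting has failed."""


# Try Document Intelligence first, then Pandoc alone (assumed ordering).
FALLBACK_ORDER = [
    {"use_document_intelligence": True},
    {"use_document_intelligence": False},
]


def convert_with_fallback(convert, file, settings=FALLBACK_ORDER):
    """Try `convert(file, **s)` for each settings dict in order.

    Returns (result, settings_used) from the first success. If every attempt
    fails, raises ConversionFailed with all collected error messages.
    """
    errors = []
    for s in settings:
        try:
            return convert(file, **s), s
        except Exception as exc:  # broad on purpose: any failure moves to the next setting
            errors.append(f"{s}: {exc}")
    raise ConversionFailed("; ".join(errors))
```

Returning the settings that succeeded lets the workflow log which path produced the output, which is useful when Document Intelligence and Pandoc yield different layouts.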

Can I convert between any two formats?

Not all format combinations are supported. Common conversions like PDF/DOCX to Markdown work well, but some specialized formats may not be supported by Pandoc or Document Intelligence.
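
For the Pandoc path, a local Pandoc installation can report which formats it reads and writes through its `--list-input-formats` and `--list-output-formats` flags. The sketch below wraps those flags; the helper names are hypothetical, and a local Pandoc may differ in version from the one the block uses, so treat the result as a hint rather than a guarantee. It does not cover the Document Intelligence path, which handles inputs such as PDF that Pandoc itself cannot read.

```python
import shutil
import subprocess


def parse_formats(listing: str) -> set[str]:
    """Turn `pandoc --list-*-formats` output (one name per line) into a set."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def pandoc_formats(direction: str) -> set[str]:
    """Ask a local pandoc for its formats; `direction` is "input" or "output"."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        ["pandoc", f"--list-{direction}-formats"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_formats(result.stdout)


def can_convert(source: str, target: str, inputs: set[str], outputs: set[str]) -> bool:
    """True when Pandoc lists `source` as readable and `target` as writable."""
    return source in inputs and target in outputs
```

A typical check would be `can_convert("docx", "markdown", pandoc_formats("input"), pandoc_formats("output"))`, run once before queuing a batch of conversions.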