ConvertDocumentContent 1.0.0
Overview
Description
⚠️ DEPRECATED: This block will be removed in a future version. Use the GetFileContent block instead for new workflows.
Converts file content to another format using Pandoc or Azure Document Intelligence. This block is useful for transforming documents between formats while preserving layout and structure where possible.
Configuration Options
| Name | Data Type | Description | Default Value |
|---|---|---|---|
| to_format | PandocToFormats | Target format for document conversion. Supports formats including Markdown, HTML, PDF, and DOCX through the Pandoc engine. | PandocToFormats.MARKDOWN |
| use_document_intelligence | bool | Enable Azure Document Intelligence for advanced extraction from PDF, DOCX, and PPTX files. Provides better layout preservation and OCR capabilities. | True |
Inputs
| Name | Data Type | Description |
|---|---|---|
| file | File | The file object to be converted. Must be a valid File object whose content is accessible through the blob storage service. |
Outputs
| Name | Data Type | Description |
|---|---|---|
| output | str | The converted document content as a string in the specified target format. |
Examples
# Basic document conversion to Markdown
name: convert_pdf_to_markdown
type: ConvertDocumentContent
config:
  to_format: MARKDOWN
  use_document_intelligence: true
inputs:
  file: "Invoice_Q4_2024.pdf"
outputs:
  output: "converted_content"
Error Handling
File Access Error
- Error Code: BlobServiceError
- Common Cause: File ID is invalid or blob storage service cannot access the file
- Solution: Verify the file exists and the file ID is correct. Check blob storage connectivity
Document Intelligence Service Error
- Error Code: DocumentIntelligenceError
- Common Cause: Azure Document Intelligence service is unavailable or rate limited
- Solution: Set use_document_intelligence to false to use Pandoc fallback, or retry after delay
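The Pandoc fallback described in this solution can be sketched as a variant of the configuration from the Examples section. This is a minimal illustration, not a verified workflow: the block name and the input file name are placeholders chosen for this sketch, and only the `to_format` and `use_document_intelligence` options come from the Configuration Options table.

```yaml
# Pandoc-only conversion: Azure Document Intelligence is switched off
name: convert_docx_pandoc_only          # placeholder block name (assumption)
type: ConvertDocumentContent
config:
  to_format: MARKDOWN
  use_document_intelligence: false      # skip Document Intelligence, convert with Pandoc
inputs:
  file: "Report_Q4_2024.docx"           # placeholder file name (assumption)
outputs:
  output: "converted_content"
```

With this setting the block no longer depends on the availability of the Document Intelligence service, at the cost of the layout preservation and OCR capabilities that service provides.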
Format Conversion Error
- Error Code: PandocError
- Common Cause: Target format is not supported for the input file type or content is corrupted
- Solution: Try a different target format or verify the source file is not corrupted
FAQ
Why should I use GetFileContent instead of ConvertDocumentContent?
GetFileContent is the newer, more comprehensive block that combines file content extraction with format conversion. It provides better error handling, supports more file types, and has cleaner output handling.
Which file formats support Document Intelligence?
Document Intelligence works best with PDF, DOCX, and PPTX files. It provides superior layout preservation and OCR capabilities for these formats compared to Pandoc alone.
How do I handle large documents efficiently?
For large documents, consider setting use_document_intelligence to false so that the block uses Pandoc directly, which may be faster. Also ensure your blob storage has sufficient throughput capacity.
What happens if both Document Intelligence and Pandoc fail?
The block will throw an exception and stop processing. Implement error handling in your workflow to catch conversion failures and potentially retry with different settings.
Can I convert between any two formats?
Not all format combinations are supported. Common conversions like PDF/DOCX to Markdown work well, but some specialized formats may not be supported by Pandoc or Document Intelligence.