DocumentCreator 1.0.0

Overview

Data Beginner

Description

Converts content to a specified document format (Word, PDF, ODT, EPUB, etc.) using Pandoc and saves the converted file to blob storage. The content can be either a string (Markdown, HTML, or another supported format) or an existing file retrieved from blob storage.

Configuration Options

Name | Data Type | Description | Default Value
output_file_type | FileType | Target document format for conversion. Supported formats: DOCX, PDF, ODT, EPUB, HTML, RTF, JSON, PPTX, MARKDOWN. PDF conversion uses the XeLaTeX engine for advanced formatting. | FileType.DOCX
title | str | Output filename (without extension) for the converted document. Used as the base name when saving to blob storage. | Document_Title

Inputs

Name | Data Type | Description
content | Union[str, File] | Input content for conversion. Can be a string containing Markdown, HTML, or another supported markup format, or an existing File object from blob storage (supports PDF, DOCX, XLSX, PPTX, EML, ICS, TXT, CSV, TSV, JSON, and HTML formats).

Outputs

Name | Data Type | Description
result | File | Converted document file saved in blob storage. Contains file metadata (id, name, size, content_type) and can be used as input for other file processing blocks.

Examples

# Convert Markdown text to Word document
- id: create_word_doc
  uses: DocumentCreator@1.0.0
  with:
    title: "Sales Report Q4"
    output_file_type: "docx"
    content: |
      # Q4 Sales Report

      ## Executive Summary
      Sales increased by 23% compared to Q3, reaching $2.4M total revenue.

      ## Key Metrics
      - New customers: 450
      - Retention rate: 87%
      - Average deal size: $5,300

      ## Next Steps
      Focus on enterprise accounts and product expansion.
  outputs:
    result: sales_report_docx

Error Handling

PandocConversionError

Error Code: pandoc_conversion_failed
Common Cause: Pandoc cannot convert the input format to the target format because of unsupported features or malformed input.
Solution: Check input format compatibility, ensure the content is valid markup, and verify that Pandoc supports the conversion path.

BlobStorageError

Error Code: blob_save_failed
Common Cause: The converted file could not be saved to blob storage because of storage limits or connection issues.
Solution: Verify blob storage connectivity, check storage quotas, and ensure the blob service is configured properly.

FileProcessingError

Error Code: file_processing_failed
Common Cause: The input file is corrupted, empty, or in a format that is not supported for extraction or conversion.
Solution: Verify the input file's integrity, check that its format is supported, and ensure the file is not password-protected or corrupted.

FAQ

What input formats are supported for string content?

The block auto-detects the format from the content: HTML if it contains HTML tags, Markdown if it starts with #, *, or -, and reStructuredText if it starts with "..". Plain text that matches none of these defaults to Markdown. Supported formats include all Pandoc inputs: Markdown variants, HTML, RST, LaTeX, DocBook, MediaWiki, Textile, and more.
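
To illustrate the detection rule, the step below passes reStructuredText as a plain string. It is a sketch rather than a verified configuration: the step id, title, content, and output name are hypothetical, and it assumes output_file_type accepts "epub" in the same lowercase form that the Examples section uses for "docx".

```yaml
# Hypothetical step: id, title, content, and output name are illustrative only.
- id: create_epub_from_rst
  uses: DocumentCreator@1.0.0
  with:
    title: "Release_Notes"
    # Assumption: "epub" is accepted the same way "docx" is in the Examples section.
    output_file_type: "epub"
    # The content begins with an RST comment, so it starts with "..",
    # which is the pattern the FAQ gives for reStructuredText detection.
    content: |
      .. Draft release notes

      Release Notes
      =============

      - Faster export
      - Fewer conversion errors
  outputs:
    result: release_notes_epub
```

If the content started with plain prose instead, the rule above implies it would be treated as Markdown.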

What file formats can be converted from existing files?

Supported input files: PDF, DOCX, XLSX, PPTX (via Document Intelligence), EML email files, ICS calendar files, TXT/CSV/TSV text files, and JSON and HTML files. All are converted to Markdown internally before the final format conversion.

Why use XeLaTeX for PDF conversion?

The XeLaTeX engine reads UTF-8 input natively and can use system OpenType and TrueType fonts, which gives it better Unicode support and more typographic control than pdfTeX. This helps PDF output render fonts and international characters correctly.
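
A PDF request that exercises non-ASCII text might look like the sketch below. The step id, title, content, and output name are hypothetical, and it assumes output_file_type accepts "pdf" in the same lowercase form that the Examples section uses for "docx".

```yaml
# Hypothetical step: id, title, content, and output name are illustrative only.
- id: create_pdf_directory
  uses: DocumentCreator@1.0.0
  with:
    title: "Office_Directory"
    # Assumption: "pdf" is accepted the same way "docx" is in the Examples section.
    output_file_type: "pdf"
    content: |
      # Office Directory

      - Zürich
      - São Paulo
      - Kraków
  outputs:
    result: office_directory_pdf
```

Whether every glyph renders depends on the fonts available where the block runs. Scripts beyond accented Latin may need font configuration that this page does not document.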

How do I handle large files or batch conversions?

The block processes one file at a time. For large files, ensure adequate blob storage space. For batch processing, use multiple DocumentCreator blocks in parallel or implement a loop pattern with the file collection. Monitor memory usage for very large documents.
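
The "multiple blocks" approach can be sketched as two independent steps, one per document. The step ids, titles, contents, and output names are hypothetical. Whether the steps actually run concurrently depends on the workflow engine, which this page does not describe.

```yaml
# Hypothetical steps: ids, titles, contents, and output names are illustrative only.
# Each step converts exactly one document, matching the one-file-at-a-time limit.
- id: create_summary_docx
  uses: DocumentCreator@1.0.0
  with:
    title: "Summary"
    output_file_type: "docx"
    content: |
      # Summary
      First document in the batch.
  outputs:
    result: summary_docx

- id: create_details_docx
  uses: DocumentCreator@1.0.0
  with:
    title: "Details"
    output_file_type: "docx"
    content: |
      # Details
      Second document in the batch.
  outputs:
    result: details_docx
```

Giving each step its own title keeps the files saved to blob storage from sharing a base name.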

Can I preserve formatting when converting between formats?

Formatting preservation depends on target format capabilities. DOCX to PDF preserves most formatting. HTML to DOCX maintains structure and basic styling. Some advanced formatting may be lost in cross-format conversions due to format limitations.