DocumentCreator 1.0.0

Overview

Data Beginner

Description

Converts content to a specified document format (Word, PDF, ODT, EPUB, etc.) using Pandoc and saves the result to blob storage. The input is either a string (Markdown, HTML, or another supported format) or an existing file retrieved from blob storage.

Configuration Options

Name	Data Type	Description	Default Value
output_file_type	`FileType`	Target document format for conversion. Supported formats: DOCX, PDF, ODT, EPUB, HTML, RTF, JSON, PPTX, MARKDOWN. PDF conversion uses the XeLaTeX engine for advanced formatting.	`FileType.DOCX`
title	`str`	Output filename (without extension) for the converted document. Used as the base name when saving to blob storage.	`Document_Title`
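
As a rough illustration of how `title` and `output_file_type` might combine into the saved filename, the sketch below uses a stand-in `FileType` enum. The enum values, the `md` extension for Markdown, and the `output_filename` helper are assumptions made for illustration; the block's actual `FileType` definition and naming logic are not shown on this page.

```python
from enum import Enum


class FileType(Enum):
    """Hypothetical stand-in for the block's FileType; values are assumed extensions."""
    DOCX = "docx"
    PDF = "pdf"
    ODT = "odt"
    EPUB = "epub"
    HTML = "html"
    RTF = "rtf"
    JSON = "json"
    PPTX = "pptx"
    MARKDOWN = "md"  # assumed extension for Markdown output


def output_filename(title: str = "Document_Title",
                    file_type: FileType = FileType.DOCX) -> str:
    """Join the base name (title) with the extension implied by the target format."""
    return f"{title}.{file_type.value}"


print(output_filename())                                  # Document_Title.docx
print(output_filename("Sales Report Q4", FileType.PDF))   # Sales Report Q4.pdf
```

With both options left at their defaults, this sketch yields `Document_Title.docx`.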

Inputs

Name	Data Type	Description
content	`Union[str, File]`	Input content for conversion. Can be a string containing Markdown, HTML, or other supported markup formats, OR an existing File object from blob storage (supports PDF, DOCX, XLSX, PPTX, EML, ICS, TXT, CSV, TSV, JSON, HTML formats).

Outputs

Name	Data Type	Description
result	`File`	Converted document file saved in blob storage. Contains file metadata (id, name, size, content_type) and can be used as input for other file processing blocks.

Examples

# Convert Markdown text to Word document
- id: create_word_doc
  uses: DocumentCreator@1.0.0
  with:
    title: "Sales Report Q4"
    output_file_type: "docx"
    content: |
      # Q4 Sales Report

      ## Executive Summary
      Sales increased by 23% compared to Q3, reaching $2.4M total revenue.

      ## Key Metrics
      - New customers: 450
      - Retention rate: 87%
      - Average deal size: $5,300

      ## Next Steps
      Focus on enterprise accounts and product expansion.
  outputs:
    result: sales_report_docx

    

# Convert HTML content to PDF with custom title
- id: html_to_pdf
  uses: DocumentCreator@1.0.0
  with:
    title: "Company_Handbook_2024"
    output_file_type: "pdf"
    content: |
      <!DOCTYPE html>
      <html>
      <head><title>Employee Handbook</title></head>
      <body>
        <h1>Welcome to Our Company</h1>
        <h2>Code of Conduct</h2>
        <p>We value integrity, innovation, and collaboration.</p>
        <h2>Policies</h2>
        <ul>
          <li>Remote work: Flexible arrangements available</li>
          <li>PTO: 25 days annually</li>
          <li>Health benefits: Full coverage</li>
        </ul>
      </body>
      </html>
  outputs:
    result: handbook_pdf

# Create EPUB from reStructuredText
- id: rst_to_epub
  uses: DocumentCreator@1.0.0
  with:
    title: "User_Guide_v2"
    output_file_type: "epub"
    content: |
      User Guide
      ==========

      Installation
      ------------

      Follow these steps to install the software:

      1. Download the installer
      2. Run the setup wizard
      3. Complete configuration

      Configuration
      -------------

      Set your preferences in the Settings panel.
  outputs:
    result: user_guide_epub

    

# Convert existing PDF to Word document
- id: pdf_to_word
  uses: DocumentCreator@1.0.0
  with:
    output_file_type: "docx"
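    # Placeholder: pass the File output of a previous step here.
    # A literal string would be treated as text content to convert.
    content: "Reference to PDF file from previous step"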
  outputs:
    result: converted_word_doc

# Convert DOCX to HTML for web publishing
- id: docx_to_html
  uses: DocumentCreator@1.0.0
  with:
    title: "Web_Article_2024"
    output_file_type: "html"
    # Placeholder: pass the File output of a previous step here.
    content: "Reference to DOCX file from previous step"
  outputs:
    result: web_article_html

# Convert email file to PDF
- id: email_to_pdf
  uses: DocumentCreator@1.0.0
  with:
    title: "Important_Email_Archive"
    output_file_type: "pdf"
    # Placeholder: pass the File output of an email processing step here.
    content: "Reference to EML file from email processing"
  outputs:
    result: archived_email_pdf

    

Error Handling

PandocConversionError

Error Code: pandoc_conversion_failed
Common Cause: Pandoc cannot convert the input format to the target format because of unsupported features or malformed input
Solution: Check input format compatibility, ensure the content is valid markup, and verify that Pandoc supports the conversion path

BlobStorageError

Error Code: blob_save_failed
Common Cause: Failed to save converted file to blob storage due to storage limits or connection issues
Solution: Verify blob storage connectivity, check storage quotas, ensure proper blob service configuration

FileProcessingError

Error Code: file_processing_failed
Common Cause: Input file is corrupted, empty, or in a format not supported for extraction or conversion
Solution: Verify input file integrity, check file format support, ensure file is not password-protected or corrupted
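
The error codes above can be collected into a small lookup table for use in logging or alerting. This is only a sketch: the `REMEDIES` table and `remedy_for` helper are hypothetical, and how the block actually surfaces its error codes to a caller is not documented here.

```python
# Error codes and condensed fixes, taken from the list above.
REMEDIES = {
    "pandoc_conversion_failed": "Check input format compatibility and that the content is valid markup.",
    "blob_save_failed": "Verify blob storage connectivity, quotas, and service configuration.",
    "file_processing_failed": "Verify the input file is intact, supported, and not password-protected.",
}


def remedy_for(error_code: str) -> str:
    """Return the suggested fix for a known error code (hypothetical helper)."""
    return REMEDIES.get(error_code, "Unrecognized error code.")


print(remedy_for("blob_save_failed"))
```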

FAQ

What input formats are supported for string content?

The block auto-detects the format from the content: HTML if it contains HTML tags, Markdown if it starts with #, *, or -, and reStructuredText if it starts with "..". Plain text defaults to Markdown. Supported formats include all Pandoc inputs: Markdown variants, HTML, RST, LaTeX, DocBook, MediaWiki, Textile, and more.
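
The detection rules in this answer can be sketched as a small function. This is one possible reading of those rules, not the block's actual code: the `detect_format` name and the regular expression used to spot HTML tags are assumptions. The returned strings are Pandoc reader names.

```python
import re

# Assumed pattern for "contains HTML tags": a doctype, or an opening/closing tag.
HTML_TAG = re.compile(r"<!DOCTYPE|</?[a-zA-Z][^>]*>", re.IGNORECASE)


def detect_format(content: str) -> str:
    """Guess a Pandoc input format from the content, following the rules above."""
    text = content.lstrip()
    if HTML_TAG.search(text):
        return "html"
    if text.startswith(".."):
        return "rst"
    if text.startswith(("#", "*", "-")):
        return "markdown"
    return "markdown"  # plain text falls back to Markdown


print(detect_format("<p>Hello</p>"))   # html
print(detect_format(".. note:: Hi"))   # rst
print(detect_format("# Title"))        # markdown
```

Under this reading, reStructuredText that opens with an underlined heading rather than ".." would fall through to Markdown, so starting such content with an explicit directive or comment line may be the safer choice.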

What file formats can be converted from existing files?

Supported input files: PDF, DOCX, XLSX, PPTX (via Document Intelligence), EML email and ICS calendar files, TXT/CSV/TSV text files, and JSON and HTML files. All are converted to Markdown internally before the final format conversion.

Why use XeLaTeX for PDF conversion?

Compared with pdfTeX, the XeLaTeX engine provides stronger Unicode support, advanced typography, and better handling of complex layouts. The result is high-quality PDF output with proper font rendering and support for international characters.
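
For readers who want to reproduce a similar conversion locally, the sketch below builds a Pandoc command line that selects XeLaTeX when the output is a PDF. The `pandoc_args` helper is hypothetical and only assembles arguments; the exact options the block passes to Pandoc are not documented here. The flags used (`--from`, `--output`, `--pdf-engine`) are standard Pandoc options.

```python
from typing import List


def pandoc_args(input_path: str, output_path: str,
                input_format: str = "markdown") -> List[str]:
    """Build a Pandoc argument list; Pandoc infers the output format from the extension."""
    args = ["pandoc", input_path, "--from", input_format, "--output", output_path]
    if output_path.lower().endswith(".pdf"):
        # Use XeLaTeX instead of Pandoc's default PDF engine.
        args.append("--pdf-engine=xelatex")
    return args


print(pandoc_args("handbook.html", "handbook.pdf", input_format="html"))
```

The resulting list could be passed to `subprocess.run(args, check=True)` on a machine where both Pandoc and a XeLaTeX distribution are installed.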

How do I handle large files or batch conversions?

The block processes one file at a time. For large files, ensure adequate blob storage space. For batch processing, use multiple DocumentCreator blocks in parallel or implement a loop pattern over the file collection. Monitor memory usage for very large documents.
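
One way to picture the loop pattern is to generate one DocumentCreator step per file, mirroring the structure of the YAML examples on this page. The `build_steps` helper, the generated ids and titles, and the idea of producing step definitions programmatically are all assumptions for illustration; the entries in `file_refs` are placeholders for whatever file reference syntax the workflow engine uses.

```python
from typing import Dict, List


def build_steps(file_refs: List[str], output_file_type: str = "pdf") -> List[Dict]:
    """Create one DocumentCreator step definition per input file reference."""
    steps = []
    for i, ref in enumerate(file_refs, start=1):
        steps.append({
            "id": f"convert_{i}",
            "uses": "DocumentCreator@1.0.0",
            "with": {
                "title": f"Document_{i}",  # distinct base name per output file
                "output_file_type": output_file_type,
                "content": ref,            # placeholder file reference
            },
            "outputs": {"result": f"converted_{i}"},
        })
    return steps


for step in build_steps(["<file ref 1>", "<file ref 2>"]):
    print(step["id"], step["with"]["title"])
```

Giving each step its own `title` keeps the converted files from sharing the same base name in blob storage.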

Can I preserve formatting when converting between formats?

Formatting preservation depends on the capabilities of the target format. DOCX to PDF preserves most formatting. HTML to DOCX maintains structure and basic styling. Some advanced formatting may be lost in cross-format conversions because of format limitations.