Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.


name: pdf-to-word
description: Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.
user-invocable: true
allowed-tools: Bash, Read, Write, Glob, Grep
argument-hint: [path-to-pdf-or-folder]

PDF to Editable Word Document Converter

Convert PDF files into editable Word documents where text remains as real text (not drawings/images). Images and layout are preserved.

Requirements

Install dependencies if not already present:

pip3 install pymupdf==1.24.14 pdf2docx python-docx Pillow

Important: pdf2docx requires pymupdf < 1.25 for compatibility.
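
To confirm the version constraint before converting, a small check along these lines may help. It is a minimal sketch: the helper names is_supported_pymupdf and check_pymupdf are hypothetical, and the 1.25 bound is taken from the note above rather than from pdf2docx's own metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_pymupdf(version_string: str) -> bool:
    """Return True if the version is below 1.25 (the bound stated above)."""
    parts = version_string.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) < (1, 25)


def check_pymupdf() -> str:
    """Describe the installed pymupdf version, if any."""
    try:
        installed = version("pymupdf")
    except PackageNotFoundError:
        return "pymupdf is not installed"
    if is_supported_pymupdf(installed):
        return f"pymupdf {installed}: ok"
    return f"pymupdf {installed}: too new for pdf2docx"
```

Calling print(check_pymupdf()) reports whether the installed version falls under the bound.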

Conversion Steps

  1. Identify the input: Check if $ARGUMENTS is a single PDF file, a folder of PDFs, or a zip file containing PDFs.

  2. If zip file: Extract it first to a temporary directory.

  3. Analyze PDFs: Open each PDF with pymupdf and check for text extractability and image colorspaces.

  4. Apply the colorspace patch: Before running pdf2docx, patch ImagesExtractor.py to handle CMYK/ICCBased/None colorspace images. The patch adds PIL fallback to three methods:

    • _to_raw_dict() - wraps image.tobytes() in try/except, falls back to PIL conversion
    • _pixmap_to_cv_image() - same PIL fallback for opencv conversion
    • _recover_pixmap() - changes CMYK detection from string match to pix.n == 4, adds PIL handling for None colorspace

    See colorspace-patch.md for the exact patch code.

  5. Convert with pdf2docx:

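from pdf2docx import Converter

cv = Converter(pdf_path)  # pdf_path: the input PDF
try:
    cv.convert(output_path)  # output_path: the target .docx
finally:
    cv.close()  # release the file even if conversion fails

  6. Output: Save .docx files to a converted_word_docs/ directory next to the input.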
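
Steps 1, 2, and 6 can be sketched with the standard library alone. The helper names collect_pdfs and output_path_for are hypothetical, and "next to the input" is read here as a sibling of the input file or folder, which is one possible interpretation.

```python
import tempfile
import zipfile
from pathlib import Path


def _pdfs_under(folder: Path) -> list[Path]:
    """Find PDFs recursively, matching the extension case-insensitively."""
    return sorted(p for p in folder.rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")


def collect_pdfs(argument: str) -> list[Path]:
    """Return the PDFs named by a file, folder, or zip path."""
    source = Path(argument)
    if source.is_dir():
        return _pdfs_under(source)
    if source.is_file() and source.suffix.lower() == ".zip":
        # Extract to a temporary directory; the caller should remove it later.
        target = Path(tempfile.mkdtemp(prefix="pdf_to_word_"))
        with zipfile.ZipFile(source) as archive:
            archive.extractall(target)
        return _pdfs_under(target)
    if source.is_file() and source.suffix.lower() == ".pdf":
        return [source]
    return []


def output_path_for(pdf_path: Path, input_root: Path) -> Path:
    """Build the .docx path inside converted_word_docs/ beside the input."""
    out_dir = input_root.parent / "converted_word_docs"  # assumed location
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / (pdf_path.stem + ".docx")
```

Each path returned by collect_pdfs can then be passed to the Converter call from step 5, with output_path_for supplying the matching output path.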

Colorspace Patch Details

The patch is needed because many professionally designed PDFs use:

  • ICCBased CMYK profiles (e.g., "U.S. Web Coated (SWOP) v2") - pymupdf can't convert these to PNG directly
  • None colorspace with n=1 (grayscale images with broken metadata)
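
A rough way to spot such images during the analysis step is to tally the colorspace names pymupdf reports for each page. This is only a sketch: needs_patch and summarize_pdf are hypothetical names, the set of flagged names is an assumption based on the cases listed above, and the code assumes the colorspace string is the sixth field of each page.get_images(full=True) entry.

```python
from collections import Counter

# Assumed names for the colorspaces described above.
FLAGGED_COLORSPACES = {"DeviceCMYK", "ICCBased"}


def needs_patch(colorspace: str) -> bool:
    """Flag CMYK/ICCBased images and images with a missing colorspace."""
    if not colorspace or colorspace == "None":
        return True
    return colorspace in FLAGGED_COLORSPACES


def summarize_pdf(pdf_path: str) -> dict:
    """Report text extractability and image colorspaces for one PDF."""
    import fitz  # pymupdf; imported lazily so needs_patch works without it

    has_text = False
    colorspaces = Counter()
    with fitz.open(pdf_path) as doc:
        for page in doc:
            if page.get_text().strip():
                has_text = True
            for image in page.get_images(full=True):
                colorspaces[image[5]] += 1  # index 5: colorspace name
    flagged = sum(count for name, count in colorspaces.items()
                  if needs_patch(name))
    return {"has_text": has_text,
            "colorspaces": dict(colorspaces),
            "flagged_images": flagged}
```

A non-zero flagged_images count suggests the patch described in this section is worth applying before conversion.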

The fix: catch ValueError/RuntimeError from pixmap.tobytes() and fall back to PIL:

  • n=1 -> PIL "L" mode (grayscale)
  • n=3 -> PIL "RGB" mode
  • n=4 -> PIL "CMYK" mode, then convert to RGB
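
The mapping above can be illustrated as follows. This is a minimal sketch, not the patch itself (the exact code lives in colorspace-patch.md): pil_mode_for and pixmap_to_png_bytes are hypothetical names, and the fallback assumes a pixmap without an alpha channel whose samples are tightly packed.

```python
from io import BytesIO


def pil_mode_for(n: int) -> str:
    """Map a pixmap's component count to a PIL mode."""
    modes = {1: "L", 3: "RGB", 4: "CMYK"}
    if n not in modes:
        raise ValueError(f"unsupported component count: {n}")
    return modes[n]


def pixmap_to_png_bytes(pixmap) -> bytes:
    """Return PNG bytes, falling back to PIL when pymupdf raises."""
    try:
        return pixmap.tobytes()  # pymupdf writes PNG by default
    except (ValueError, RuntimeError):
        from PIL import Image  # imported only when the fallback is needed

        mode = pil_mode_for(pixmap.n)
        size = (pixmap.width, pixmap.height)
        image = Image.frombytes(mode, size, pixmap.samples)
        if mode == "CMYK":
            image = image.convert("RGB")  # PNG cannot store CMYK
        buffer = BytesIO()
        image.save(buffer, format="PNG")
        return buffer.getvalue()
```

Keeping the mode lookup in its own function makes the n-to-mode rule easy to check without a real pixmap.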

File Locations

The patch must be applied to:

{site-packages}/pdf2docx/image/ImagesExtractor.py

Find the path with:

python3 -c "import pdf2docx; print(pdf2docx.__file__)"
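
The command above prints the package's __init__.py, so the patch target sits in the image/ subfolder beside it. A small sketch for deriving the full path without importing pdf2docx (extractor_path_from and find_extractor are hypothetical helper names):

```python
import importlib.util
from pathlib import Path
from typing import Optional


def extractor_path_from(package_init: str) -> Path:
    """Derive the ImagesExtractor.py path from pdf2docx's __init__.py path."""
    return Path(package_init).parent / "image" / "ImagesExtractor.py"


def find_extractor() -> Optional[Path]:
    """Locate ImagesExtractor.py, or return None if pdf2docx is missing."""
    spec = importlib.util.find_spec("pdf2docx")
    if spec is None or spec.origin is None:
        return None
    return extractor_path_from(spec.origin)
```

Copying the file to a backup before editing it makes the patch easy to revert.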

Notes

  • pdf2docx tends to preserve layout well for designed or illustrated PDFs
  • File sizes can be large for image-heavy PDFs, because the original images are embedded
  • All text in the output is editable and selectable - ready for translation
  • The converter handles: text extraction, image preservation, CMYK conversion, heading detection, font styling