pdf-to-word
zuwiizu/pdf-to-word
SKILL.md
name: pdf-to-word
description: Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.
user-invocable: true
allowed-tools: Bash, Read, Write, Glob, Grep
argument-hint: [path-to-pdf-or-folder]
PDF to Editable Word Document Converter
Convert PDF files into editable Word documents where text remains as real text (not drawings/images). Images and layout are preserved.
Requirements
Install dependencies if not already present:
pip3 install pymupdf==1.24.14 pdf2docx python-docx Pillow
Important: pdf2docx requires pymupdf < 1.25 for compatibility.
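The version constraint can be checked before starting a conversion. The following is a minimal sketch, not part of the skill itself: the helper names are hypothetical, and it assumes the package is registered under the distribution name pymupdf (older releases may be listed as PyMuPDF).

```python
import re
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '1.24.14' into (1, 24, 14); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_compatible(version: str, upper_bound: tuple = (1, 25)) -> bool:
    """True when the version is strictly below the bound (pymupdf < 1.25)."""
    return parse_version(version) < upper_bound


def installed_pymupdf_version():
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version("pymupdf")  # assumed distribution name
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pymupdf_version()
    if found is None:
        print("pymupdf is not installed")
    elif not is_compatible(found):
        print(f"pymupdf {found} is too new for pdf2docx; install 1.24.14")
    else:
        print(f"pymupdf {found} is compatible")
```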
Conversion Steps
- Identify the input: Check whether $ARGUMENTS is a single PDF file, a folder of PDFs, or a zip file containing PDFs.
- If zip file: Extract it to a temporary directory first.
- Analyze PDFs: Open each PDF with pymupdf and check for text extractability and image colorspaces.
- Apply the colorspace patch: Before running pdf2docx, patch ImagesExtractor.py to handle CMYK/ICCBased/None colorspace images. The patch adds a PIL fallback to three methods:
  - _to_raw_dict(): wraps image.tobytes() in try/except and falls back to PIL conversion
  - _pixmap_to_cv_image(): same PIL fallback for the OpenCV conversion
  - _recover_pixmap(): changes CMYK detection from a string match to pix.n == 4 and adds PIL handling for None colorspace
  See colorspace-patch.md for the exact patch code.
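- Convert with pdf2docx:
from pdf2docx import Converter

cv = Converter(pdf_path)  # pdf_path: path to the source PDF
try:
    cv.convert(output_path)  # output_path: destination .docx file
finally:
    cv.close()  # release the PDF even if conversion fails
- Output: Save .docx files to a converted_word_docs/ directory next to the input.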
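The input handling, analysis, and output steps above could be sketched roughly as follows. The function names are hypothetical, and two choices are assumptions rather than something the skill specifies: folders are searched recursively, and "next to the input" is read as a sibling of the file, folder, or zip that was passed in. pymupdf is imported lazily so the path helpers can be used on their own.

```python
import tempfile
import zipfile
from pathlib import Path


def _is_pdf(path: Path) -> bool:
    return path.is_file() and path.suffix.lower() == ".pdf"


def collect_pdfs(argument: str) -> list:
    """Return the PDF paths named by a file, folder, or zip argument."""
    source = Path(argument)
    if source.is_dir():
        # Assumption: search subfolders too.
        return sorted(p for p in source.rglob("*") if _is_pdf(p))
    if source.suffix.lower() == ".zip":
        # Extract to a temporary directory first, then search it.
        extract_dir = Path(tempfile.mkdtemp(prefix="pdf_to_word_"))
        with zipfile.ZipFile(source) as archive:
            archive.extractall(extract_dir)
        return sorted(p for p in extract_dir.rglob("*") if _is_pdf(p))
    if _is_pdf(source):
        return [source]
    raise ValueError(f"not a PDF, folder, or zip file: {argument}")


def output_path_for(pdf_path: Path, argument: str) -> Path:
    """Build converted_word_docs/<name>.docx as a sibling of the input."""
    out_dir = Path(argument).resolve().parent / "converted_word_docs"
    return out_dir / (Path(pdf_path).stem + ".docx")


def analyze_pdf(pdf_path) -> dict:
    """Report whether text is extractable and which image colorspaces occur."""
    import fitz  # pymupdf; imported lazily

    has_text = False
    colorspaces = set()
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            if page.get_text().strip():
                has_text = True
            for image in page.get_images(full=True):
                # Index 5 of each image tuple is the colorspace name.
                colorspaces.add(image[5] or "None")
    return {"has_text": has_text, "colorspaces": sorted(colorspaces)}
```

A caller would create the output directory (for example with mkdir(parents=True, exist_ok=True) on the parent of the returned path) before passing each pair of paths to the converter.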
Colorspace Patch Details
The patch is needed because many professionally designed PDFs use:
- ICCBased CMYK profiles (e.g., "U.S. Web Coated (SWOP) v2") - pymupdf can't convert these to PNG directly
- None colorspace with n=1 (grayscale images with broken metadata)
The fix: catch ValueError/RuntimeError from pixmap.tobytes() and fall back to PIL:
- n=1 -> PIL "L" mode (grayscale)
- n=3 -> PIL "RGB" mode
- n=4 -> PIL "CMYK" mode, then convert to RGB
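The fallback described above can be illustrated with a standalone sketch. This is not the patch itself (see colorspace-patch.md for that); the function names are hypothetical, and it assumes the pixmap has no alpha channel, so that n=4 really means CMYK.

```python
import io

# Channel count -> PIL mode, following the mapping listed above.
PIL_MODES = {1: "L", 3: "RGB", 4: "CMYK"}


def pil_mode_for(n: int) -> str:
    """Pick the PIL mode for a pixmap with n channels."""
    try:
        return PIL_MODES[n]
    except KeyError:
        raise ValueError(f"unsupported channel count: {n}") from None


def samples_to_png(samples: bytes, width: int, height: int, n: int) -> bytes:
    """Rebuild the image from raw samples with PIL and encode it as PNG."""
    from PIL import Image  # imported lazily; requires Pillow

    image = Image.frombytes(pil_mode_for(n), (width, height), samples)
    if image.mode == "CMYK":
        image = image.convert("RGB")  # PNG cannot store CMYK
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def pixmap_to_png(pix) -> bytes:
    """Try pymupdf's own PNG encoder first, then fall back to PIL."""
    try:
        return pix.tobytes()  # pymupdf encodes to PNG by default
    except (ValueError, RuntimeError):
        return samples_to_png(pix.samples, pix.width, pix.height, pix.n)
```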
File Locations
The patch must be applied to:
{site-packages}/pdf2docx/image/ImagesExtractor.py
Find the path with:
python3 -c "import pdf2docx; print(pdf2docx.__file__)"
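The same lookup can be done from a script without importing pdf2docx. A minimal sketch with hypothetical helper names, assuming the file layout given above:

```python
import importlib.util
from pathlib import Path


def extractor_path(package_init: str) -> Path:
    """Map .../pdf2docx/__init__.py to .../pdf2docx/image/ImagesExtractor.py."""
    return Path(package_init).parent / "image" / "ImagesExtractor.py"


def find_extractor():
    """Return the path of the file to patch, or None if it cannot be found."""
    spec = importlib.util.find_spec("pdf2docx")
    if spec is None or spec.origin is None:
        return None
    path = extractor_path(spec.origin)
    return path if path.is_file() else None


if __name__ == "__main__":
    print(find_extractor() or "pdf2docx is not installed")
```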
Notes
- pdf2docx produces the best layout-preserving results for designed/illustrated PDFs
- File sizes will be large for image-heavy PDFs (the original images are embedded)
- All text in the output is editable and selectable, ready for translation, provided the source PDF contains extractable text (see the analysis step)
- The converter handles: text extraction, image preservation, CMYK conversion, heading detection, font styling