pdf-to-word

zuwiizu/pdf-to-word

Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.


SKILL.md

name: pdf-to-word
description: Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.
user-invocable: true
allowed-tools: Bash, Read, Write, Glob, Grep
argument-hint: [path-to-pdf-or-folder]

PDF to Editable Word Document Converter

Convert PDF files into editable Word documents where text remains as real text (not drawings/images). Images and layout are preserved.

Requirements

Install dependencies if not already present:

pip3 install pymupdf==1.24.14 pdf2docx python-docx Pillow

Important: pdf2docx requires pymupdf < 1.25 for compatibility.
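
Before converting, it can help to confirm that the pin actually took effect. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `pymupdf_is_compatible` are hypothetical, and the lookup assumes the distribution is registered under the name `pymupdf` (name matching in `importlib.metadata` is normally case-insensitive).

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '1.24.14' into (1, 24, 14), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def pymupdf_is_compatible(installed: str, upper_bound: str = "1.25") -> bool:
    """True when the installed version is strictly below the upper bound."""
    return version_tuple(installed) < version_tuple(upper_bound)


if __name__ == "__main__":
    try:
        installed = metadata.version("pymupdf")
    except metadata.PackageNotFoundError:
        print("pymupdf is not installed")
    else:
        ok = pymupdf_is_compatible(installed)
        print(f"pymupdf {installed}: {'ok' if ok else 'too new for pdf2docx'}")
```

The comparison is done on integer tuples rather than strings so that `1.24.14` sorts above `1.24.9`.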

Conversion Steps

Identify the input: Check if $ARGUMENTS is a single PDF file, a folder of PDFs, or a zip file containing PDFs.
If zip file: Extract it first to a temporary directory.
Analyze PDFs: Open each PDF with pymupdf and check for text extractability and image colorspaces.
Apply the colorspace patch: Before running pdf2docx, patch ImagesExtractor.py to handle CMYK/ICCBased/None colorspace images. The patch adds PIL fallback to three methods:
- _to_raw_dict() - wraps image.tobytes() in try/except, falls back to PIL conversion
- _pixmap_to_cv_image() - same PIL fallback for opencv conversion
- _recover_pixmap() - changes CMYK detection from string match to pix.n == 4, adds PIL handling for None colorspace
See colorspace-patch.md for the exact patch code.
Convert with pdf2docx:

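from pdf2docx import Converter

# pdf_path is the input PDF; output_path is the target .docx file
cv = Converter(pdf_path)
try:
    cv.convert(output_path)
finally:
    cv.close()  # release the PDF even if conversion fails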

Output: Save .docx files to a converted_word_docs/ directory next to the input.
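
The input-handling and output steps above (single PDF, folder, or zip in; `converted_word_docs/` out) might be sketched as follows. This is an illustration under stated assumptions, not the skill's actual code. The function names `collect_pdfs` and `output_path_for` are hypothetical, and "next to the input" is interpreted here as "in the parent directory of the file or folder that was passed in".

```python
import tempfile
import zipfile
from pathlib import Path


def collect_pdfs(argument: str) -> list:
    """Return the PDF paths named by a file, folder, or zip argument."""
    source = Path(argument)
    if source.is_dir():
        # Match the extension case-insensitively so .PDF files are included.
        return sorted(
            p for p in source.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if source.is_file() and source.suffix.lower() == ".zip":
        # Extract to a temporary directory; the caller removes it when done.
        extract_dir = Path(tempfile.mkdtemp(prefix="pdf_to_word_"))
        with zipfile.ZipFile(source) as archive:
            archive.extractall(extract_dir)
        return collect_pdfs(str(extract_dir))
    if source.is_file() and source.suffix.lower() == ".pdf":
        return [source]
    raise ValueError(f"not a PDF, folder, or zip: {argument}")


def output_path_for(pdf_path: Path, original_input: Path) -> Path:
    """Place the .docx in converted_word_docs/ beside the original input."""
    out_dir = original_input.parent / "converted_word_docs"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / (pdf_path.stem + ".docx")
```

Passing the original argument (rather than the extracted temporary path) to `output_path_for` keeps the results beside the zip file instead of inside the temporary directory.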

Colorspace Patch Details

The patch is needed because many professionally designed PDFs use:

ICCBased CMYK profiles (e.g., "U.S. Web Coated (SWOP) v2") - pymupdf can't convert these to PNG directly
None colorspace with n=1 (grayscale images with broken metadata)
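
To find out whether a given PDF contains such images (the analysis step of the conversion), a scan along the following lines may be enough. This is a rough sketch: `classify_image` and `scan_pdf` are hypothetical names, the labels are made up for illustration, and it assumes PyMuPDF can still open each image as a `Pixmap` even when later PNG encoding would fail.

```python
def classify_image(colorspace_name, channels, has_alpha=False):
    """Label an image by the colorspace issues described above."""
    color_channels = channels - (1 if has_alpha else 0)
    if colorspace_name is None:
        return "missing-colorspace"
    if color_channels == 4:
        return "cmyk"
    return "ok"


def scan_pdf(pdf_path):
    """Report text extractability and image colorspace labels per page."""
    import fitz  # PyMuPDF; imported here so classify_image works without it

    report = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            labels = []
            for image in page.get_images(full=True):
                pix = fitz.Pixmap(doc, image[0])  # image[0] is the xref
                name = pix.colorspace.name if pix.colorspace else None
                labels.append(classify_image(name, pix.n, bool(pix.alpha)))
            has_text = bool(page.get_text().strip())
            report.append(
                {"page": page.number, "has_text": has_text, "images": labels}
            )
    return report
```

A page whose `has_text` is false is probably scanned and would need OCR before the output text can be edited; any label other than `"ok"` suggests the patch is needed.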

The fix: catch ValueError/RuntimeError from pixmap.tobytes() and fall back to PIL:

n=1 -> PIL "L" mode (grayscale)
n=3 -> PIL "RGB" mode
n=4 -> PIL "CMYK" mode, then convert to RGB
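
The fallback described above can be sketched as a standalone helper. This is not the patch itself (see colorspace-patch.md for that). It is a simplified illustration in which `pil_mode_for`, `samples_to_png`, and `pixmap_to_png` are hypothetical names, and the pixmap is assumed to have no alpha channel, so that n=4 really means CMYK.

```python
import io

PIL_MODES = {1: "L", 3: "RGB", 4: "CMYK"}


def pil_mode_for(channels: int) -> str:
    """Map a pixmap channel count to the PIL mode listed above."""
    try:
        return PIL_MODES[channels]
    except KeyError:
        raise ValueError(f"unsupported channel count: {channels}") from None


def samples_to_png(samples: bytes, width: int, height: int, channels: int) -> bytes:
    """Build PNG bytes from raw interleaved samples via PIL."""
    from PIL import Image  # Pillow, listed in the requirements

    image = Image.frombytes(pil_mode_for(channels), (width, height), samples)
    if image.mode == "CMYK":
        image = image.convert("RGB")  # PNG cannot store CMYK
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def pixmap_to_png(pixmap) -> bytes:
    """Try PyMuPDF first; fall back to PIL when it rejects the colorspace."""
    try:
        return pixmap.tobytes()  # PNG is the default output format
    except (ValueError, RuntimeError):
        return samples_to_png(
            pixmap.samples, pixmap.width, pixmap.height, pixmap.n
        )
```

Only the two exception types named above are caught, so unrelated failures still surface instead of being silently converted.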

File Locations

The patch must be applied to:

{site-packages}/pdf2docx/image/ImagesExtractor.py

Find the path with:

python3 -c "import pdf2docx; print(pdf2docx.__file__)"
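
The one-liner prints the package's `__init__.py`; the file to patch lives relative to that directory. A small sketch that resolves the full path without importing the package is shown below. The helper name `file_inside_package` is hypothetical.

```python
from importlib import util
from pathlib import Path


def file_inside_package(package: str, relative_path: str) -> Path:
    """Resolve a file inside an installed package without importing it."""
    spec = util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package not installed: {package}")
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent / relative_path
```

Calling `file_inside_package("pdf2docx", "image/ImagesExtractor.py")` should return the path given above; checking `.exists()` on the result before patching guards against a changed package layout.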

Notes

pdf2docx produces the best layout-preserving results for designed/illustrated PDFs
File sizes will be large for image-heavy PDFs (the original images are embedded)
All text in the output is editable and selectable - ready for translation
The converter handles: text extraction, image preservation, CMYK conversion, heading detection, font styling

Installation

Option 1: Use slash command in Claude Code

/install-skill https://github.com/zuwiizu/pdf-to-word

Option 2: Clone to skills directory

# Global (all projects)

git clone https://github.com/zuwiizu/pdf-to-word ~/.claude/skills/pdf-to-word

# Project-specific

git clone https://github.com/zuwiizu/pdf-to-word .claude/skills/pdf-to-word

For Cursor, add the MCP server to .cursor/mcp.json:

{
  "mcpServers": {
    "skillz": {
      "command": "npx",
      "args": ["-y", "skillz-mcp", "https://github.com/zuwiizu/pdf-to-word"]
    }
  }
}

Restart Cursor after adding the configuration.

For Gemini CLI, Option 1: Use the extensions install command

gemini extensions install https://github.com/zuwiizu/pdf-to-word

Option 2: Clone to extensions directory

git clone https://github.com/zuwiizu/pdf-to-word ~/.gemini/extensions/pdf-to-word
