Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.

SKILL.md


name: doc-cleaner
description: Convert PDF, DOCX, XLSX, and text files to clean, structured Markdown. CJK-friendly, table-friendly, privacy-first.
version: 1.0.0
metadata: {"openclaw":{"emoji":"📄","homepage":"https://github.com/notoriouslab/doc-cleaner","requires":{"bins":["python3"]}}}

doc-cleaner

Convert documents (PDF, DOCX, XLSX, TXT) to clean, structured Markdown.

When to use

  • User asks to convert a document to Markdown
  • User wants to extract text or tables from PDF/DOCX/XLSX files
  • User wants to clean up bank statements or financial documents
  • User asks to process a batch of documents in a directory

Commands

Convert a single file (no AI, fastest)

python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none

Convert a single file with Gemini structuring

python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai gemini

Convert a single file with Groq structuring

python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai groq

Convert all files in a directory

python3 {baseDir}/cleaner.py --input "{{directory}}" --ai none --output-dir "{{output_dir}}"

Preview without writing (dry run)

python3 {baseDir}/cleaner.py --input "{{file_path}}" --dry-run --verbose

Get machine-readable result summary

python3 {baseDir}/cleaner.py --input "{{file_path}}" --ai none --summary

The --summary flag prints a JSON summary to stdout after processing:

{"version":"1.0.0","total":3,"success":2,"failed":1,"files":[{"file":"report.pdf","output":"./output/report.md","status":"ok"},{"file":"scan.pdf","output":null,"status":"no_content"},{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}
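A caller can parse this summary to find out which files need attention. The following is a minimal Python sketch, assuming the summary has exactly the shape shown above and that stdout contains nothing but the JSON. `failed_files` is a hypothetical helper name, not part of doc-cleaner, and treating every status other than "ok" as a failure is an assumption based on the sample.

```python
import json

# Sample copied from the summary shown above.
SAMPLE = (
    '{"version":"1.0.0","total":3,"success":2,"failed":1,"files":['
    '{"file":"report.pdf","output":"./output/report.md","status":"ok"},'
    '{"file":"scan.pdf","output":null,"status":"no_content"},'
    '{"file":"data.xlsx","output":"./output/data.md","status":"ok"}]}'
)


def failed_files(summary_text):
    """Return (file, status) pairs for every entry whose status is not "ok"."""
    summary = json.loads(summary_text)
    return [
        (entry["file"], entry["status"])
        for entry in summary["files"]
        if entry["status"] != "ok"  # assumption: "ok" is the only success status
    ]


if __name__ == "__main__":
    for name, status in failed_files(SAMPLE):
        print(f"{name}: {status}")
```

Run against the sample, this reports scan.pdf with status no_content, which matches the "failed": 1 count in the summary.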

Options

| Flag | Description |
| --- | --- |
| --input, -i | File or directory to process (required, non-recursive) |
| --output-dir, -o | Output directory (default: ./output) |
| --ai | AI backend: gemini, groq, ollama, or none (default: the config value, otherwise gemini) |
| --password | PDF decryption password |
| --config | Path to config JSON |
| --summary | Print JSON summary to stdout after processing |
| --dry-run | Preview without writing files |
| --verbose | Enable debug logging |
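When the tool is driven from a script, the flags above can be assembled into an argument list instead of a shell string. This is a rough sketch only: `build_command` and its parameter names are hypothetical, and it uses only the flags documented in this section.

```python
def build_command(base_dir, input_path, ai="none", output_dir=None,
                  password=None, config=None, summary=False,
                  dry_run=False, verbose=False):
    """Assemble an argument list for cleaner.py from the documented flags."""
    if ai not in ("gemini", "groq", "ollama", "none"):
        raise ValueError(f"unsupported --ai value: {ai!r}")

    cmd = ["python3", f"{base_dir}/cleaner.py",
           "--input", str(input_path), "--ai", ai]

    # Optional flags that take a value.
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if password is not None:
        cmd += ["--password", password]
    if config is not None:
        cmd += ["--config", str(config)]

    # Boolean switches.
    if summary:
        cmd.append("--summary")
    if dry_run:
        cmd.append("--dry-run")
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    print(build_command("/path/to/doc-cleaner", "statements/", summary=True))
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems with file names that contain spaces or CJK characters.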

Supported formats

PDF (native, scanned, encrypted), DOCX, XLSX, XLS, CSV, TXT, MD

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All files processed successfully |
| 1 | Some files failed (partial success) |
| 2 | No processable files found or config error |
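A wrapper script can branch on these exit codes. Here is a small Python sketch; `describe_exit_code` and `run_cleaner` are hypothetical helper names, and the fallback message for codes outside 0 to 2 is an assumption, since this section documents only those three.

```python
import subprocess

# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "All files processed successfully",
    1: "Some files failed (partial success)",
    2: "No processable files found or config error",
}


def describe_exit_code(code):
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_cleaner(cmd):
    """Run a prepared argument list and return (exit_code, meaning)."""
    result = subprocess.run(cmd)
    return result.returncode, describe_exit_code(result.returncode)
```

Because code 1 signals partial success, a batch caller may want to continue and inspect the per-file results rather than abort.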

Notes

  • Output defaults to ./output/ relative to current directory
  • For scanned PDFs, AI mode (gemini, groq, or ollama) gives much better results than --ai none
  • --ai none requires zero API keys and zero network access
  • CJK encodings (Big5, CP950, UTF-16) are auto-detected
  • Tables in DOCX and XLSX are preserved as Markdown pipe tables
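For readers unfamiliar with the pipe table format mentioned in the last note, the sketch below shows what such a table looks like when built from rows of cells. It is illustrative only: `to_pipe_table` is a hypothetical function, the sample row is made up, and doc-cleaner's actual rendering may differ.

```python
def to_pipe_table(header, rows):
    """Render a header and data rows as a Markdown pipe table."""
    def cell(value):
        # Escape pipes and flatten newlines so a cell cannot break its row.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(c) for c in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample row with CJK headers.
    print(to_pipe_table(["日期", "金額"], [["2024-01-05", "1,200"]]))
```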