pdf-processing

hemanth/agentu

Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

10 stars
1 forks
4 views

SKILL.md


name: pdf-processing description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

PDF Processing Skill

Quick Start

Use pdfplumber to extract text from PDFs:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Capabilities

1. Text Extraction

Extract all text from a PDF document:

def extract_all_text(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""
        return full_text

2. Table Extraction

Extract tables from PDF pages. For detailed table extraction, see [[TABLES.md]].

Basic example:

with pdfplumber.open("document.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    for table in tables:
        print(table)

3. Form Filling

Fill PDF forms programmatically. For comprehensive form-filling guide, see [[FORMS.md]].

Best Practices

  • Performance: For large PDFs, process page-by-page to avoid memory issues
  • OCR: For scanned PDFs without text layer, recommend using OCR tools first
  • Encoding: Handle UTF-8 encoding properly when extracting text

Common Use Cases

  • Invoice text extraction
  • Table data scraping from reports
  • PDF form automation
  • Document merging and splitting