nemo-curator
GPU-accelerated data curation for LLM training. Supports text, image, video, and audio data. Features include fuzzy deduplication (reported as 16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it to prepare high-quality training datasets, clean web data, or deduplicate large corpora.
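To make the fuzzy deduplication feature concrete, here is a minimal CPU-only sketch of MinHash, the technique that fuzzy deduplication is commonly built on. It is not NeMo Curator's API: it uses only the Python standard library, and every function name, the shingle size, the number of hashes, and the 0.7 threshold are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=64):
    """Build a MinHash signature: the minimum hash value per salted hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        hashes = (
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        )
        signature.append(min(hashes, default=0))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.7):
    """Keep the first document of each near-duplicate group (quadratic, for illustration only)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog!",  # same words, different case and punctuation
        "GPU accelerated pipelines scale data curation across many nodes",
    ]
    for doc in deduplicate(corpus):
        print(doc)
```

The pairwise comparison here is quadratic in the number of documents. Production systems usually avoid that by bucketing signatures with locality-sensitive hashing and running the work on GPUs, which is the kind of scaling the description above refers to.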
$ Install
git clone https://github.com/zechenzhangAGI/AI-research-SKILLs /tmp/AI-research-SKILLs && cp -r /tmp/AI-research-SKILLs/05-data-processing/nemo-curator ~/.claude/skills/AI-research-SKILLs
// tip: Run this command in your terminal to install the skill
Repository: zechenzhangAGI/AI-research-SKILLs/05-data-processing/nemo-curator
Author: zechenzhangAGI
Stars: 62
Forks: 2
Updated 6d ago
Added 6d ago