---
name: training-hub
description: Fine-tune LLMs using Red Hat training-hub library with SFT, LoRA, and OSFT algorithms. Use when preparing JSONL datasets, running training jobs, configuring hardware, scaling to clusters, evaluating models, or deploying with vLLM.
---
# Training Hub

Red Hat's unified library for LLM post-training: SFT, LoRA, and OSFT (continual learning).
## Quick Reference

| Task | Command |
|---|---|
| Recommend config | `python scripts/recommend_config.py --model <model> --hardware <hw>` |
| Estimate memory | `python scripts/estimate_memory.py --model <model> --method sft --hardware h100` |
| Validate dataset | `python scripts/validate_dataset.py data.jsonl` |
| Full fine-tuning | `from training_hub import sft` |
| LoRA training | `from training_hub import lora_sft` |
| OSFT (continual) | `from training_hub import osft` |
## Installation

```bash
pip install training-hub                                # Basic
pip install "training-hub[lora]"                        # LoRA with Unsloth (up to 2x faster)
pip install "training-hub[cuda]" --no-build-isolation   # CUDA support
```

(The extras are quoted so that shells like zsh don't interpret the brackets as glob patterns.)
## Get Started Fast

```bash
# Get an optimal config for your model and hardware
python scripts/recommend_config.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --hardware rtx-5090
```
## Data Format

Training data must be JSONL: one example per line, each with a `messages` list of role/content turns:

```json
{"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]}
```

Validate before training:

```bash
python scripts/validate_dataset.py ./training_data.jsonl
```
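As a rough illustration of the kind of check the validator performs (assumed behavior: per-line JSON with a non-empty `messages` list of role/content dicts; this is a stdlib sketch, not the script's actual code):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_line(line: str) -> bool:
    """Return True if one JSONL line matches the expected message structure."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )

good = '{"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "A"}]}'
bad = '{"messages": [{"role": "user"}]}'   # missing "content" -> rejected
```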
For data preparation details, see DATA-FORMATS.md.
## Training Methods

### Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning. Requires significant VRAM.
```python
from training_hub import sft

result = sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./checkpoints",
    num_epochs=3,
    effective_batch_size=8,
    learning_rate=2e-5,
    max_seq_len=2048,
    max_tokens_per_gpu=45000,
)
```
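`effective_batch_size` is the global number of samples per optimizer step. How training-hub decomposes it across devices is internal to the library, but the usual relationship is worth keeping in mind (the function and parameter names below are illustrative, not library API):

```python
def grad_accum_steps(effective_batch_size: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps so that
    per_device_batch * num_gpus * steps == effective_batch_size."""
    global_per_step = per_device_batch * num_gpus
    if effective_batch_size % global_per_step:
        raise ValueError("effective_batch_size must be divisible by per_device_batch * num_gpus")
    return effective_batch_size // global_per_step

# e.g. effective batch 8 on one GPU with micro-batch 2 -> 4 accumulation steps
```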
### LoRA Fine-Tuning

Memory-efficient adaptation: up to 2x faster with roughly 70% less VRAM than full SFT.
```python
from training_hub import lora_sft

result = lora_sft(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    data_path="./training_data.jsonl",
    ckpt_output_dir="./outputs",
    lora_r=16,
    lora_alpha=32,
    num_epochs=3,
    learning_rate=2e-4,
)
```
**QLoRA (4-bit):** Add `load_in_4bit=True` to fit large models on limited VRAM.
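The VRAM savings come from training only small low-rank adapter matrices while the base weights stay frozen. A back-of-the-envelope count for a single projection matrix (an illustrative formula, not training-hub code):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds adapters B (d_out x r) and A (r x d_in) next to a frozen
    d_out x d_in weight matrix; only the adapters get gradients."""
    return r * (d_in + d_out)

full = 4096 * 4096                               # one attention projection in a ~7B model
adapter = lora_trainable_params(4096, 4096, 16)  # lora_r=16 as in the example above
ratio = adapter / full                           # well under 1% of the matrix's parameters
```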
### OSFT (Continual Learning)

Adapt a model to new data without catastrophic forgetting of its existing capabilities:
```python
from training_hub import osft

result = osft(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./domain_data.jsonl",
    ckpt_output_dir="./checkpoints",
    unfreeze_rank_ratio=0.25,
    effective_batch_size=16,
    learning_rate=2e-5,
)
```
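`unfreeze_rank_ratio` controls how much of each weight matrix is opened up for updates. Assuming it denotes the fraction of singular directions left trainable (a simplified reading of OSFT, not the library's exact mechanics), the effect scales like:

```python
def unfrozen_rank(d_in: int, d_out: int, ratio: float) -> int:
    """Number of singular directions left trainable when a fraction `ratio`
    of a d_out x d_in matrix's full rank (min of its dimensions) is unfrozen."""
    return int(min(d_in, d_out) * ratio)

# ratio 0.25 on a 4096x4096 matrix leaves 1024 directions trainable
```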
For all parameters, see ALGORITHMS.md.
## Hardware Support
| Hardware | VRAM | Best For |
|---|---|---|
| RTX 5090 | 32GB | 8B LoRA, 70B QLoRA |
| DGX Spark | 128GB | 70B SFT |
| H100 | 80GB | 14B SFT, 70B LoRA |
| 8×H100 | 640GB | 70B SFT |
```bash
# Check whether your config fits
python scripts/estimate_memory.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --method lora \
  --hardware h100 \
  --num-gpus 8
```
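For intuition behind these numbers (a standard textbook estimate, not necessarily what `estimate_memory.py` computes): full fine-tuning with Adam in bf16 costs about 12 bytes per parameter before activations, which is why full SFT of 7-14B models targets H100-class GPUs.

```python
def sft_memory_gb(params_billions: float) -> float:
    """Rough VRAM for full fine-tuning with Adam, excluding activations and buffers:
    2 B (bf16 weights) + 2 B (bf16 grads) + 8 B (two fp32 Adam moments) per parameter."""
    bytes_per_param = 2 + 2 + 8
    return params_billions * bytes_per_param  # 1e9 params * 1 byte ~= 1 GB

# A 7B model needs roughly 84 GB before activations; a 70B model roughly 840 GB.
```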
For hardware-specific configs, see HARDWARE.md.
## Scaling

**Multi-GPU:**

```python
result = sft(..., nproc_per_node=8)
```

**Multi-node:**

```python
result = sft(..., nnodes=2, node_rank=0, nproc_per_node=8, rdzv_endpoint="0.0.0.0:29500")
```
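Every node runs the same call with its own `node_rank` and the shared `rdzv_endpoint`; global ranks are then assigned in contiguous per-node blocks (the standard torchrun convention, assumed to carry over here):

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of one worker: node 0 holds ranks 0..nproc-1,
    node 1 holds the next block, and so on."""
    return node_rank * nproc_per_node + local_rank

# 2 nodes x 8 GPUs: GPU 3 on node 1 is global rank 11 of 16
```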
For Slurm, Kubernetes, and datacenter deployments, see SCALE.md.
## Algorithm Selection
| Scenario | Method |
|---|---|
| First-time fine-tuning, large dataset | SFT |
| Memory constrained | LoRA |
| Very large model (70B+), limited VRAM | LoRA + QLoRA |
| Preserve existing capabilities | OSFT |
| Domain adaptation, small dataset | OSFT |
## Documentation
| Topic | File |
|---|---|
| Hardware profiles & configs | HARDWARE.md |
| All algorithm parameters | ALGORITHMS.md |
| Data formats & conversion | DATA-FORMATS.md |
| Datacenter & cluster setup | SCALE.md |
| Model evaluation | EVALUATION.md |
| vLLM inference & serving | INFERENCE.md |
| Advanced techniques | ADVANCED.md |
| Model-specific configs | MODELS.md |
| Troubleshooting | TROUBLESHOOTING.md |
| Distributed training | DISTRIBUTED.md |
## Utility Scripts

| Script | Purpose |
|---|---|
| `recommend_config.py` | Generate optimal config for model + hardware |
| `estimate_memory.py` | Estimate GPU memory requirements |
| `validate_dataset.py` | Validate JSONL dataset format |
| `convert_to_jsonl.py` | Convert CSV, Alpaca, and ShareGPT formats to JSONL |
## Troubleshooting

- **CUDA OOM:** Reduce `max_tokens_per_gpu`, switch to LoRA or QLoRA, or add GPUs.
- **Dataset errors:** Run `python scripts/validate_dataset.py` first.
- **LoRA multi-GPU:** Requires `torchrun --nproc-per-node=N script.py`.
- **Training diverges:** Lower `learning_rate` (try 1e-5 for SFT, 1e-4 for LoRA).
For more, see TROUBLESHOOTING.md.