LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graphs, tensor parallelism, OpenAI-compatible serving
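As a rough illustration of the first feature in the list, here is a minimal sketch of the paged-KV-cache idea: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, allocated on demand and returned to the pool when the sequence finishes. All names here (`BlockAllocator`, `Sequence`, `block_size`) are assumptions for illustration, not this repo's actual API.

```python
# Illustrative sketch of a paged KV cache; not this repo's implementation.

class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the current one is full,
        # so memory grows in block-size increments instead of being
        # pre-reserved for the maximum sequence length.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def release(self, allocator: BlockAllocator) -> None:
        # Finished sequences return their blocks to the pool, which is what
        # lets continuous batching admit new requests immediately.
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = Sequence(block_size=4)
    for _ in range(10):       # 10 tokens -> ceil(10 / 4) = 3 blocks
        seq.append_token(allocator)
    print(seq.block_table)    # e.g. [7, 6, 5]
    seq.release(allocator)
```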
