mini-infer

psmarter/mini-infer

LLM inference engine built from scratch: paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graphs, tensor parallelism, and OpenAI-compatible serving

262 stars

17 forks

Python

156 views

Installation

Option 1: Use slash command in Claude Code

/install-skill https://github.com/psmarter/mini-infer

Option 2: Clone to skills directory

# Global (all projects)

git clone https://github.com/psmarter/mini-infer ~/.claude/skills/mini-infer

# Project-specific

git clone https://github.com/psmarter/mini-infer .claude/skills/mini-infer

For Cursor, add the MCP server to .cursor/mcp.json:

{
  "mcpServers": {
    "skillz": {
      "command": "npx",
      "args": ["-y", "skillz-mcp", "https://github.com/psmarter/mini-infer"]
    }
  }
}

Restart Cursor after adding the configuration.

For Gemini CLI, Option 1: Use the Gemini CLI command

gemini extensions install https://github.com/psmarter/mini-infer

Option 2: Clone to extensions directory

git clone https://github.com/psmarter/mini-infer ~/.gemini/extensions/mini-infer

Topics

continuous-batching cuda inference inference-engine kv-cache language-model llm machine-learning moe pagedattention pytorch quantization speculative-decoding tensor-parallelism transformer triton
