---
name: modal-deployment
description: Run Python code in the cloud with serverless containers, GPUs, and autoscaling. Use when deploying ML models, running batch jobs, scheduling tasks, serving APIs with GPU acceleration, or scaling compute-intensive workloads. Triggers on requests for serverless GPU infrastructure, LLM inference, model training/fine-tuning, parallel data processing, cron jobs in the cloud, or deploying Python web endpoints.
---

Modal

Modal is a serverless platform for running Python in the cloud with zero configuration. Define everything in code—no YAML, Docker, or Kubernetes required.

Quick Start

import modal

app = modal.App("my-app")

@app.function()
def hello():
    return "Hello from Modal!"

@app.local_entrypoint()
def main():
    print(hello.remote())

Run: modal run app.py
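
For debugging, the same function can also be run in-process: `.local()` executes the body on your machine, while `.remote()` sends it to a container. A minimal sketch of both call styles:

```python
import modal

app = modal.App("my-app")

@app.function()
def hello():
    return "Hello from Modal!"

@app.local_entrypoint()
def main():
    print(hello.local())   # runs in this process, no container round-trip
    print(hello.remote())  # runs in a Modal container
```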

Core Concepts

Functions

Decorate Python functions to run remotely:

@app.function(gpu="H100", memory=32768, timeout=600)
def train_model(data):
    # Runs on H100 GPU with 32GB RAM, 10min timeout
    return model.fit(data)

Images

Define container environments via method chaining:

image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("ffmpeg", "libsndfile1")
    .uv_pip_install("torch", "transformers", "numpy")
    .env({"CUDA_VISIBLE_DEVICES": "0"})
)

app = modal.App("ml-app", image=image)

Key image methods:

  • .debian_slim() / .micromamba() - Base images
  • .uv_pip_install() / .pip_install() - Python packages
  • .apt_install() - System packages
  • .run_commands() - Shell commands
  • .add_local_python_source() - Local modules
  • .env() - Environment variables

GPUs

Attach GPUs with a single parameter:

@app.function(gpu="H100")      # Single H100
@app.function(gpu="A100-80GB") # 80GB A100
@app.function(gpu="H100:4")    # 4x H100
@app.function(gpu=["H100", "A100-40GB:2"])  # Fallback options

Available: B200, H200, H100, A100-80GB, A100-40GB, L40S, L4, A10G, T4

Classes with Lifecycle Hooks

Load models once at container startup:

@app.cls(gpu="L40S")
class Model:
    @modal.enter()
    def load(self):
        self.model = load_pretrained("model-name")
    
    @modal.method()
    def predict(self, x):
        return self.model(x)

# Usage
Model().predict.remote(data)

Web Endpoints

Deploy APIs instantly:

@app.function()
@modal.fastapi_endpoint()
def api(text: str):
    return {"result": process(text)}

# For complex apps
@app.function()
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI
    web = FastAPI()
    
    @web.get("/health")
    def health():
        return {"status": "ok"}
    
    return web

Volumes (Persistent Storage)

volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_file(content: str):
    with open("/data/output.txt", "w") as f:
        f.write(content)
    volume.commit()  # Persist changes

Secrets

@app.function(secrets=[modal.Secret.from_name("my-api-key")])
def call_api():
    import os
    key = os.environ["API_KEY"]

Create secrets: Dashboard or modal secret create my-secret KEY=value
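For quick local experiments, a secret can also be built inline from a dict instead of being looked up by name. A sketch (the key and value below are placeholders, not real credentials; prefer named secrets in production):

```python
import modal

app = modal.App("secret-demo")

# Inline secret for experiments only; the dict values become env vars.
inline_secret = modal.Secret.from_dict({"API_KEY": "dummy-value"})

@app.function(secrets=[inline_secret])
def call_api():
    import os
    return os.environ["API_KEY"]
```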

Dicts (Distributed Key-Value Store)

cache = modal.Dict.from_name("my-cache", create_if_missing=True)

@app.function()
def cached_compute(key: str):
    if key in cache:
        return cache[key]
    result = expensive_computation(key)
    cache[key] = result
    return result

Queues (Distributed FIFO)

task_queue = modal.Queue.from_name("task-queue", create_if_missing=True)

@app.function()
def producer():
    task_queue.put_many([{"task": i} for i in range(10)])

@app.function()
def consumer():
    import queue  # stdlib; Modal raises queue.Empty when the timeout expires
    while True:
        try:
            task = task_queue.get(timeout=60)
        except queue.Empty:
            break  # no work for 60s; let the container exit
        process(task)

Parallel Processing

# Map over inputs (auto-parallelized)
results = list(process.map(items))

# Spawn async jobs
calls = [process.spawn(item) for item in items]
results = [call.get() for call in calls]

# Batch processing (up to 1M inputs)
process.spawn_map(range(100_000))
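
When fanning out over many inputs, a single failing input would otherwise abort the whole map. `map` accepts `return_exceptions` to hand failures back as values instead. A sketch, assuming a decorated Modal function named `process` and a local `handle` helper:

```python
# Failed inputs come back as exception objects instead of raising,
# so one bad item does not abort the whole fan-out.
results = process.map(items, return_exceptions=True)
for item, result in zip(items, results):
    if isinstance(result, Exception):
        print(f"{item!r} failed: {result}")
    else:
        handle(result)
```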

Scheduling

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass

@app.function(schedule=modal.Cron("0 9 * * 1-5"))  # 9am weekdays
def daily_report():
    pass
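
Cron strings are easy to misread. A tiny plain-Python helper (hypothetical, not part of Modal) that names the five fields can serve as a sanity check before deploying a schedule:

```python
def describe_cron(expr: str) -> dict:
    """Split a standard five-field cron expression into named fields."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day month weekday")
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    return dict(zip(names, fields))

# "0 9 * * 1-5" -> minute 0 of hour 9, Monday through Friday
print(describe_cron("0 9 * * 1-5"))
```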

CLI Commands

modal run app.py          # Run locally-triggered function
modal serve app.py        # Hot-reload web endpoints
modal deploy app.py       # Deploy persistently
modal shell app.py        # Interactive shell in container
modal app list            # List deployed apps
modal app logs <name>     # Stream logs
modal volume list         # List volumes
modal secret list         # List secrets

Common Patterns

LLM Inference

@app.cls(gpu="H100", image=image)
class LLM:
    @modal.enter()
    def load(self):
        from vllm import LLM
        self.llm = LLM("meta-llama/Llama-3-8B")
    
    @modal.method()
    def generate(self, prompt: str):
        return self.llm.generate(prompt)

Download Models at Build Time

def download_model():
    from huggingface_hub import snapshot_download
    snapshot_download("model-id", local_dir="/models")

image = (
    modal.Image.debian_slim()
    .pip_install("huggingface-hub")
    .run_function(download_model)
)

Concurrency for I/O-bound Work

@app.function()
@modal.concurrent(max_inputs=100)
async def fetch_url(url: str):
    import aiohttp  # imported inside the function: only installed in the image
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

Memory Snapshots (Faster Cold Starts)

@app.cls(enable_memory_snapshot=True, gpu="A10G")
class FastModel:
    @modal.enter(snap=True)
    def load(self):
        self.model = load_model()  # Snapshot this state

Autoscaling

@app.function(
    min_containers=2,       # Always keep 2 warm
    max_containers=100,     # Scale up to 100
    buffer_containers=5,    # Extra buffer for bursts
    scaledown_window=300,   # Keep idle for 5 min
)
def serve():
    pass

Best Practices

  1. Put imports inside functions when packages aren't installed locally
  2. Use @modal.enter() for expensive initialization (model loading)
  3. Pin dependency versions for reproducible builds
  4. Use Volumes for model weights and persistent data
  5. Use memory snapshots for sub-second cold starts in production
  6. Set appropriate timeouts for long-running tasks
  7. Use min_containers=1 for production APIs to keep containers warm
  8. Use absolute imports with full package paths (not relative imports)

Fast Image Builds with uv_sync

Use .uv_sync() instead of .pip_install() for faster dependency installation:

# In pyproject.toml, define dependency groups:
# [dependency-groups]
# modal = ["fastapi", "pydantic-ai>=1.0.0", "logfire"]

image = (
    modal.Image.debian_slim(python_version="3.12")
    .uv_sync("agent", groups=["modal"], frozen=False)
    .add_local_python_source("agent.src")  # Use dot notation for packages
)

Key points:

  • Deploy from project root: modal deploy agent/src/api.py
  • Use dot notation in .add_local_python_source("package.subpackage")
  • Imports must match: from agent.src.config import ... (not relative from .config)

Logfire Observability

Add observability with Logfire (especially for pydantic-ai):

@app.cls(image=image, secrets=[..., modal.Secret.from_name("logfire")], min_containers=1)
class Web:
    @modal.enter()
    def startup(self):
        import logfire
        logfire.configure(send_to_logfire="if-token-present", environment="production", service_name="my-agent")
        logfire.instrument_pydantic_ai()
        self.agent = create_agent()

Reference Documentation

See references/ for detailed guides on images, functions, GPUs, scaling, web endpoints, storage, dicts, queues, sandboxes, and networking.

Official docs: https://modal.com/docs