SKILL.md


---
name: github-repo-skill
description: Guide for creating new GitHub repos and best practice for existing GitHub repos, applicable to both code and non-code projects
license: CC-0
---

github-repo-skill

Overview

To create and maintain high-quality repos that conform to Mungall group / BBOP best practice, use this skill. Use this skill regardless of whether the repo is for code or non-code content (ontologies, LinkML schemas, curated content, analyses, websites). Use this skill for new repos, for migrating legacy repos, and for ongoing maintenance.


Principles

Follow existing copier templates

The Mungall group favors the use of copier and blesses the following templates:

These should always be used for new repos. Pre-existing repos should try to follow them or migrate towards them.

The group also uses drop-in templates for AI integrations:

Favored tools

These are included in the templates above, but here are some general overarching preferences:

  • modern python dev stack: uv, ruff (currently mypy for typing but we may switch to https://docs.astral.sh/ty/)
  • for builds, both just and make are favored, with just favored for non-pipeline cases

Engineering best practice

  • pydantic or pydantic generated from LinkML for data models and data access objects (dataclasses are fine for engine objects)
  • always make sure function/method parameters and return values are typed. Use mypy or an equivalent type checker to verify.
  • testing:
    • follow TDD, use pytest-style tests; @pytest.mark.parametrize is good for combinatorial testing (see the sketch after this list)
    • always use doctests: make them informative for humans while also serving as additional tests
    • ensure unit tests and tests that depend on external APIs, infrastructure, etc. are separated (e.g. pytest.mark.integration)
    • for testing external APIs or services, use vcrpy
    • do not create mock tests unless explicitly requested
    • for data-oriented projects, YAML, TSVs, etc. can go in tests/input or similar
    • for schemas, follow the linkml copier template, and ensure schemas and example test data are validated
    • for ontologies, follow ODK best practice and ensure ontologies are comprehensively axiomatized to allow for reasoner-based checking
  • jupyter notebooks are good for documentation, dashboards, and analysis, but ensure that core logic is separated out and has unit tests
  • CLI:
    • Every library should have a fully featured CLI
    • typer is favored, but click is also good.
    • CLIs, APIs, and MCPs should be shims on top of core logic
    • have separate tests for both core logic and CLIs.
    • Use idiomatic options and clig conventions. Group standards: -i/--input, -o/--output (default stdout), -f/--format (input format), -O/--output-format, -v/-vv
    • When testing Typer/Rich CLIs, set NO_COLOR=1 and TERM=dumb env vars to avoid ANSI escape codes breaking string assertions in CI.
  • Exceptions
    • In general you should not need to worry about catching exceptions, although for a well-polished CLI some catching in the CLI layer is fine
    • IMPORTANT: better to fail fast and know there is a problem than to defensively catch and carry on as if everything is OK (general principle: transparency)
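
The sketch below pulls the testing and CLI conventions above into one place: a typed core function with a doctest, a thin Typer shim using the group's -i/--input and -o/--output standards, parametrized unit tests, an integration-marked test, and a CLI test run with NO_COLOR/TERM set. All names (normalize_label, the normalize command, the suggested file layout) are hypothetical and only for illustration.

```python
# Hypothetical example (e.g. src/mypkg/core.py plus CLI and tests); names are illustrative only.
import sys
from pathlib import Path
from typing import Optional

import typer

app = typer.Typer()


def normalize_label(label: str) -> str:
    """Normalize a free-text label (core logic: fully typed, with a doctest).

    >>> normalize_label("  Iron-Oxidizing  Bacteria ")
    'iron-oxidizing bacteria'
    """
    return " ".join(label.lower().split())


@app.command()
def normalize(
    input_path: Optional[Path] = typer.Option(None, "-i", "--input", help="Input file (default: stdin)"),
    output_path: Optional[Path] = typer.Option(None, "-o", "--output", help="Output file (default: stdout)"),
) -> None:
    """Thin CLI shim over the core logic, using the -i/--input and -o/--output conventions."""
    text = input_path.read_text() if input_path else sys.stdin.read()
    result = "\n".join(normalize_label(line) for line in text.splitlines() if line.strip()) + "\n"
    if output_path:
        output_path.write_text(result)
    else:
        typer.echo(result, nl=False)


# --- tests (e.g. tests/test_core.py and tests/test_cli.py) ---
import pytest
from typer.testing import CliRunner


@pytest.mark.parametrize(
    "raw,expected",
    [
        ("  Iron-Oxidizing  Bacteria ", "iron-oxidizing bacteria"),
        ("COBALT", "cobalt"),
    ],
)
def test_normalize_label(raw: str, expected: str) -> None:
    # Fast unit test of the core logic; no network, runs in the default test set.
    assert normalize_label(raw) == expected


@pytest.mark.integration
def test_against_live_service() -> None:
    # Tests that hit external APIs or infrastructure carry the integration
    # marker so they can be excluded from the default run.
    ...


def test_cli_normalize(tmp_path: Path) -> None:
    # NO_COLOR/TERM keep Rich from emitting ANSI codes that break string assertions in CI.
    runner = CliRunner()
    infile = tmp_path / "labels.txt"
    infile.write_text("  Iron-Oxidizing  Bacteria \n")
    result = runner.invoke(app, ["-i", str(infile)], env={"NO_COLOR": "1", "TERM": "dumb"})
    assert result.exit_code == 0
    assert "iron-oxidizing bacteria" in result.output
```

Collecting the doctest as an additional test is one extra flag: uv run pytest --doctest-modules (or the equivalent addopts entry in pyproject.toml).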

Dependency management

  • use uv add to add new dependencies (or uv add --dev or similar for dev dependencies)
  • libraries should allow somewhat relaxed dependency constraints to avoid diamond dependency problems; applications and infra can pin more tightly.

Git and GitHub Practices

  • always work on branches, commit early and often, make PRs early
  • in general, one PR = one issue (avoid mixing orthogonal concerns). Always reference issues in commits/PR messages
  • use gh on the command line for operations like finding issues and creating PRs
  • all the group copier templates include extensive GitHub Actions for ensuring PRs are high quality
  • the GitHub repo should have adequate metadata, links to docs, and topic tags

Source of truth

  • always have a clear source of truth (SoT) for all content, with github favored
  • where project constraints dictate that the SoT is Google Docs/Sheets, use https://rclone.org/ to sync

Documentation

  • markdown is always favored, but older sites may use sphinx
  • Follow Diátaxis framework: tutorial, how-to, reference, explanation
  • Use examples extensively - examples can double as tests
  • frameworks: mkdocs is generally favored due to simplicity but sphinx is ok for older projects
  • Every project must have a comprehensive, up-to-date README.md (or the README.md can point to a site generated from mkdocs)
  • jupyter notebooks can serve as combined integration tests/docs; use mkdocs-jupyter, and for CLI examples use %%bash cells
  • Formatting tip: lists should be preceded by a blank line to avoid rendering issues with mkdocs

README

Copier Badge

cmm-ai-automation

AI-assisted automation for Critical Mineral Metabolism (CMM) data curation using LinkML, OBO Foundry tools, and Google Sheets integration.

Collaboration

This repository is developed in collaboration with CultureBotAI/CMM-AI, which focuses on AI-driven discovery of microorganisms relevant to critical mineral metabolism. While CMM-AI handles the biological discovery and analysis workflows, this repository provides:

  • Schema-driven data modeling with LinkML
  • Integration with private Google Sheets data sources
  • OBO Foundry ontology tooling for semantic annotation

Integration with Knowledge Graph Ecosystem

This project integrates with several knowledge graph and ontology resources:

| Project | Integration |
| --- | --- |
| kg-microbe | Source of microbial knowledge graph data; CMM strains are linked via kg_node_ids |
| kgx | Knowledge graph exchange format for importing/exporting Biolink Model-compliant data |
| biolink-model | Schema and upper ontology for biological knowledge representation |
| biolink-model-toolkit | Python utilities for working with Biolink Model |
| metpo | Microbial Phenotype Ontology for annotating phenotypic traits of CMM-relevant organisms |
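
As an illustration of the biolink-model-toolkit row above, here is a minimal sketch of how the toolkit is typically used from Python; the specific calls below are general bmt usage, not code taken from this repo.

```python
# Illustrative use of biolink-model-toolkit (bmt); not code from this repo.
from bmt import Toolkit

tk = Toolkit()  # loads the default Biolink Model release

# Look up a Biolink element and walk its ancestry
element = tk.get_element("chemical entity")
print(element.name, element.description)
print(tk.get_ancestors("chemical entity"))
```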

See also:

OLS Embeddings for Semantic Search

This project can leverage pre-computed embeddings from the Ontology Lookup Service (OLS) for semantic search and term mapping. See cthoyt.com/2025/08/04/ontology-text-embeddings.html for background.

Local embeddings database:

  • ~9.5 million term embeddings from OLS-registered ontologies
  • Model: OpenAI text-embedding-3-small (1536 dimensions)
  • Schema: (ontologyId, entityType, iri, document, model, hash, embeddings)
  • Embeddings stored as JSON strings

Planned use cases:

  • Search Google Sheets content (strain names, media ingredients) against ontology terms
  • Generate candidate mappings for unmapped terms
  • Create CMM-specific embedding subsets for faster search

Reference implementation: berkeleybop/metpo embeddings search code.
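
The following is a hedged sketch of semantic search against such a local embeddings table. It assumes (these are assumptions, not confirmed by this repo) that the embeddings live in a DuckDB file named ols_embeddings.duckdb with a table ols_embeddings matching the schema above, and that OPENAI_API_KEY is set so the query can be embedded with the same text-embedding-3-small model. The brute-force scan is illustrative only; over ~9.5 million rows a real implementation would use an ANN index or a CMM-specific subset as described above.

```python
# Hypothetical semantic search over a local OLS embeddings table (see assumptions above).
import json

import duckdb
import numpy as np
from openai import OpenAI


def embed_query(text: str) -> np.ndarray:
    # Embed the query with the same model used for the stored vectors.
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def search(query: str, db_path: str = "ols_embeddings.duckdb", k: int = 5) -> list[tuple[str, str, float]]:
    """Return the top-k (iri, document, cosine similarity) hits for a query string."""
    qvec = embed_query(query)
    qvec = qvec / np.linalg.norm(qvec)
    con = duckdb.connect(db_path, read_only=True)
    hits = []
    for iri, document, emb_json in con.execute("SELECT iri, document, embeddings FROM ols_embeddings").fetchall():
        vec = np.array(json.loads(emb_json))  # embeddings are stored as JSON strings
        score = float(np.dot(qvec, vec / np.linalg.norm(vec)))
        hits.append((iri, document, score))
    hits.sort(key=lambda h: h[2], reverse=True)
    return hits[:k]


if __name__ == "__main__":
    for iri, doc, score in search("iron-oxidizing bacterium"):
        print(f"{score:.3f}\t{iri}\t{doc}")
```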

CMM-AI Data Sources and APIs

The collaborating CultureBotAI/CMM-AI project uses the following APIs and data sources:

NCBI APIs (via Biopython Entrez):

| API | Used For |
| --- | --- |
| Entrez esearch/efetch/esummary | Assembly, BioSample, Taxonomy, PubMed/PMC |
| PMC ID Converter | PMID to PMC ID resolution |
| GEO/SRA | Transcriptomics datasets |
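
A minimal sketch of what an Entrez lookup with Biopython looks like; this is illustrative only (not code lifted from CMM-AI), and the organism name and contact email are placeholders.

```python
# Illustrative Biopython Entrez calls (esearch + esummary).
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder address

# esearch: find NCBI Taxonomy IDs for an organism name
handle = Entrez.esearch(db="taxonomy", term="Acidithiobacillus ferrooxidans")
record = Entrez.read(handle)
handle.close()
taxids = record["IdList"]

# esummary: fetch summaries for the matching taxa
handle = Entrez.esummary(db="taxonomy", id=",".join(taxids))
summaries = Entrez.read(handle)
handle.close()

for s in summaries:
    print(s["Id"], s["ScientificName"], s["Rank"])
```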

Other APIs:

| API | Used For |
| --- | --- |
| KEGG REST | Metabolic pathways |
| PubChem REST | Chemical compounds |
| RCSB PDB | Protein structures |
| UniProt | Protein sequences and annotations |

Database links generated:

  • Culture collections: ATCC, DSMZ, NCIMB
  • MetaCyc pathways
  • DOI resolution
  • AlphaFold predictions
  • JGI IMG/GOLD

Ontologies used: CHEBI, GO, ENVO, OBI, NCBITaxon, MIxS, RHEA, BAO

Related issues:

  • CMM-AI #38 - Document how to obtain KG-Microbe database files
  • CMM-AI #37 - Document sources for curated media data
  • CMM-AI #16 - Document the 5 Data Sources in Schema

Features

  • LinkML Schema: Data models for CMM microbial strain data
  • Google Sheets Integration: Read/write access to private Google Sheets (e.g., BER CMM Data)
  • AI Automation: GitHub Actions with Claude Code for issue triage, summarization, and code assistance
  • OBO Foundry Tools: Integration with OLS (Ontology Lookup Service) for ontology term lookup
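
As a taste of the OLS integration mentioned in the last feature above, here is a hedged sketch of an ontology term lookup against the public OLS4 REST API. The endpoint and parameters are from the public OLS4 documentation, not from this repo's own wrapper code (whose function names are not shown here), and search_ols is a hypothetical helper name.

```python
# Hypothetical OLS4 term lookup; not this repo's own wrapper.
import requests


def search_ols(query: str, ontology: str = "chebi", rows: int = 5) -> list[dict]:
    """Return top OLS search hits (label, obo_id, iri) for a query string."""
    resp = requests.get(
        "https://www.ebi.ac.uk/ols4/api/search",
        params={"q": query, "ontology": ontology, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"label": d.get("label"), "obo_id": d.get("obo_id"), "iri": d.get("iri")}
        for d in resp.json()["response"]["docs"]
    ]


for hit in search_ols("cobalt"):
    print(hit["obo_id"], hit["label"])
```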

Quick Start

# Clone the repository
git clone https://github.com/turbomam/cmm-ai-automation.git
cd cmm-ai-automation

# Install dependencies with uv
uv sync

# Set up Google Sheets credentials (service account)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json

# Or place credentials in default location
# ~/.config/gspread/service_account.json

Google Sheets Usage

from cmm_ai_automation.gsheets import get_sheet_data, list_worksheets

# List available tabs in the BER CMM spreadsheet
tabs = list_worksheets("BER CMM Data for AI - for editing")
print(tabs)

# Read data from a specific tab
df = get_sheet_data("BER CMM Data for AI - for editing", "media_ingredients")
print(df.head())

AI Integration

This repo includes GitHub Actions that respond to @claude mentions in issues and PRs:

  • Issue triage and labeling
  • Issue summarization
  • Code assistance and PR reviews

Requires CLAUDE_CODE_OAUTH_TOKEN secret to be configured.

Documentation Website

https://turbomam.github.io/cmm-ai-automation

Repository Structure

Developer Tools

There are several pre-defined command recipes available, written for the command runner just. To list all pre-defined commands, run just or just --list.

Testing and Quality Assurance

Quick Start

# Install all dependencies including QA tools
uv sync --group qa

# Run unit tests (fast, no network)
uv run pytest

# Run with coverage report
uv run pytest --cov=cmm_ai_automation

Test Categories

| Command | What it runs | Speed |
| --- | --- | --- |
| uv run pytest | Unit tests only (default) | ~1.5s |
| uv run pytest -m integration | Integration tests (real API calls) | Slower |
| uv run pytest --cov=cmm_ai_automation | Unit tests with coverage | ~9s |
| uv run pytest --durations=20 | Show slowest 20 tests | ~1.5s |

Pre-commit Hooks

Pre-commit hooks run automatically before each commit, catching issues early:

# Install pre-commit hooks (one-time setup)
uv sync --group qa
uv run pre-commit install

# Run all hooks manually on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run ruff --all-files
uv run pre-commit run mypy --all-files

Hooks included:

  • ruff - Fast Python linter and formatter
  • ruff-format - Code formatting
  • mypy - Static type checking
  • yamllint - YAML linting
  • codespell - Spell checking
  • typos - Fast typo detection
  • deptry - Dependency checking
  • check-yaml, end-of-file-fixer, trailing-whitespace - General file hygiene

Running Individual QA Tools

# Linting with ruff
uv run ruff check src/
uv run ruff check --fix src/  # Auto-fix issues

# Type checking with mypy
uv run mypy src/cmm_ai_automation/

# Format code
uv run ruff format src/

# Check dependencies
uv run deptry src/

Thorough QA Check (CI-equivalent)

Run everything that CI runs:

# 1. Install all dependencies
uv sync --group qa --group dev

# 2. Run pre-commit on all files
uv run pre-commit run --all-files

# 3. Run tests with coverage
uv run pytest --cov=cmm_ai_automation

# 4. Build documentation (catches doc errors)
uv run mkdocs build

Integration Tests

Integration tests make real API calls and are skipped by default (some APIs block CI IPs):

# Run integration tests (requires network, API keys)
uv run pytest -m integration

# Run specific integration test file
uv run pytest tests/test_chebi.py -m integration

# Run both unit and integration tests
uv run pytest -m ""

API keys for integration tests:

  • CAS_API_KEY - CAS Common Chemistry API
  • Most other APIs (ChEBI, PubChem, MediaDive, NodeNormalization) work without keys
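
The skill-level guidance above recommends vcrpy for recording external API interactions. Below is a hedged sketch of how one of these integration tests could be written with a recorded cassette; the cassette directory and test name are hypothetical, and this repo may organize its integration tests differently.

```python
# Hypothetical integration test using vcrpy to record/replay a PubChem call.
import pytest
import requests
import vcr

my_vcr = vcr.VCR(cassette_library_dir="tests/cassettes", record_mode="once")


@pytest.mark.integration
def test_pubchem_cid_lookup() -> None:
    # First run records tests/cassettes/pubchem_aspirin.yaml; later runs replay it offline.
    with my_vcr.use_cassette("pubchem_aspirin.yaml"):
        resp = requests.get(
            "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON",
            timeout=30,
        )
        resp.raise_for_status()
        assert resp.json()["IdentifierList"]["CID"]
```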

Coverage Targets

Current coverage configuration (see pyproject.toml):

  • Scripts are excluded from coverage (CLI entry points)
  • Target: 30% minimum (see issue #29 for roadmap to 60%)
  • Run uv run pytest --cov-report=term-missing to see uncovered lines

Credits

This project uses the template linkml-project-copier published as doi:10.5281/zenodo.15163584.

AI automation workflows adapted from ai4curation/github-ai-integrations (Monarch Initiative).

Related Projects