---
name: github-repo-skill
description: Guide for creating new GitHub repos and best practice for existing GitHub repos, applicable to both code and non-code projects
license: CC-0
---
# github-repo-skill

## Overview
Use this skill to create and maintain high-quality repos that conform to Mungall group / BBOP best practice. Use it regardless of whether the repo is for code or non-code content (ontologies, LinkML schemas, curated content, analyses, websites), and for new repos, for migrating legacy repos, and for ongoing maintenance.
## Principles

### Follow existing copier templates
The Mungall group favors the use of copier and blesses the following templates:
- For LinkML schemas: https://github.com/linkml/linkml-project-copier
- For code: https://github.com/monarch-initiative/monarch-project-copier
- For ontologies: https://github.com/INCATools/ontology-development-kit (uses a bespoke framework, not copier)

These templates should always be used for new repos; pre-existing repos should try to follow them or migrate towards them.

The group also uses drop-in templates for AI integrations.
### Favored tools

These are included in the templates above, but some general overarching preferences:

- modern Python dev stack: `uv`, `ruff` (currently `mypy` for typing, but we may switch to https://docs.astral.sh/ty/)
- for builds, both `just` and `make` are favored, with `just` favored for non-pipeline cases
### Engineering best practice

- pydantic, or pydantic generated from LinkML, for data models and data access objects (dataclasses are fine for engine objects)
- always make sure function/method parameters and return values are typed; use `mypy` or equivalent to check (see the first sketch after this list)
- testing:
  - follow TDD and use pytest-style tests; `@pytest.mark.parametrize` is good for combinatorial testing (see the first sketch after this list)
  - always use doctests: make them informative for humans while also serving as additional tests
  - ensure unit tests and tests that depend on external APIs, infrastructure, etc. are separated (e.g. `pytest.mark.integration`)
  - for testing external APIs or services, use vcrpy
  - do not create mock tests unless explicitly requested
  - for data-oriented projects, YAML, TSVs, etc. can go in `tests/input` or similar
  - for schemas, follow the LinkML copier template, and ensure schemas and example test data are validated
  - for ontologies, follow ODK best practice and ensure ontologies are comprehensively axiomatized to allow for reasoner-based checking
- Jupyter notebooks are good for documentation, dashboards, and analysis, but ensure that core logic is separated out and has unit tests
- CLI:
  - Every library should have a fully featured CLI
  - typer is favored, but click is also good
  - CLIs, APIs, and MCPs should be shims on top of core logic (see the CLI sketch after this list)
  - have separate tests for both core logic and CLIs
  - Use idiomatic options and clig conventions. Group standards: `-i/--input`, `-o/--output` (default stdout), `-f/--format` (input format), `-O/--output-format`, `-v`/`-vv`
  - When testing Typer/Rich CLIs, set the `NO_COLOR=1` and `TERM=dumb` env vars to avoid ANSI escape codes breaking string assertions in CI
- Exceptions:
  - In general you should not need to worry about catching exceptions, although for a well-polished CLI some catching in the CLI layer is fine
  - IMPORTANT: better to fail fast and know there is a problem than to defensively catch and carry on as if everything is OK (general principle: transparency)
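To make several of the points above concrete (typed pydantic models, doctests, parametrized unit tests, and a separately marked integration test), here is a minimal sketch. The module, class, and function names (`core`, `Strain`, `normalize_strain_name`) are illustrative assumptions, not part of any group template or of this repo.

```python
# core.py -- hypothetical core-logic module illustrating the practices above
from pydantic import BaseModel


class Strain(BaseModel):
    """A minimal, typed data model (in practice, generate this from LinkML)."""

    identifier: str
    name: str
    taxon_id: int | None = None


def normalize_strain_name(name: str) -> str:
    """Collapse internal whitespace and strip a strain name.

    >>> normalize_strain_name("  Cupriavidus   necator ")
    'Cupriavidus necator'
    """
    return " ".join(name.split())
```

```python
# test_core.py -- pytest-style tests; unit tests are the default,
# anything touching the network is marked as an integration test
import pytest

from core import Strain, normalize_strain_name


@pytest.mark.parametrize(
    "raw,expected",
    [
        ("  Cupriavidus   necator ", "Cupriavidus necator"),
        ("Shewanella oneidensis", "Shewanella oneidensis"),
    ],
)
def test_normalize_strain_name(raw: str, expected: str) -> None:
    assert normalize_strain_name(raw) == expected


def test_strain_model_is_typed() -> None:
    strain = Strain(identifier="CMM:0001", name="Cupriavidus necator")
    assert strain.taxon_id is None


@pytest.mark.integration
def test_lookup_against_live_api() -> None:
    # Placeholder for a test that calls a real external API; record and
    # replay the HTTP traffic with vcrpy rather than hand-written mocks.
    ...
```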
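And a sketch of the CLI conventions above: a thin Typer shim over the same hypothetical core module, using the group's standard options, together with a CLI test that keeps ANSI escape codes out of the captured output. Again, the names are illustrative only.

```python
# cli.py -- thin Typer shim: option parsing and I/O live here,
# the real work stays in the (hypothetical) core module
import sys
from pathlib import Path
from typing import Optional

import typer

from core import normalize_strain_name

app = typer.Typer()


@app.command()
def normalize(
    input_path: Optional[Path] = typer.Option(None, "-i", "--input", help="Input file (default: stdin)"),
    output_path: Optional[Path] = typer.Option(None, "-o", "--output", help="Output file (default: stdout)"),
    verbose: int = typer.Option(0, "-v", "--verbose", count=True, help="Increase verbosity (-v, -vv)"),
) -> None:
    text = input_path.read_text() if input_path else sys.stdin.read()
    result = "\n".join(normalize_strain_name(line) for line in text.splitlines())
    if verbose:
        typer.echo(f"normalized {len(text.splitlines())} lines", err=True)
    if output_path:
        output_path.write_text(result + "\n")
    else:
        typer.echo(result)


if __name__ == "__main__":
    app()
```

```python
# test_cli.py -- CLI tests are kept separate from core-logic tests;
# NO_COLOR/TERM avoid ANSI escape codes breaking string assertions
from typer.testing import CliRunner

from cli import app

runner = CliRunner()


def test_normalize_stdin_to_stdout() -> None:
    result = runner.invoke(
        app,
        input="  Cupriavidus   necator \n",
        env={"NO_COLOR": "1", "TERM": "dumb"},
    )
    assert result.exit_code == 0
    assert "Cupriavidus necator" in result.stdout
```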
### Dependency management

- `uv add` to add new dependencies (or `uv add --dev` or similar for dev dependencies)
- libraries should allow somewhat relaxed dependencies to avoid diamond dependency problems; applications and infra can pin more tightly
### Git and GitHub Practices
- always work on branches, commit early and often, make PRs early
- in general, one PR = one issue (avoid mixing orthogonal concerns). Always reference issues in commits/PR messages
- use `gh` on the command line for operations like finding issues and creating PRs
- all the group copier templates include extensive GitHub Actions for ensuring PRs are high quality
- the GitHub repo should have adequate metadata, links to docs, and tags
### Source of truth

- always have a clear source of truth (SoT) for all content, with GitHub favored
- where projects dictate that the SoT is Google Docs/Sheets, use https://rclone.org/ to sync
### Documentation
- markdown is always favored, but older sites may use sphinx
- Follow Diátaxis framework: tutorial, how-to, reference, explanation
- Use examples extensively - examples can double as tests
- frameworks: mkdocs is generally favored due to simplicity but sphinx is ok for older projects
- Every project must have a comprehensive, up-to-date README.md (or the README.md can point to a site generated from mkdocs)
- Jupyter notebooks can serve as combined integration tests/docs; use mkdocs-jupyter, and for CLI examples use `%%bash`
- Formatting tip: lists should be preceded by a blank line to avoid formatting issues with mkdocs
# cmm-ai-automation
AI-assisted automation for Critical Mineral Metabolism (CMM) data curation using LinkML, OBO Foundry tools, and Google Sheets integration.
## Collaboration
This repository is developed in collaboration with CultureBotAI/CMM-AI, which focuses on AI-driven discovery of microorganisms relevant to critical mineral metabolism. While CMM-AI handles the biological discovery and analysis workflows, this repository provides:
- Schema-driven data modeling with LinkML
- Integration with private Google Sheets data sources
- OBO Foundry ontology tooling for semantic annotation
## Integration with Knowledge Graph Ecosystem
This project integrates with several knowledge graph and ontology resources:
| Project | Integration |
|---|---|
| kg-microbe | Source of microbial knowledge graph data; CMM strains are linked via kg_node_ids |
| kgx | Knowledge graph exchange format for importing/exporting Biolink Model-compliant data |
| biolink-model | Schema and upper ontology for biological knowledge representation |
| biolink-model-toolkit | Python utilities for working with Biolink Model |
| metpo | Microbial Phenotype Ontology for annotating phenotypic traits of CMM-relevant organisms |
See also:

- biolink organization - Biolink Model ecosystem
- biopragmatics organization - Identifier and ontology tools including:
  - bioregistry - Integrative registry of biological databases and ontologies
  - curies - CURIE/URI conversion
  - pyobo - Python package for ontologies and nomenclatures
## OLS Embeddings for Semantic Search
This project can leverage pre-computed embeddings from the Ontology Lookup Service (OLS) for semantic search and term mapping. See cthoyt.com/2025/08/04/ontology-text-embeddings.html for background.
Local embeddings database:

- ~9.5 million term embeddings from OLS-registered ontologies
- Model: OpenAI `text-embedding-3-small` (1536 dimensions)
- Schema: `(ontologyId, entityType, iri, document, model, hash, embeddings)`
- Embeddings stored as JSON strings
Planned use cases:
- Search Google Sheets content (strain names, media ingredients) against ontology terms
- Generate candidate mappings for unmapped terms
- Create CMM-specific embedding subsets for faster search
Reference implementation: berkeleybop/metpo embeddings search code.
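As a rough illustration of the planned semantic-search use case, the sketch below embeds a query with the same OpenAI model and ranks stored term embeddings by cosine similarity. The row-loading step, function names, and table access are assumptions for illustration only; see the berkeleybop/metpo code referenced above for a real implementation.

```python
# semantic_search_sketch.py -- hypothetical sketch: embed a query string with
# text-embedding-3-small, then rank stored OLS term embeddings by cosine similarity
import json

import numpy as np
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY


def embed_query(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)


def top_matches(query: str, rows: list[tuple[str, str, str]], n: int = 5) -> list[tuple[float, str, str]]:
    """Rank (iri, document, embeddings-as-JSON) rows by cosine similarity to the query."""
    q = embed_query(query)
    scored = []
    for iri, document, embeddings_json in rows:
        v = np.array(json.loads(embeddings_json))  # embeddings are stored as JSON strings
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, iri, document))
    return sorted(scored, reverse=True)[:n]
```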
## CMM-AI Data Sources and APIs
The collaborating CultureBotAI/CMM-AI project uses the following APIs and data sources:
NCBI APIs (via Biopython Entrez):
| API | Used For |
|---|---|
| Entrez esearch/efetch/esummary | Assembly, BioSample, Taxonomy, PubMed/PMC |
| PMC ID Converter | PMID to PMC ID resolution |
| GEO/SRA | Transcriptomics datasets |
Other APIs:
| API | Used For |
|---|---|
| KEGG REST | Metabolic pathways |
| PubChem REST | Chemical compounds |
| RCSB PDB | Protein structures |
| UniProt | Protein sequences and annotations |
Database links generated:
- Culture collections: ATCC, DSMZ, NCIMB
- MetaCyc pathways
- DOI resolution
- AlphaFold predictions
- JGI IMG/GOLD
Ontologies used: CHEBI, GO, ENVO, OBI, NCBITaxon, MIxS, RHEA, BAO
Related issues:
- CMM-AI #38 - Document how to obtain KG-Microbe database files
- CMM-AI #37 - Document sources for curated media data
- CMM-AI #16 - Document the 5 Data Sources in Schema
## Features
- LinkML Schema: Data models for CMM microbial strain data
- Google Sheets Integration: Read/write access to private Google Sheets (e.g., BER CMM Data)
- AI Automation: GitHub Actions with Claude Code for issue triage, summarization, and code assistance
- OBO Foundry Tools: Integration with OLS (Ontology Lookup Service) for ontology term lookup
## Quick Start

```bash
# Clone the repository
git clone https://github.com/turbomam/cmm-ai-automation.git
cd cmm-ai-automation

# Install dependencies with uv
uv sync

# Set up Google Sheets credentials (service account)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json
# Or place credentials in default location
# ~/.config/gspread/service_account.json
```
## Google Sheets Usage

```python
from cmm_ai_automation.gsheets import get_sheet_data, list_worksheets

# List available tabs in the BER CMM spreadsheet
tabs = list_worksheets("BER CMM Data for AI - for editing")
print(tabs)

# Read data from a specific tab
df = get_sheet_data("BER CMM Data for AI - for editing", "media_ingredients")
print(df.head())
```
## AI Integration
This repo includes GitHub Actions that respond to @claude mentions in issues and PRs:
- Issue triage and labeling
- Issue summarization
- Code assistance and PR reviews
Requires the `CLAUDE_CODE_OAUTH_TOKEN` secret to be configured.
## Documentation Website
https://turbomam.github.io/cmm-ai-automation
## Repository Structure

- docs/ - mkdocs-managed documentation
  - elements/ - generated schema documentation
- examples/ - Examples of using the schema
- project/ - project files (these files are auto-generated, do not edit)
- src/ - source files (edit these)
  - cmm_ai_automation/
    - schema/ -- LinkML schema (edit this)
    - datamodel/ -- generated Python datamodel
- tests/ - Python tests
  - data/ - Example data
## Developer Tools

There are several pre-defined command recipes available.
They are written for the command runner `just`. To list all pre-defined commands, run `just` or `just --list`.
## Testing and Quality Assurance

### Quick Start

```bash
# Install all dependencies including QA tools
uv sync --group qa

# Run unit tests (fast, no network)
uv run pytest

# Run with coverage report
uv run pytest --cov=cmm_ai_automation
```
### Test Categories

| Command | What it runs | Speed |
|---|---|---|
| `uv run pytest` | Unit tests only (default) | ~1.5s |
| `uv run pytest -m integration` | Integration tests (real API calls) | Slower |
| `uv run pytest --cov=cmm_ai_automation` | Unit tests with coverage | ~9s |
| `uv run pytest --durations=20` | Show slowest 20 tests | ~1.5s |
### Pre-commit Hooks

Pre-commit hooks run automatically before each commit, catching issues early:

```bash
# Install pre-commit hooks (one-time setup)
uv sync --group qa
uv run pre-commit install

# Run all hooks manually on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run ruff --all-files
uv run pre-commit run mypy --all-files
```
Hooks included:

- `ruff` - Fast Python linter and formatter
- `ruff-format` - Code formatting
- `mypy` - Static type checking
- `yamllint` - YAML linting
- `codespell` - Spell checking
- `typos` - Fast typo detection
- `deptry` - Dependency checking
- `check-yaml`, `end-of-file-fixer`, `trailing-whitespace` - General file hygiene
### Running Individual QA Tools

```bash
# Linting with ruff
uv run ruff check src/
uv run ruff check --fix src/  # Auto-fix issues

# Type checking with mypy
uv run mypy src/cmm_ai_automation/

# Format code
uv run ruff format src/

# Check dependencies
uv run deptry src/
```
### Thorough QA Check (CI-equivalent)

Run everything that CI runs:

```bash
# 1. Install all dependencies
uv sync --group qa --group dev

# 2. Run pre-commit on all files
uv run pre-commit run --all-files

# 3. Run tests with coverage
uv run pytest --cov=cmm_ai_automation

# 4. Build documentation (catches doc errors)
uv run mkdocs build
```
### Integration Tests

Integration tests make real API calls and are skipped by default (some APIs block CI IPs):

```bash
# Run integration tests (requires network, API keys)
uv run pytest -m integration

# Run specific integration test file
uv run pytest tests/test_chebi.py -m integration

# Run both unit and integration tests
uv run pytest -m ""
```
API keys for integration tests:

- `CAS_API_KEY` - CAS Common Chemistry API
- Most other APIs (ChEBI, PubChem, MediaDive, NodeNormalization) work without keys
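For example, a test that needs the CAS key can mark itself as an integration test and skip cleanly when the key is not configured. This is a hedged sketch; the test name and body are hypothetical, not actual tests in this repo.

```python
# test_cas_integration.py -- hypothetical sketch: integration test that
# requires CAS_API_KEY and skips itself when the key is not configured
import os

import pytest

CAS_API_KEY = os.environ.get("CAS_API_KEY")


@pytest.mark.integration
@pytest.mark.skipif(not CAS_API_KEY, reason="CAS_API_KEY not set")
def test_cas_common_chemistry_lookup() -> None:
    # ... call the CAS Common Chemistry API with CAS_API_KEY here ...
    assert CAS_API_KEY is not None
```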
### Coverage Targets

Current coverage configuration (see `pyproject.toml`):

- Scripts are excluded from coverage (CLI entry points)
- Target: 30% minimum (see issue #29 for the roadmap to 60%)
- Run `uv run pytest --cov-report=term-missing` to see uncovered lines
## Credits
This project uses the template linkml-project-copier published as doi:10.5281/zenodo.15163584.
AI automation workflows adapted from ai4curation/github-ai-integrations (Monarch Initiative).
## Related Projects
- CultureBotAI/CMM-AI - AI-driven discovery for critical mineral metabolism research
- Knowledge-Graph-Hub/kg-microbe - Knowledge graph for microbial data integration
- biolink organization - Biolink Model ecosystem including:
  - biolink/kgx - Knowledge Graph Exchange tools
  - biolink/biolink-model - Schema and upper ontology
  - biolink/biolink-model-toolkit - Python utilities
- berkeleybop/metpo - Microbial Phenotype Ontology for phenotypic trait annotation
- biopragmatics organization - Identifier and ontology tools including:
  - bioregistry - Integrative registry of biological databases and ontologies
  - curies - CURIE/URI conversion
  - pyobo - Python package for ontologies and nomenclatures