SKILL.md


---
name: github-repo-skill
description: Guide for creating new GitHub repos and best practice for existing GitHub repos, applicable to both code and non-code projects
license: CC-0
---

github-repo-skill

Overview

To create and maintain high-quality repos that conform to Mungall group / BBOP best practice, use this skill. Use this skill regardless of whether the repo is for code or non-code content (ontologies, LinkML schemas, curated content, analyses, websites). Use this skill for new repos, for migrating legacy repos, and for ongoing maintenance.


Principles

Follow existing copier templates

The Mungall group favors the use of copier and blesses the following templates:

These should always be used for new repos. Pre-existing repos should try to follow them or migrate towards them.

The group also uses drop-in templates for AI integrations:

Favored tools

These are included in the templates above, but here are some general overarching preferences:

  • modern python dev stack: uv, ruff (currently mypy for typing but we may switch to https://docs.astral.sh/ty/)
  • for builds, both just and make are favored, with just favored for non-pipeline cases

Engineering best practice

  • pydantic or pydantic generated from LinkML for data models and data access objects (dataclasses are fine for engine objects)
  • always make sure function/method parameters and return values are typed. Use mypy or an equivalent type checker to verify.
  • testing:
    • follow TDD, use pytest-style tests; @pytest.mark.parametrize is good for combinatorial testing (see the sketch after this list)
    • always use doctests: make them informative for humans while also serving as additional tests
    • ensure unit tests and tests that depend on external APIs, infrastructure, etc. are separated (e.g. pytest.mark.integration)
    • for testing external APIs or services, use vcrpy
    • do not create mock tests unless explicitly requested
    • for data-oriented projects, YAML, TSVs, etc. can go in tests/input or similar
    • for schemas, follow the linkml copier template, and ensure schemas and example test data are validated
    • for ontologies, follow ODK best practice and ensure ontologies are comprehensively axiomatized to allow for reasoner-based checking
  • jupyter notebooks are good for documentation, dashboards, and analysis, but ensure that core logic is separated out and has unit tests
  • CLI:
    • Every library should have a fully featured CLI
    • typer is favored, but click is also good.
    • CLIs, APIs, and MCPs should be shims on top of core logic
    • have separate tests for both core logic and CLIs.
    • Use idiomatic options and clig conventions. Group standards: -i/--input, -o/--output (default stdout), -f/--format (input format), -O/--output-format, -v/-vv
    • When testing Typer/Rich CLIs, set NO_COLOR=1 and TERM=dumb env vars to avoid ANSI escape codes breaking string assertions in CI.
  • Exceptions
    • In general you should not need to worry about catching exceptions, although for a well-polished CLI some catching in the CLI layer is fine
    • IMPORTANT: better to fail fast and know there is a problem than to defensively catch and carry on as if everything is OK (general principle: transparency)
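
The sketch below pulls the testing and CLI conventions above into one place: a typed core function with a doctest, a thin Typer shim using the group's -i/--input and -o/--output standards, parametrized unit tests, an integration-marked test, and a CLI test run with NO_COLOR/TERM set. All names (normalize_label, the normalize command, the suggested file layout) are hypothetical and only for illustration.

```python
# Hypothetical example (e.g. src/mypkg/core.py plus CLI and tests); names are illustrative only.
import sys
from pathlib import Path
from typing import Optional

import typer

app = typer.Typer()


def normalize_label(label: str) -> str:
    """Normalize a free-text label (core logic: fully typed, with a doctest).

    >>> normalize_label("  Iron-Oxidizing  Bacteria ")
    'iron-oxidizing bacteria'
    """
    return " ".join(label.lower().split())


@app.command()
def normalize(
    input_path: Optional[Path] = typer.Option(None, "-i", "--input", help="Input file (default: stdin)"),
    output_path: Optional[Path] = typer.Option(None, "-o", "--output", help="Output file (default: stdout)"),
) -> None:
    """Thin CLI shim over the core logic, using the -i/--input and -o/--output conventions."""
    text = input_path.read_text() if input_path else sys.stdin.read()
    result = "\n".join(normalize_label(line) for line in text.splitlines() if line.strip()) + "\n"
    if output_path:
        output_path.write_text(result)
    else:
        typer.echo(result, nl=False)


# --- tests (e.g. tests/test_core.py and tests/test_cli.py) ---
import pytest
from typer.testing import CliRunner


@pytest.mark.parametrize(
    "raw,expected",
    [
        ("  Iron-Oxidizing  Bacteria ", "iron-oxidizing bacteria"),
        ("COBALT", "cobalt"),
    ],
)
def test_normalize_label(raw: str, expected: str) -> None:
    # Fast unit test of the core logic; no network, runs in the default test set.
    assert normalize_label(raw) == expected


@pytest.mark.integration
def test_against_live_service() -> None:
    # Tests that hit external APIs or infrastructure carry the integration
    # marker so they can be excluded from the default run.
    ...


def test_cli_normalize(tmp_path: Path) -> None:
    # NO_COLOR/TERM keep Rich from emitting ANSI codes that break string assertions in CI.
    runner = CliRunner()
    infile = tmp_path / "labels.txt"
    infile.write_text("  Iron-Oxidizing  Bacteria \n")
    result = runner.invoke(app, ["-i", str(infile)], env={"NO_COLOR": "1", "TERM": "dumb"})
    assert result.exit_code == 0
    assert "iron-oxidizing bacteria" in result.output
```

Collecting the doctest as an additional test is one extra flag: uv run pytest --doctest-modules (or the equivalent addopts entry in pyproject.toml).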

Dependency management

  • use uv add to add new dependencies (or uv add --dev or similar for dev dependencies)
  • libraries should allow somewhat relaxed dependency constraints to avoid diamond dependency problems; applications and infra can pin more tightly.

Git and GitHub Practices

  • always work on branches, commit early and often, make PRs early
  • in general, one PR = one issue (avoid mixing orthogonal concerns). Always reference issues in commits/PR messages
  • use gh on the command line for operations like finding issues and creating PRs
  • all the group copier templates include extensive GitHub Actions for ensuring PRs are high quality
  • the GitHub repo should have adequate metadata, links to docs, and topic tags

Source of truth

  • always have a clear source of truth (SoT) for all content, with github favored
  • where project constraints dictate that the SoT is Google Docs/Sheets, use https://rclone.org/ to sync

Documentation

  • markdown is always favored, but older sites may use sphinx
  • Follow Diátaxis framework: tutorial, how-to, reference, explanation
  • Use examples extensively - examples can double as tests
  • frameworks: mkdocs is generally favored due to simplicity but sphinx is ok for older projects
  • Every project must have a comprehensive, up-to-date README.md (or the README.md can point to a site generated from mkdocs)
  • jupyter notebooks can serve as combined integration tests/docs; use mkdocs-jupyter, and for CLI examples use %%bash cells
  • Formatting tip: lists should be preceded by a blank line to avoid rendering issues with mkdocs

README

Copier Badge

cmm-ai-automation

AI-assisted automation for Critical Mineral Metabolism (CMM) data curation using LinkML, OBO Foundry tools, and Google Sheets integration.

Collaboration

This repository is developed in collaboration with CultureBotAI/CMM-AI, which focuses on AI-driven discovery of microorganisms relevant to critical mineral metabolism. While CMM-AI handles the biological discovery and analysis workflows, this repository provides:

  • Schema-driven data modeling with LinkML
  • Integration with private Google Sheets data sources
  • OBO Foundry ontology tooling for semantic annotation

Integration with Knowledge Graph Ecosystem

This project integrates with several knowledge graph and ontology resources:

| Project | Integration |
| --- | --- |
| kg-microbe | Source of microbial knowledge graph data; CMM strains are linked via kg_node_ids |
| kgx | Knowledge graph exchange format for importing/exporting Biolink Model-compliant data |
| biolink-model | Schema and upper ontology for biological knowledge representation |
| biolink-model-toolkit | Python utilities for working with Biolink Model |
| metpo | Microbial Phenotype Ontology for annotating phenotypic traits of CMM-relevant organisms |
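
As an illustration of the biolink-model-toolkit row above, here is a minimal sketch of how the toolkit is typically used from Python; the specific calls below are general bmt usage, not code taken from this repo.

```python
# Illustrative use of biolink-model-toolkit (bmt); not code from this repo.
from bmt import Toolkit

tk = Toolkit()  # loads the default Biolink Model release

# Look up a Biolink element and walk its ancestry
element = tk.get_element("chemical entity")
print(element.name, element.description)
print(tk.get_ancestors("chemical entity"))
```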

See also:

OLS Embeddings for Semantic Search

This project can leverage pre-computed embeddings from the Ontology Lookup Service (OLS) for semantic search and term mapping. See cthoyt.com/2025/08/04/ontology-text-embeddings.html for background.

Local embeddings database:

  • ~9.5 million term embeddings from OLS-registered ontologies
  • Model: OpenAI text-embedding-3-small (1536 dimensions)
  • Schema: (ontologyId, entityType, iri, document, model, hash, embeddings)
  • Embeddings stored as JSON strings

Planned use cases:

  • Search Google Sheets content (strain names, media ingredients) against ontology terms
  • Generate candidate mappings for unmapped terms
  • Create CMM-specific embedding subsets for faster search

Reference implementation: berkeleybop/metpo embeddings search code.
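
The following is a hedged sketch of semantic search against such a local embeddings table. It assumes (these are assumptions, not confirmed by this repo) that the embeddings live in a DuckDB file named ols_embeddings.duckdb with a table ols_embeddings matching the schema above, and that OPENAI_API_KEY is set so the query can be embedded with the same text-embedding-3-small model. The brute-force scan is illustrative only; over ~9.5 million rows a real implementation would use an ANN index or a CMM-specific subset as described above.

```python
# Hypothetical semantic search over a local OLS embeddings table (see assumptions above).
import json

import duckdb
import numpy as np
from openai import OpenAI


def embed_query(text: str) -> np.ndarray:
    # Embed the query with the same model used for the stored vectors.
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def search(query: str, db_path: str = "ols_embeddings.duckdb", k: int = 5) -> list[tuple[str, str, float]]:
    """Return the top-k (iri, document, cosine similarity) hits for a query string."""
    qvec = embed_query(query)
    qvec = qvec / np.linalg.norm(qvec)
    con = duckdb.connect(db_path, read_only=True)
    hits = []
    for iri, document, emb_json in con.execute("SELECT iri, document, embeddings FROM ols_embeddings").fetchall():
        vec = np.array(json.loads(emb_json))  # embeddings are stored as JSON strings
        score = float(np.dot(qvec, vec / np.linalg.norm(vec)))
        hits.append((iri, document, score))
    hits.sort(key=lambda h: h[2], reverse=True)
    return hits[:k]


if __name__ == "__main__":
    for iri, doc, score in search("iron-oxidizing bacterium"):
        print(f"{score:.3f}\t{iri}\t{doc}")
```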

CMM-AI Data Sources and APIs

The collaborating CultureBotAI/CMM-AI project uses the following APIs and data sources:

NCBI APIs (via Biopython Entrez):

| API | Used For |
| --- | --- |
| Entrez esearch/efetch/esummary | Assembly, BioSample, Taxonomy, PubMed/PMC |
| PMC ID Converter | PMID to PMC ID resolution |
| GEO/SRA | Transcriptomics datasets |
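
A minimal sketch of what an Entrez lookup with Biopython looks like; this is illustrative only (not code lifted from CMM-AI), and the organism name and contact email are placeholders.

```python
# Illustrative Biopython Entrez calls (esearch + esummary).
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email; placeholder address

# esearch: find NCBI Taxonomy IDs for an organism name
handle = Entrez.esearch(db="taxonomy", term="Acidithiobacillus ferrooxidans")
record = Entrez.read(handle)
handle.close()
taxids = record["IdList"]

# esummary: fetch summaries for the matching taxa
handle = Entrez.esummary(db="taxonomy", id=",".join(taxids))
summaries = Entrez.read(handle)
handle.close()

for s in summaries:
    print(s["Id"], s["ScientificName"], s["Rank"])
```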

Other APIs:

| API | Used For |
| --- | --- |
| KEGG REST | Metabolic pathways |
| PubChem REST | Chemical compounds |
| RCSB PDB | Protein structures |
| UniProt | Protein sequences and annotations |

Database links generated:

  • Culture collections: ATCC, DSMZ, NCIMB
  • MetaCyc pathways
  • DOI resolution
  • AlphaFold predictions
  • JGI IMG/GOLD

Ontologies used: CHEBI, GO, ENVO, OBI, NCBITaxon, MIxS, RHEA, BAO

Related issues:

  • CMM-AI #38 - Document how to obtain KG-Microbe database files
  • CMM-AI #37 - Document sources for curated media data
  • CMM-AI #16 - Document the 5 Data Sources in Schema

Features

  • LinkML Schema: Data models for CMM microbial strain data
  • Google Sheets Integration: Read/write access to private Google Sheets (e.g., BER CMM Data)
  • AI Automation: GitHub Actions with Claude Code for issue triage, summarization, and code assistance
  • OBO Foundry Tools: Integration with OLS (Ontology Lookup Service) for ontology term lookup
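
As a taste of the OLS integration mentioned in the last feature above, here is a hedged sketch of an ontology term lookup against the public OLS4 REST API. The endpoint and parameters are from the public OLS4 documentation, not from this repo's own wrapper code (whose function names are not shown here), and search_ols is a hypothetical helper name.

```python
# Hypothetical OLS4 term lookup; not this repo's own wrapper.
import requests


def search_ols(query: str, ontology: str = "chebi", rows: int = 5) -> list[dict]:
    """Return top OLS search hits (label, obo_id, iri) for a query string."""
    resp = requests.get(
        "https://www.ebi.ac.uk/ols4/api/search",
        params={"q": query, "ontology": ontology, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"label": d.get("label"), "obo_id": d.get("obo_id"), "iri": d.get("iri")}
        for d in resp.json()["response"]["docs"]
    ]


for hit in search_ols("cobalt"):
    print(hit["obo_id"], hit["label"])
```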

Quick Start

# Clone the repository
git clone https://github.com/turbomam/cmm-ai-automation.git
cd cmm-ai-automation

# Install dependencies with uv
uv sync

# Set up Google Sheets credentials (service account)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json

# Or place credentials in default location
# ~/.config/gspread/service_account.json

Google Sheets Usage

from cmm_ai_automation.gsheets import get_sheet_data, list_worksheets

# List available tabs in the BER CMM spreadsheet
tabs = list_worksheets("BER CMM Data for AI - for editing")
print(tabs)

# Read data from a specific tab
df = get_sheet_data("BER CMM Data for AI - for editing", "media_ingredients")
print(df.head())

AI Integration

This repo includes GitHub Actions that respond to @claude mentions in issues and PRs:

  • Issue triage and labeling
  • Issue summarization
  • Code assistance and PR reviews

Requires CLAUDE_CODE_OAUTH_TOKEN secret to be configured.

Documentation Website

https://turbomam.github.io/cmm-ai-automation

Repository Structure

Developer Tools

There are several pre-defined command recipes available, written for the command runner just. To list all pre-defined commands, run just or just --list.

Testing and Quality Assurance

Quick Start

# Install all dependencies including QA tools
uv sync --group qa

# Run unit tests (fast, no network)
uv run pytest

# Run with coverage report
uv run pytest --cov=cmm_ai_automation

Test Categories

| Command | What it runs | Speed |
| --- | --- | --- |
| uv run pytest | Unit tests only (default) | ~1.5s |
| uv run pytest -m integration | Integration tests (real API calls) | Slower |
| uv run pytest --cov=cmm_ai_automation | Unit tests with coverage | ~9s |
| uv run pytest --durations=20 | Show slowest 20 tests | ~1.5s |

Pre-commit Hooks

Pre-commit hooks run automatically before each commit, catching issues early:

# Install pre-commit hooks (one-time setup)
uv sync --group qa
uv run pre-commit install

# Run all hooks manually on all files
uv run pre-commit run --all-files

# Run specific hook
uv run pre-commit run ruff --all-files
uv run pre-commit run mypy --all-files

Hooks included:

  • ruff - Fast Python linter and formatter
  • ruff-format - Code formatting
  • mypy - Static type checking
  • yamllint - YAML linting
  • codespell - Spell checking
  • typos - Fast typo detection
  • deptry - Dependency checking
  • check-yaml, end-of-file-fixer, trailing-whitespace - General file hygiene

Running Individual QA Tools

# Linting with ruff
uv run ruff check src/
uv run ruff check --fix src/  # Auto-fix issues

# Type checking with mypy
uv run mypy src/cmm_ai_automation/

# Format code
uv run ruff format src/

# Check dependencies
uv run deptry src/

Thorough QA Check (CI-equivalent)

Run everything that CI runs:

# 1. Install all dependencies
uv sync --group qa --group dev

# 2. Run pre-commit on all files
uv run pre-commit run --all-files

# 3. Run tests with coverage
uv run pytest --cov=cmm_ai_automation

# 4. Build documentation (catches doc errors)
uv run mkdocs build

Integration Tests

Integration tests make real API calls and are skipped by default (some APIs block CI IPs):

# Run integration tests (requires network, API keys)
uv run pytest -m integration

# Run specific integration test file
uv run pytest tests/test_chebi.py -m integration

# Run both unit and integration tests
uv run pytest -m ""

API keys for integration tests:

  • CAS_API_KEY - CAS Common Chemistry API
  • Most other APIs (ChEBI, PubChem, MediaDive, NodeNormalization) work without keys
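
The skill-level guidance above recommends vcrpy for recording external API interactions. Below is a hedged sketch of how one of these integration tests could be written with a recorded cassette; the cassette directory and test name are hypothetical, and this repo may organize its integration tests differently.

```python
# Hypothetical integration test using vcrpy to record/replay a PubChem call.
import pytest
import requests
import vcr

my_vcr = vcr.VCR(cassette_library_dir="tests/cassettes", record_mode="once")


@pytest.mark.integration
def test_pubchem_cid_lookup() -> None:
    # First run records tests/cassettes/pubchem_aspirin.yaml; later runs replay it offline.
    with my_vcr.use_cassette("pubchem_aspirin.yaml"):
        resp = requests.get(
            "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON",
            timeout=30,
        )
        resp.raise_for_status()
        assert resp.json()["IdentifierList"]["CID"]
```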

Coverage Targets

Current coverage configuration (see pyproject.toml):

  • Scripts are excluded from coverage (CLI entry points)
  • Target: 30% minimum (see issue #29 for roadmap to 60%)
  • Run uv run pytest --cov-report=term-missing to see uncovered lines

Credits

This project uses the template linkml-project-copier published as doi:10.5281/zenodo.15163584.

AI automation workflows adapted from ai4curation/github-ai-integrations (Monarch Initiative).

Related Projects