Monitoring
153 skills in DevOps > Monitoring
server-management
Server management procedures including PM2, monitoring, and log management. CRITICAL for production operations.
monitoring-observability
Implement comprehensive monitoring, logging, metrics, tracing, and alerting for production applications to ensure reliability and quick incident response. Use when setting up application monitoring, implementing structured logging, creating metrics and dashboards, setting up alerts, implementing distributed tracing, monitoring performance, tracking errors, or building observability into applications.
sentry-performance-monitoring
Use when setting up performance monitoring, distributed tracing, or profiling with Sentry. Covers transactions, spans, and performance insights.
sre-monitoring-and-observability
Use when building comprehensive monitoring and observability systems.
aws-cost-operations
This skill provides AWS cost optimization, monitoring, and operational best practices with integrated MCP servers for billing analysis, cost estimation, observability, and security assessment.
prometheus-monitoring
Set up Prometheus monitoring for applications with custom metrics, scraping configurations, and service discovery. Use when implementing time-series metrics collection, monitoring applications, or building observability infrastructure.
correlation-tracing
Implement distributed tracing with correlation IDs, trace propagation, and span tracking across microservices. Use when debugging distributed systems, monitoring request flows, or implementing observability.
log-aggregation
Implement centralized logging with ELK Stack, Loki, or Splunk for log collection, parsing, storage, and analysis across infrastructure.
dev-sre
Gate 2 of the development cycle. VALIDATES that observability was correctly implementedby developers. Does NOT implement observability code - only validates it.
infrastructure-monitoring
Set up comprehensive infrastructure monitoring with Prometheus, Grafana, and alerting systems for metrics, health checks, and performance tracking.
qa-observability
Production observability and performance engineering with OpenTelemetry, distributed tracing, metrics, logging, SLO/SLI design, capacity planning, performance profiling, APM integration, and observability maturity progression for modern cloud-native systems.
performance-monitor
Expert performance monitor specializing in system-wide metrics collection, analysis, and optimization. Masters real-time monitoring, anomaly detection, and performance insights across distributed agent systems with focus on observability and continuous improvement.
site-reliability-engineer
Production monitoring, observability, SLO/SLI management, and incident response.Trigger terms: monitoring, observability, SRE, site reliability, alerting, incident response,SLO, SLI, error budget, Prometheus, Grafana, Datadog, New Relic, ELK stack, logs, metrics,traces, on-call, production monitoring, health checks, uptime, availability, dashboards,post-mortem, incident management, runbook.Completes SDD Stage 8 (Monitoring) with comprehensive production observability:- SLI/SLO definitions and tracking- Monitoring stack setup (Prometheus, Grafana, ELK, Datadog, etc.)- Alert rules and notification channels- Incident response runbooks- Observability dashboards (logs, metrics, traces)- Post-mortem templates and analysis- Health check endpoints- Error budget trackingUse when: user needs production monitoring, observability platform, alerting, SLOs,incident response, or post-deployment health tracking.
site-reliability-engineer
Production monitoring, observability, SLO/SLI management, and incident response.Trigger terms: monitoring, observability, SRE, site reliability, alerting, incident response,SLO, SLI, error budget, Prometheus, Grafana, Datadog, New Relic, ELK stack, logs, metrics,traces, on-call, production monitoring, health checks, uptime, availability, dashboards,post-mortem, incident management, runbook.Completes SDD Stage 8 (Monitoring) with comprehensive production observability:- SLI/SLO definitions and tracking- Monitoring stack setup (Prometheus, Grafana, ELK, Datadog, etc.)- Alert rules and notification channels- Incident response runbooks- Observability dashboards (logs, metrics, traces)- Post-mortem templates and analysis- Health check endpoints- Error budget trackingUse when: user needs production monitoring, observability platform, alerting, SLOs,incident response, or post-deployment health tracking.
grey-haven-observability-engineering
Production-ready monitoring, logging, and tracing using Prometheus, Grafana, OpenTelemetry, DataDog, and Sentry. Use when setting up production monitoring, implementing SLOs, distributed tracing, or performance tracking.
observability-instrumentation
Comprehensive observability methodology implementing three pillars (logs, metrics, traces) with structured logging using Go slog, Prometheus-style metrics, and distributed tracing patterns. Use when adding observability from scratch, logs unstructured or inadequate, no metrics collection, debugging production issues difficult, or need performance monitoring. Provides structured logging patterns (contextual logging, log levels DEBUG/INFO/WARN/ERROR, request ID propagation), metrics instrumentation (counter/gauge/histogram patterns, Prometheus exposition), tracing setup (span creation, context propagation, sampling strategies), and Go slog best practices (JSON formatting, attribute management, handler configuration). Validated in meta-cc with 23-46x speedup vs ad-hoc logging, 90-95% transferability across languages (slog specific to Go but patterns universal).
ln-367-observability-auditor
Observability audit worker (L3). Checks structured logging, health check endpoints, metrics collection, request tracing, log levels. Returns findings with severity, location, effort, recommendations.
monitoring-expert
Use when setting up monitoring systems, logging, metrics, tracing, or alerting. Invoke for dashboards, Prometheus/Grafana, load testing, profiling, capacity planning. Keywords: monitoring, observability, logging, metrics, tracing, alerting, Prometheus, Grafana.
monitoring
Monitoring standards for monitoring in Devops environments. Covers best
agent-performance-monitor
Expert performance monitor specializing in system-wide metrics collection, analysis, and optimization. Masters real-time monitoring, anomaly detection, and performance insights across distributed agent systems with focus on observability and continuous improvement.