---
name: aws-monitoring
description: Debug AWS resource issues, check Lambda logs, and monitor deployed services. Use when investigating production issues, checking CloudWatch logs, or debugging deployment failures.
allowed-tools: Bash, Read, Grep
---

AWS Monitoring Skill

This skill helps you monitor and debug AWS resources for the SG Cars Trends platform.

When to Use This Skill

  • Investigating production errors
  • Checking Lambda function logs
  • Monitoring API performance
  • Debugging deployment failures
  • Analyzing CloudWatch metrics
  • Setting up alarms
  • Troubleshooting resource issues

Monitoring Tools

SST Console

SST provides a built-in console for monitoring:

# Open SST console for specific stage
npx sst console --stage production
npx sst console --stage staging
npx sst console --stage dev

Features:

  • Real-time Lambda logs
  • Function invocations
  • Error tracking
  • Resource overview
  • Environment variables

CloudWatch Logs

Access Lambda logs via CloudWatch:

# View logs using SST
npx sst logs --stage production

# View specific function logs
npx sst logs --stage production --function api

# Tail logs in real-time
npx sst logs --stage production --function api --tail

# Filter logs
npx sst logs --stage production --function api --filter "ERROR"

# Show logs from specific time
npx sst logs --stage production --function api --since 1h
npx sst logs --stage production --function api --since "2024-01-15 10:00"

AWS CLI

Use AWS CLI for advanced log queries:

# List log groups
aws logs describe-log-groups \
  --log-group-name-prefix "/aws/lambda/sgcarstrends"

# Get recent log streams
aws logs describe-log-streams \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --order-by LastEventTime \
  --descending \
  --max-items 5

# Tail logs
aws logs tail "/aws/lambda/sgcarstrends-api-production" --follow

# Filter logs
aws logs filter-log-events \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000

# Get logs for specific request
aws logs filter-log-events \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --filter-pattern "request-id-here"

CloudWatch Metrics

Lambda Metrics

# Get Lambda invocations
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=sgcarstrends-api-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Get errors
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=sgcarstrends-api-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Get duration
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=sgcarstrends-api-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average,Maximum

API Gateway Metrics

# Get API requests
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Count \
  --dimensions Name=ApiName,Value=sgcarstrends-api \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Get 4XX errors
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 4XXError \
  --dimensions Name=ApiName,Value=sgcarstrends-api \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

# Get latency
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name Latency \
  --dimensions Name=ApiName,Value=sgcarstrends-api \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average,Maximum \
  --extended-statistics p99
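
When you need several metrics in one call, get-metric-data batches them. A minimal sketch, reusing the function name assumed throughout this skill:

# Fetch invocations and errors together with get-metric-data
aws cloudwatch get-metric-data \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --metric-data-queries '[
    {"Id": "invocations", "MetricStat": {"Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations", "Dimensions": [{"Name": "FunctionName", "Value": "sgcarstrends-api-production"}]}, "Period": 300, "Stat": "Sum"}},
    {"Id": "errors", "MetricStat": {"Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors", "Dimensions": [{"Name": "FunctionName", "Value": "sgcarstrends-api-production"}]}, "Period": 300, "Stat": "Sum"}}
  ]'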

CloudWatch Alarms

Creating Alarms

// infra/alarms.ts
import { StackContext, use } from "sst/constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cwActions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions";
import { API } from "./api";

export function Alarms({ stack, app }: StackContext) {
  const { api } = use(API);

  // Only create alarms for production
  if (app.stage !== "production") {
    return;
  }

  // SNS topic for alarms
  const alarmTopic = new sns.Topic(stack, "AlarmTopic");

  // Add email subscription
  alarmTopic.addSubscription(
    new subscriptions.EmailSubscription("[email protected]")
  );

  // High error rate alarm
  new cloudwatch.Alarm(stack, "ApiHighErrorRate", {
    metric: api.metricErrors(),
    threshold: 10,
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API has high error rate",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cwActions.SnsAction(alarmTopic));

  // High duration alarm
  new cloudwatch.Alarm(stack, "ApiHighDuration", {
    metric: api.metricDuration(),
    threshold: 5000, // 5 seconds
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API response time is high",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cwActions.SnsAction(alarmTopic));

  // Throttle alarm
  new cloudwatch.Alarm(stack, "ApiThrottled", {
    metric: api.metricThrottles(),
    threshold: 1,
    evaluationPeriods: 1,
    alarmDescription: "API is being throttled",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cwActions.SnsAction(alarmTopic));
}

Add to SST config:

// sst.config.ts (SST expects this file at the project root)
import type { SSTConfig } from "sst";
// Import paths assume the infra/ layout used in the stack files above
import { DNS } from "./infra/dns";
import { API } from "./infra/api";
import { Web } from "./infra/web";
import { Alarms } from "./infra/alarms";

export default {
  config() {
    return { name: "sgcarstrends" }; // app name assumed from the resource prefix
  },
  stacks(app) {
    app
      .stack(DNS)
      .stack(API)
      .stack(Web)
      .stack(Alarms); // Add alarms stack
  },
} satisfies SSTConfig;

Managing Alarms via CLI

# List alarms
aws cloudwatch describe-alarms

# Get alarm state
aws cloudwatch describe-alarms \
  --alarm-names "sgcarstrends-ApiHighErrorRate"

# Disable alarm
aws cloudwatch disable-alarm-actions \
  --alarm-names "sgcarstrends-ApiHighErrorRate"

# Enable alarm
aws cloudwatch enable-alarm-actions \
  --alarm-names "sgcarstrends-ApiHighErrorRate"

# Delete alarm
aws cloudwatch delete-alarms \
  --alarm-names "sgcarstrends-ApiHighErrorRate"
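
Alarms can also be created directly from the CLI, without touching the CDK stack. A sketch roughly equivalent to the ApiHighErrorRate alarm above; the SNS topic ARN is a placeholder you would take from your stack outputs:

# Create an error-rate alarm from the CLI
aws cloudwatch put-metric-alarm \
  --alarm-name "sgcarstrends-ApiHighErrorRate" \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=sgcarstrends-api-production \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions <sns-topic-arn>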

CloudWatch Insights

Querying Logs

# Start query
aws logs start-query \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'

# Get query results
aws logs get-query-results --query-id <query-id>
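
start-query is asynchronous, so scripts need to poll until the query finishes. A small sketch:

# Run an Insights query end-to-end, polling until it completes
QUERY_ID=$(aws logs start-query \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20' \
  --query 'queryId' --output text)

# Poll until CloudWatch reports the query Complete
while [ "$(aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text)" != "Complete" ]; do
  sleep 2
done

aws logs get-query-results --query-id "$QUERY_ID"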

Common Queries

Find errors:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

API performance:

fields @timestamp, @duration
| stats avg(@duration), max(@duration), min(@duration)

Count errors by type:

fields @message
| filter @message like /ERROR/
| parse @message /(?<errorType>\w+Error)/
| stats count() by errorType

Slow requests:

fields @timestamp, @duration, @requestId
| filter @duration > 1000
| sort @duration desc
| limit 20

Request rate:

fields @timestamp
| stats count() by bin(5m)

X-Ray Tracing

Enable X-Ray

// infra/api.ts
import { StackContext, Function } from "sst/constructs";

export function API({ stack }: StackContext) {
  const api = new Function(stack, "api", {
    handler: "apps/api/src/index.handler",
    tracing: "active", // Enable X-Ray (SST takes the lowercase string form)
  });

  return { api };
}

Instrument Code

// apps/api/src/index.ts
import { captureAWSv3Client } from "aws-xray-sdk-core";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";

// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));

View Traces

# Get service graph
aws xray get-service-graph \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s)

# Get trace summaries
aws xray get-trace-summaries \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s)

# Get trace details
aws xray batch-get-traces --trace-ids <trace-id>
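
get-trace-summaries also accepts a filter expression, which narrows results server-side instead of scanning everything. A sketch using X-Ray's expression syntax to keep only slow or errored traces:

# Only traces slower than 1 second or marked as errors
aws xray get-trace-summaries \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --filter-expression 'responsetime > 1 OR error = true'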

Resource Monitoring

Lambda Functions

# List functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `sgcarstrends`)].FunctionName'

# Get function config
aws lambda get-function-configuration \
  --function-name sgcarstrends-api-production

# Get function code location
aws lambda get-function \
  --function-name sgcarstrends-api-production

# Invoke function (AWS CLI v2 requires --cli-binary-format for raw JSON payloads)
aws lambda invoke \
  --function-name sgcarstrends-api-production \
  --cli-binary-format raw-in-base64-out \
  --payload '{"path": "/health"}' \
  response.json

cat response.json

CloudFront Distributions

# List distributions
aws cloudfront list-distributions \
  --query 'DistributionList.Items[*].[Id,DomainName,Status]' \
  --output table

# Get distribution config
aws cloudfront get-distribution-config --id <distribution-id>

# Create invalidation (cache clear)
aws cloudfront create-invalidation \
  --distribution-id <distribution-id> \
  --paths "/*"

# List invalidations
aws cloudfront list-invalidations --distribution-id <distribution-id>
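
Invalidations are asynchronous; to block until one finishes, the CLI ships a waiter:

# Wait for an invalidation to finish
aws cloudfront wait invalidation-completed \
  --distribution-id <distribution-id> \
  --id <invalidation-id>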

S3 Buckets

# List buckets
aws s3 ls

# Get bucket size
aws s3 ls s3://bucket-name --recursive --summarize | grep "Total Size"

# Monitor bucket metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name BucketSizeBytes \
  --dimensions Name=BucketName,Value=bucket-name Name=StorageType,Value=StandardStorage \
  --start-time $(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average

Cost Monitoring

Cost Explorer

# Get cost and usage
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=SERVICE

# Get cost by tag
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=TAG,Key=Environment

Budget Alerts

Create a budget in the AWS Console or via the CLI:

# Create budget
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json
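
The two JSON files are not shown above; a minimal sketch of what they might contain (the budget name, amount, and email address are placeholders to adjust):

# Write example budget.json and notifications.json
cat > budget.json <<'EOF'
{
  "BudgetName": "sgcarstrends-monthly",
  "BudgetLimit": { "Amount": "50", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF

cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "alerts@example.com" }
    ]
  }
]
EOF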

Debugging Production Issues

1. Check Recent Deployments

# Get stack events
aws cloudformation describe-stack-events \
  --stack-name sgcarstrends-api-production \
  --max-items 50

# Get deployment status
npx sst stacks info API --stage production

2. Check Logs for Errors

# Get recent errors
npx sst logs --stage production --function api --filter "ERROR" --since 1h

# Or use AWS CLI
aws logs tail "/aws/lambda/sgcarstrends-api-production" \
  --follow \
  --filter-pattern "ERROR"

3. Check Metrics

# Check invocations and errors
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=sgcarstrends-api-production \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Sum

4. Test Endpoint

# Test API directly
curl -I https://api.sgcarstrends.com/health

# Test with verbose output
curl -v https://api.sgcarstrends.com/health

5. Check Resource Limits

# Check Lambda quotas
aws service-quotas get-service-quota \
  --service-code lambda \
  --quota-code L-B99A9384  # Concurrent executions

# Check API Gateway quotas
aws service-quotas list-service-quotas \
  --service-code apigateway
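
To see how close you are to the concurrency quota, compare it against the account-level ConcurrentExecutions metric:

# Peak account-wide Lambda concurrency over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Maximum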

Common Issues

High Latency

Investigation:

  1. Check Lambda duration metrics
  2. Review CloudWatch Insights for slow queries
  3. Check database connection pool
  4. Review API response times

Solutions:

  • Increase Lambda memory (see the sketch after this list)
  • Optimize database queries
  • Add caching
  • Use connection pooling
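
For the first solution, memory can be raised from the CLI for a quick latency test, though for SST-managed functions the permanent change belongs in the stack definition or it will be overwritten on the next deploy. A sketch; the 1024 MB value is only an example:

# Temporarily raise Lambda memory for a latency test
aws lambda update-function-configuration \
  --function-name sgcarstrends-api-production \
  --memory-size 1024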

High Error Rate

Investigation:

  1. Check error logs
  2. Review error types
  3. Check external service status
  4. Verify environment variables

Solutions:

  • Fix application bugs
  • Add error handling
  • Retry failed requests
  • Check API rate limits

Cold Starts

Investigation:

  1. Check init duration (see the Insights sketch below)
  2. Review package size
  3. Check provisioned concurrency
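
Init duration only appears on cold-start invocations, which makes cold starts easy to isolate in Logs Insights. A sketch, assuming the same log group as above:

# Surface cold starts: @initDuration is only present on cold-start REPORT lines
aws logs start-query \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --start-time $(date -u -d '1 day ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'filter @type = "REPORT" and ispresent(@initDuration) | stats count(), avg(@initDuration), max(@initDuration) by bin(1h)'

Fetch the results with get-query-results as shown earlier.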

Solutions:

  • Enable provisioned concurrency
  • Reduce bundle size
  • Use ARM architecture
  • Optimize imports

Monitoring Scripts

Health Check Script

#!/bin/bash
# scripts/health-check.sh

STAGE=${1:-production}

# Production has no stage prefix in its hostnames
if [ "$STAGE" = "production" ]; then
  API_URL="https://api.sgcarstrends.com"
  WEB_URL="https://sgcarstrends.com"
else
  API_URL="https://api.$STAGE.sgcarstrends.com"
  WEB_URL="https://$STAGE.sgcarstrends.com"
fi

echo "Checking health of $STAGE environment..."

# Check API
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")

if [ "$API_STATUS" -eq 200 ]; then
  echo "✓ API is healthy"
else
  echo "✗ API is down (status: $API_STATUS)"
  exit 1
fi

# Check Web
WEB_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$WEB_URL")

if [ "$WEB_STATUS" -eq 200 ]; then
  echo "✓ Web is healthy"
else
  echo "✗ Web is down (status: $WEB_STATUS)"
  exit 1
fi

echo "All services are healthy!"

Run:

chmod +x scripts/health-check.sh
./scripts/health-check.sh production

Log Analysis Script

#!/bin/bash
# scripts/analyze-logs.sh

STAGE=${1:-production}
LOG_GROUP="/aws/lambda/sgcarstrends-api-$STAGE"

echo "Analyzing logs for $STAGE..."

# Count errors in the last hour (the CLI auto-paginates, so length(events) counts all pages)
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'length(events)' \
  --output text)

echo "Errors in last hour: $ERROR_COUNT"

# Get top errors
echo -e "\nTop error types:"
aws logs filter-log-events \
  --log-group-name "$LOG_GROUP" \
  --filter-pattern "ERROR" \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --query 'events[*].message' \
  --output text | \
  grep -oE '\w+Error' | \
  sort | uniq -c | sort -rn | head -5

Best Practices

  1. Log Levels: Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
  2. Structured Logging: Use JSON format for easier parsing
  3. Correlation IDs: Track requests across services
  4. Alarms: Set up alarms for critical metrics
  5. Dashboards: Create CloudWatch dashboards for key metrics
  6. Cost Monitoring: Track AWS costs regularly
  7. Regular Reviews: Review logs and metrics weekly
  8. Retention: Set appropriate log retention (7-30 days)
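
The last practice can be applied per log group with put-retention-policy; for example:

# Set 30-day retention on the production API log group
aws logs put-retention-policy \
  --log-group-name "/aws/lambda/sgcarstrends-api-production" \
  --retention-in-days 30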