Monitoring and Alerting

FinOps Optimizer includes monitoring and alerting so that you can track cost optimization performance and respond to issues before they grow.

Overview

The monitoring system tracks:

Cost metrics: Real-time cost tracking and trend analysis
Optimization performance: Savings achieved and recommendations applied
Resource utilization: CPU, memory, and storage usage patterns
Alert conditions: Cost spikes, underutilized resources, and security issues
System health: API availability and performance metrics

Configuration

Basic Monitoring Setup

monitoring:
  enabled: true
  metrics_retention_days: 30
  alert_threshold: 20.0
  check_interval_minutes: 15

  # Cost monitoring
  cost_monitoring:
    enabled: true
    daily_budget: 1000.0
    weekly_budget: 7000.0
    monthly_budget: 30000.0

  # Performance monitoring
  performance_monitoring:
    enabled: true
    cpu_threshold: 80.0
    memory_threshold: 85.0
    storage_threshold: 90.0

  # Security monitoring
  security_monitoring:
    enabled: true
    check_public_resources: true
    check_unencrypted_storage: true
    check_iam_permissions: true
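
To illustrate how the budget values above might be used, the following sketch compares accumulated spend against each budget and reports the percentage consumed. The function `budget_usage` and the shape of its input are assumptions made for this example; they are not part of the FinOps Optimizer API.

```python
# Budgets copied from the cost_monitoring section above.
BUDGETS = {"daily": 1000.0, "weekly": 7000.0, "monthly": 30000.0}


def budget_usage(spend, budgets=BUDGETS):
    """Return {period: (percent_used, over_budget)} for every budgeted period.

    `spend` maps a period name to the amount spent so far in that period
    (hypothetical input shape); missing periods count as zero spend.
    """
    report = {}
    for period, budget in budgets.items():
        spent = spend.get(period, 0.0)
        percent_used = 100.0 * spent / budget
        report[period] = (round(percent_used, 1), spent > budget)
    return report


usage = budget_usage({"daily": 1200.0, "weekly": 5600.0, "monthly": 15000.0})
for period, (percent, over) in usage.items():
    print(f"{period}: {percent}% of budget used, over budget: {over}")
```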

Advanced Configuration

monitoring:
  # Custom alert rules
  alert_rules:
    - name: "cost_spike"
      condition: "daily_cost > 1.5 * avg_daily_cost"
      severity: "high"
      channels: ["email", "slack"]

    - name: "underutilized_resource"
      condition: "cpu_utilization < 10 and cost_per_hour > 1.0"
      severity: "medium"
      channels: ["email"]

    - name: "security_violation"
      condition: "public_resource_detected or unencrypted_storage"
      severity: "critical"
      channels: ["email", "slack", "pagerduty"]

  # Notification channels
  notifications:
    email:
      enabled: true
      recipients: ["finops@company.com", "devops@company.com"]
      smtp_server: "smtp.company.com"
      smtp_port: 587

    slack:
      enabled: true
      webhook_url: "https://hooks.slack.com/services/..."
      channel: "#finops-alerts"

    pagerduty:
      enabled: true
      service_key: "your-service-key"
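
The rule conditions above are small boolean expressions over metric names. How FinOps Optimizer evaluates them internally is not documented here; the sketch below shows one possible approach, assuming conditions use only numbers, metric names, arithmetic, comparisons, and `and`/`or`. It walks the parsed expression with Python's `ast` module instead of calling `eval`, so arbitrary code in a condition string is rejected. The function name `evaluate_condition` is hypothetical.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_CMP_OPS = {
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
}


def evaluate_condition(condition, metrics):
    """Evaluate a rule condition against a dict of metric values."""
    tree = ast.parse(condition, mode="eval")
    return bool(_eval(tree.body, metrics))


def _eval(node, metrics):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, bool)):
        return node.value
    if isinstance(node, ast.Name):
        return metrics[node.id]  # raises KeyError for an unknown metric name
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left, metrics), _eval(node.right, metrics))
    if isinstance(node, ast.BoolOp):
        values = [_eval(value, metrics) for value in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        left = _eval(node.left, metrics)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _CMP_OPS:
                raise ValueError(f"unsupported comparison: {type(op).__name__}")
            right = _eval(comparator, metrics)
            if not _CMP_OPS[type(op)](left, right):
                return False
            left = right
        return True
    # Anything else (function calls, attribute access, ...) is rejected.
    raise ValueError(f"unsupported expression: {type(node).__name__}")


print(evaluate_condition("daily_cost > 1.5 * avg_daily_cost",
                         {"daily_cost": 160.0, "avg_daily_cost": 100.0}))
```

Restricting the grammar this way keeps rule files safe to load from configuration that several people can edit.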

Usage

Starting Monitoring

from finops import FinOpsOptimizer

optimizer = FinOpsOptimizer(config_path="config.yaml")

# Start monitoring
optimizer.start_monitoring()

# Check status
status = optimizer.get_monitoring_status()
print(f"Monitoring active: {status.active}")
print(f"Last check: {status.last_check}")
print(f"Active alerts: {len(status.alerts)}")

Monitoring Status

# Get detailed monitoring status
status = optimizer.get_monitoring_status()

print("=== Monitoring Status ===")
print(f"Active: {status.active}")
print(f"Last check: {status.last_check}")
print(f"Next check: {status.next_check}")
print(f"Metrics collected: {status.metrics_count}")
print(f"Active alerts: {len(status.alerts)}")

# Check specific metrics
cost_metrics = optimizer.get_cost_metrics(days=7)
print(f"Average daily cost: ${cost_metrics.average_daily_cost:.2f}")
print(f"Cost trend: {cost_metrics.trend}")

utilization_metrics = optimizer.get_utilization_metrics()
print(f"Average CPU utilization: {utilization_metrics.avg_cpu:.1f}%")
print(f"Average memory utilization: {utilization_metrics.avg_memory:.1f}%")
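
The average daily cost and trend reported above could be derived in many ways, and this page does not state which method FinOps Optimizer uses. As a rough illustration only, the sketch below computes the mean of a series of daily costs and labels the trend by comparing the earlier half of the window with the more recent half. The function name, the 5% tolerance, and the trend labels are all assumptions.

```python
from statistics import mean


def summarize_costs(daily_costs, tolerance=0.05):
    """Return (average_daily_cost, trend) for a list of daily costs, oldest first.

    The trend is "increasing" or "decreasing" when the recent half of the
    window differs from the earlier half by more than `tolerance`
    (a fraction, 0.05 = 5%), and "stable" otherwise.
    """
    if len(daily_costs) < 2:
        raise ValueError("need at least two days of data")
    midpoint = len(daily_costs) // 2
    earlier = mean(daily_costs[:midpoint])
    recent = mean(daily_costs[midpoint:])
    change = (recent - earlier) / earlier if earlier else 0.0
    if change > tolerance:
        trend = "increasing"
    elif change < -tolerance:
        trend = "decreasing"
    else:
        trend = "stable"
    return mean(daily_costs), trend


average, trend = summarize_costs([100, 100, 100, 130, 130, 130, 130])
print(f"Average daily cost: ${average:.2f}, trend: {trend}")
```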

Alert Management

# Get active alerts
alerts = optimizer.get_active_alerts()

for alert in alerts:
    print(f"Alert: {alert.name}")
    print(f"Severity: {alert.severity}")
    print(f"Message: {alert.message}")
    print(f"Created: {alert.created_at}")
    print(f"Status: {alert.status}")
    print("---")

# Acknowledge an alert ("alert-123" is a placeholder ID)
optimizer.acknowledge_alert(alert_id="alert-123")

# Resolve the alert and record the reason
optimizer.resolve_alert(
    alert_id="alert-123",
    resolution="Cost spike was due to legitimate traffic increase",
)
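
When many alerts are active, it can help to work through them in order of severity. The sketch below sorts alerts so that the most severe and oldest come first. It uses a stand-in `Alert` dataclass, not the library's own alert type, and the ranking (including a `low` level, which does not appear elsewhere on this page) is an assumption.

```python
from dataclasses import dataclass

# Lower rank = handled first. "low" is an assumed extra level.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Alert:
    """Stand-in for an alert object; only the fields needed for sorting."""
    name: str
    severity: str
    created_at: str  # ISO 8601 timestamp, so string order matches time order


def triage_order(alerts):
    """Return alerts sorted by severity, then by age (oldest first)."""
    unknown_rank = len(SEVERITY_RANK)  # unrecognized severities sort last
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_RANK.get(a.severity, unknown_rank), a.created_at),
    )


alerts = [
    Alert("underutilized_resource", "medium", "2024-01-01T08:00:00"),
    Alert("security_violation", "critical", "2024-01-01T09:30:00"),
    Alert("cost_spike", "high", "2024-01-01T09:00:00"),
]
for alert in triage_order(alerts):
    print(alert.severity, alert.name)
```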

Custom Alert Rules

# Create custom alert rule
custom_rule = {
    "name": "custom_cost_threshold",
    "condition": "daily_cost > 2000.0",
    "severity": "high",
    "channels": ["email"],
    "description": "Daily cost exceeded $2000 threshold"
}

optimizer.add_alert_rule(custom_rule)

# List all alert rules
rules = optimizer.get_alert_rules()
for rule in rules:
    print(f"Rule: {rule.name}")
    print(f"Condition: {rule.condition}")
    print(f"Severity: {rule.severity}")
    print("---")
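
Before registering a rule, it may be worth checking the dictionary for obvious mistakes. The sketch below is one way to do that, assuming that the severities and channels shown on this page are the only valid ones and that `description` is optional. Whether `add_alert_rule` performs its own validation is not stated here, and `validate_rule` is a hypothetical helper.

```python
REQUIRED_KEYS = {"name", "condition", "severity", "channels"}
# Assumed to be the full set of valid values, based on the examples on this page.
ALLOWED_SEVERITIES = {"medium", "high", "critical"}
ALLOWED_CHANNELS = {"email", "slack", "pagerduty"}


def validate_rule(rule):
    """Return a list of problems found in an alert rule dict (empty if none)."""
    errors = []
    missing = REQUIRED_KEYS - rule.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if "severity" in rule and rule["severity"] not in ALLOWED_SEVERITIES:
        errors.append(f"unknown severity: {rule['severity']!r}")
    unknown_channels = set(rule.get("channels", [])) - ALLOWED_CHANNELS
    if unknown_channels:
        errors.append(f"unknown channels: {sorted(unknown_channels)}")
    if "condition" in rule and not str(rule["condition"]).strip():
        errors.append("condition must not be empty")
    return errors


rule = {
    "name": "custom_cost_threshold",
    "condition": "daily_cost > 2000.0",
    "severity": "high",
    "channels": ["email"],
}
print(validate_rule(rule) or "rule looks valid")
```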

Metrics and Dashboards

Available Metrics

Cost metrics: daily, weekly, and monthly costs; cost trends and forecasts; savings achieved; cost by service, region, and tag
Performance metrics: CPU, memory, and storage utilization; network usage
Optimization metrics: recommendations generated and applied; savings realized; optimization success rate
Security metrics: public resources detected; unencrypted storage; IAM permission issues; compliance violations
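
A breakdown such as cost by service, region, or tag amounts to grouping cost records by one dimension and summing. The sketch below shows the idea on plain dictionaries; the record layout and the `cost_by` helper are assumptions for illustration and do not reflect how FinOps Optimizer stores its data.

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Sum the "cost" field of each record, grouped by the given dimension.

    Records that lack the dimension are grouped under "unknown".
    """
    totals = defaultdict(float)
    for record in records:
        totals[record.get(dimension, "unknown")] += record["cost"]
    return dict(totals)


records = [
    {"service": "compute", "region": "us-east-1", "cost": 10.0},
    {"service": "storage", "region": "us-east-1", "cost": 2.5},
    {"service": "compute", "region": "eu-west-1", "cost": 5.5},
]
print(cost_by(records, "service"))
print(cost_by(records, "region"))
```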

Dashboard Access

# Generate monitoring dashboard
dashboard = optimizer.generate_monitoring_dashboard(
    time_range="last_30_days",
    include_metrics=["cost", "performance", "optimization", "security"]
)

print(f"Dashboard URL: {dashboard.url}")
print(f"Dashboard file: {dashboard.file_path}")

Custom Dashboards

# Create custom dashboard
custom_dashboard = {
    "name": "Executive Summary",
    "metrics": [
        {
            "name": "Total Monthly Cost",
            "type": "cost",
            "aggregation": "sum",
            "period": "month"
        },
        {
            "name": "Savings Achieved",
            "type": "optimization",
            "aggregation": "sum",
            "period": "month"
        },
        {
            "name": "Resource Utilization",
            "type": "performance",
            "aggregation": "average",
            "period": "day"
        }
    ],
    "layout": "grid",
    "refresh_interval": 300
}

dashboard = optimizer.create_custom_dashboard(custom_dashboard)

Integration with External Systems

Prometheus Integration

# Export metrics to Prometheus
optimizer.export_metrics_to_prometheus(
    endpoint="http://prometheus:9090",
    job_name="finops-optimizer"
)

Grafana Integration

# Create Grafana dashboard
grafana_dashboard = optimizer.create_grafana_dashboard(
    grafana_url="http://grafana:3000",
    api_key="your-grafana-api-key"
)

Slack Integration

# Send custom notification to Slack
optimizer.send_slack_notification(
    message="Cost optimization completed",
    channel="#finops",
    attachments=[
        {
            "title": "Monthly Savings",
            "value": "$1,234.56",
            "color": "good"
        }
    ]
)
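
If you need to send a similar message without the library, a Slack incoming webhook accepts a JSON body via HTTP POST. The sketch below builds such a body with the standard library, using Slack's legacy attachment fields. The helper names are hypothetical, and how `send_slack_notification` maps its `attachments` argument onto Slack's payload is an assumption.

```python
import json
import urllib.request


def build_slack_payload(message, fields, color="good"):
    """Build an incoming-webhook payload with one attachment.

    `fields` is a list of (title, value) pairs shown side by side.
    """
    return {
        "text": message,
        "attachments": [
            {
                "color": color,
                "fields": [
                    {"title": title, "value": value, "short": True}
                    for title, value in fields
                ],
            }
        ],
    }


def post_to_webhook(webhook_url, payload, timeout=10):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_slack_payload(
    "Cost optimization completed", [("Monthly Savings", "$1,234.56")]
)
print(json.dumps(payload, indent=2))
# post_to_webhook(your_webhook_url, payload)  # requires a real webhook URL
```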

Best Practices

1. Set Appropriate Thresholds

Start with conservative thresholds
Adjust based on historical data
Consider business context
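
One common way to base a threshold on historical data is to set it a few standard deviations above the historical mean. The sketch below does this for daily costs. The choice of three standard deviations and the function name are assumptions, so treat the result as a starting point to adjust for business context.

```python
from statistics import mean, stdev


def suggest_threshold(daily_costs, sigmas=3.0):
    """Suggest an alert threshold: mean + `sigmas` * sample standard deviation."""
    if len(daily_costs) < 2:
        raise ValueError("need at least two days of history")
    return mean(daily_costs) + sigmas * stdev(daily_costs)


history = [900, 1000, 1100]
print(f"Suggested daily cost threshold: ${suggest_threshold(history):.2f}")
```

A larger `sigmas` value gives a more conservative threshold with fewer alerts; a smaller one catches smaller deviations at the cost of more false positives.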

2. Use Multiple Channels

Email for important alerts
Slack for team notifications
PagerDuty for critical issues

3. Regular Review

Review alert effectiveness weekly
Adjust rules based on false positives
Archive resolved alerts

4. Performance Optimization

Use appropriate check intervals
Implement metric retention policies
Monitor the performance of the monitoring system itself
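
A retention policy can be as simple as dropping samples older than a cutoff. The sketch below prunes an in-memory list of timestamped samples using a 30-day window, matching the `metrics_retention_days: 30` setting shown earlier. The storage layout and the `prune_metrics` helper are assumptions; the library's actual retention mechanism is not described on this page.

```python
from datetime import datetime, timedelta, timezone


def prune_metrics(samples, retention_days=30, now=None):
    """Keep only (timestamp, value) samples within the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [(ts, value) for ts, value in samples if ts >= cutoff]


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
samples = [
    (datetime(2024, 1, 15, tzinfo=timezone.utc), 950.0),  # older than 30 days
    (datetime(2024, 2, 20, tzinfo=timezone.utc), 1010.0),
]
print(prune_metrics(samples, retention_days=30, now=now))
```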

5. Security Considerations

Secure notification channels
Implement alert authentication
Audit alert access

Troubleshooting

Common Issues

High alert volume: adjust thresholds, implement alert grouping, and use alert suppression
Missing metrics: check API permissions, verify data sources, and review collection intervals
Performance issues: reduce check frequency, implement caching, and use async processing
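
As an example of alert suppression for the high alert volume case, the sketch below sends at most one notification per rule and resource within a cooldown window. The `AlertSuppressor` class and the 30-minute default are hypothetical; whether FinOps Optimizer has a built-in suppression setting is not covered on this page.

```python
from datetime import datetime, timedelta


class AlertSuppressor:
    """Allow one notification per (rule, resource) pair per cooldown window."""

    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self._last_sent = {}  # (rule_name, resource_id) -> time of last notification

    def should_notify(self, rule_name, resource_id, now):
        key = (rule_name, resource_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window: suppress
        self._last_sent[key] = now
        return True


suppressor = AlertSuppressor()
start = datetime(2024, 1, 1, 9, 0)
for minutes in (0, 10, 31):
    moment = start + timedelta(minutes=minutes)
    print(minutes, suppressor.should_notify("cost_spike", "vm-1", moment))
```

Suppressed alerts do not restart the window here, so a persistent condition still produces a reminder once per cooldown period.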

Getting Help

Check the Troubleshooting Guide
Review the monitoring logs
Consult the API Reference
Open an issue on GitHub

For more information, see the Configuration Guide and Security Best Practices.