Chapter 8: Monitoring, Observability & Alerting¶
Seeing Everything – The Foundation for AI Agents
Part of: The DevOps Engineer's Guide to Effective AI Usage
Table of Contents¶
- Executive Summary – Why Monitoring Matters for AI
- Part 1: Monitoring vs. Observability – Understanding the Difference
- Part 2: Monitoring Architecture – What to Monitor and How
- Part 3: Alerting Strategy – When to Alert and Who to Notify
- Part 4: Dashboards & Visualization – Making Data Actionable
- Part 5: AI Agent Monitoring – Special Considerations for Chapter 10
- Part 6: VSCode Integration for Monitoring Workflows
- Part 7: Iteration Points – Your Feedback Needed
- Appendix: Monitoring Templates & Configurations
1. Executive Summary – Why Monitoring Matters for AI ¶
The Hard Truth About Monitoring¶
┌─────────────────────────────────────────────────────────────┐
│ WHY MONITORING MATTERS FOR AI AGENTS │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Without Monitoring] │
│ • You can't see what's broken │
│ • AI Agents operate in the dark │
│ • Incidents detected by customers │
│ • No data for AI Agents to learn from │
│ • No audit trail for compliance │
│ │
│ [With Monitoring] │
│ • You see problems before customers do │
│ • AI Agents have data to make decisions │
│ • Incidents detected and resolved quickly │
│ • AI Agents learn from historical data │
│ • Full audit trail for compliance │
│ │
│ [Key Insight] │
│ Chapters 3-7 built the structure and guardrails │
│ Chapter 8 provides the visibility │
│ Chapter 10 AI Agents need this visibility to operate │
│ │
└─────────────────────────────────────────────────────────────┘
Why This Chapter Exists¶
Chapter 3 taught you: Structured IaC (InfraCtl)
Chapter 4 taught you: Structured Deployment (Ansible)
Chapter 5 taught you: Structured CI/CD (Pipelines + Runners)
Chapter 6 taught you: Production Deployment & Release Management
Chapter 7 taught you: Governance, Safety & Compliance
Chapter 8 teaches you: Monitoring, Observability & Alerting – the visibility that makes Chapters 3-7 (and eventually Chapter 10 AI Agents) observable and accountable
Chapter 10 will teach you: AI Agents that USE this monitoring data to make decisions
The Core Thesis¶
"You can't automate what you can't observe. This chapter provides the monitoring, observability, and alerting foundation that Chapters 3-7 operate within, and that Chapter 10 AI Agents need to make informed decisions."
What You'll Learn¶
| Section | What You'll Gain | Why It Matters |
|---|---|---|
| Part 1: Monitoring vs. Observability | Understand the difference | Choose the right tools |
| Part 2: Monitoring Architecture | What to monitor and how | Comprehensive visibility |
| Part 3: Alerting Strategy | When to alert and who | Avoid alert fatigue |
| Part 4: Dashboards | Make data actionable | Quick decision-making |
| Part 5: AI Agent Monitoring | Special considerations | Chapter 10 preparation |
| Part 6: VSCode Integration | Integrate monitoring into workflows | Daily productivity |
2. Part 1: Monitoring vs. Observability – Understanding the Difference ¶
2.1 The Key Distinction¶
┌─────────────────────────────────────────────────────────────┐
│ MONITORING vs. OBSERVABILITY │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Monitoring] │
│ • WHAT: Known unknowns │
│ • Question: "Is the system working?" │
│ • Approach: Pre-defined metrics and alerts │
│ • Example: CPU > 80% → alert │
│ • Best for: Known failure modes │
│ │
│ [Observability] │
│ • WHAT: Unknown unknowns │
│ • Question: "Why is the system broken?" │
│ • Approach: Logs, metrics, traces (three pillars) │
│ • Example: Query any metric, correlate across services │
│ • Best for: Complex, distributed systems │
│ │
│ [The Relationship] │
│ Monitoring is a subset of observability │
│ You need both for production readiness │
│ AI Agents need observability to make good decisions │
│ │
└─────────────────────────────────────────────────────────────┘
2.2 The Three Pillars of Observability¶
┌─────────────────────────────────────────────────────────────┐
│ THREE PILLARS OF OBSERVABILITY │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Pillar 1: Metrics] │
│ • WHAT: Numerical measurements over time │
│ • Examples: CPU usage, memory, request rate, error rate │
│ • Tools: Prometheus, Datadog, CloudWatch │
│ • AI Agent Use: Decision thresholds, anomaly detection │
│ │
│ [Pillar 2: Logs] │
│ • WHAT: Timestamped records of events │
│ • Examples: Application logs, access logs, audit logs │
│ • Tools: ELK Stack, Splunk, CloudWatch Logs │
│ • AI Agent Use: Root cause analysis, pattern detection │
│ │
│ [Pillar 3: Traces] │
│ • WHAT: Request flow across services │
│ • Examples: Distributed traces, span data │
│ • Tools: Jaeger, Zipkin, AWS X-Ray │
│ • AI Agent Use: Service dependency mapping, latency analysis│
│ │
│ [You Need All Three] │
│ Metrics: Tell you WHAT is happening │
│ Logs: Tell you WHY it's happening │
│ Traces: Tell you WHERE it's happening │
│ │
└─────────────────────────────────────────────────────────────┘
2.3 Monitoring Maturity Levels¶
┌─────────────────────────────────────────────────────────────┐
│ MONITORING MATURITY LEVELS │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Level 1: Reactive] │
│ • Monitor: Nothing until something breaks │
│ • Alert: Customers report issues │
│ • Response: Firefighting │
│ • AI Agent Role: Not ready for AI Agents │
│ │
│ [Level 2: Proactive] │
│ • Monitor: Key metrics (CPU, memory, disk) │
│ • Alert: Threshold-based alerts │
│ • Response: On-call responds to alerts │
│ • AI Agent Role: Basic monitoring, human decides │
│ │
│ [Level 3: Predictive] │
│ • Monitor: Business metrics + technical metrics │
│ • Alert: Anomaly detection, trend analysis │
│ • Response: Prevent issues before they happen │
│ • AI Agent Role: AI can recommend based on trends │
│ │
│ [Level 4: Autonomous] │
│ • Monitor: Full observability (metrics, logs, traces) │
│ • Alert: AI-driven alerting, smart correlation │
│ • Response: AI Agents auto-remediate low-risk issues │
│ • AI Agent Role: Chapter 10 ready │
│ │
│ [Recommendation] │
│ Aim for Level 3 before implementing AI Agents (Level 4) │
│ │
└─────────────────────────────────────────────────────────────┘
2.4 Monitoring Requirements by Environment¶
| Environment | Monitoring Level | Alerting | Retention | AI Agent Access |
|---|---|---|---|---|
| Development | Basic metrics | Email only | 30 days | Full access |
| Staging | Enhanced metrics + logs | Slack + email | 90 days | Full access |
| Production | Full observability (3 pillars) | PagerDuty + Slack + email | 7 years | Read-only, write with approval |
3. Part 2: Monitoring Architecture – What to Monitor and How ¶
3.1 Monitoring Layers¶
┌─────────────────────────────────────────────────────────────┐
│ MONITORING LAYERS │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Layer 1: Infrastructure] │
│ • CPU, memory, disk, network │
│ • VM/container health │
│ • Load balancer health │
│ • Database connections │
│ • Tools: Prometheus, CloudWatch, Datadog │
│ │
│ [Layer 2: Application] │
│ • Request rate, error rate, latency │
│ • Business metrics (signups, purchases) │
│ • Application logs │
│ • Distributed traces │
│ • Tools: New Relic, AppDynamics, custom metrics │
│ │
│ [Layer 3: Pipeline] │
│ • CI/CD pipeline status │
│ • Deployment frequency │
│ • Deployment success rate │
│ • Rollback frequency │
│ • Tools: GitHub Actions metrics, custom dashboards │
│ │
│ [Layer 4: AI Agent] (Chapter 10) │
│ • AI Agent decision rate │
│ • AI Agent confidence scores │
│ • AI Agent escalation rate │
│ • AI Agent accuracy │
│ • Tools: Custom AI Agent monitoring (Section 6) │
│ │
└─────────────────────────────────────────────────────────────┘
3.2 Key Metrics to Track¶
┌─────────────────────────────────────────────────────────────┐
│ KEY METRICS BY LAYER │
├─────────────────────────────────────────────────────────────┤
│ │
│  [Infrastructure Metrics (USE Method)]                      │
│  • Utilization: CPU, memory, disk usage (%)                 │
│  • Saturation: Load average, swap, I/O queue depth          │
│  • Errors: OS, network, and hardware error counts           │
│  (RED – Rate, Errors, Duration – is the service-side view)  │
│ │
│ [Application Metrics (Four Golden Signals)] │
│ • Latency: Time to serve requests │
│ • Traffic: Demand on system │
│ • Errors: Rate of failed requests │
│ • Saturation: How "full" the service is │
│ │
│ [Pipeline Metrics (DORA Metrics)] │
│ • Deployment Frequency: How often you deploy │
│ • Lead Time for Changes: Commit to deploy │
│ • Change Failure Rate: % of deployments causing issues │
│ • Mean Time to Recovery: Time to fix incidents │
│ │
│ [AI Agent Metrics] (Chapter 10) │
│ • Decision Accuracy: % of correct decisions │
│ • Confidence Score: AI confidence in decisions │
│ • Escalation Rate: % escalated to humans │
│ • Auto-Remediation Success: % successful auto-fixes │
│ │
└─────────────────────────────────────────────────────────────┘
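The DORA metrics above can be computed directly from a deployment log. A minimal sketch, assuming an illustrative record shape (the field names `committed_at`, `deployed_at`, `failed`, and `recovered_at` are not a standard schema):

```python
from datetime import datetime

def dora_metrics(deployments):
    """Compute the four DORA metrics from a deployment log.

    Each record is a dict with illustrative keys (assumptions):
    'committed_at' / 'deployed_at' (datetime), 'failed' (bool),
    and 'recovered_at' (datetime or None).
    """
    n = len(deployments)
    # Observation window: days between the first and last deployment.
    span_days = (max(d["deployed_at"] for d in deployments)
                 - min(d["deployed_at"] for d in deployments)).days or 1
    lead_hours = [(d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d["failed"]]
    recovery_min = [(d["recovered_at"] - d["deployed_at"]).total_seconds() / 60
                    for d in failures if d["recovered_at"]]
    return {
        "deployment_frequency_per_day": n / span_days,
        "lead_time_hours_avg": sum(lead_hours) / n,
        "change_failure_rate": len(failures) / n,
        "mttr_minutes_avg": sum(recovery_min) / len(recovery_min) if recovery_min else 0.0,
    }
```

In practice these records would come from your CI/CD system's API; the point is that all four metrics are simple aggregations once deployments are logged consistently.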
3.3 Monitoring Configuration Template¶
File: monitoring/config/prometheus-rules.yml
# Prometheus Monitoring Rules
groups:
- name: infrastructure
interval: 30s
rules:
- alert: HighCPUUsage
expr: avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for 5 minutes"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 90% for 5 minutes"
- alert: HighDiskUsage
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: "High disk usage detected"
description: "Disk usage is above 85% for 10 minutes"
- name: application
interval: 30s
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is above 5% for 5 minutes"
- alert: HighLatency
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P99 latency is above 1 second for 5 minutes"
- name: pipeline
interval: 60s
rules:
- alert: PipelineFailure
expr: ci_pipeline_status{status="failed"} == 1
for: 0m
labels:
severity: warning
annotations:
summary: "CI/CD pipeline failed"
description: "Pipeline {{ $labels.pipeline }} failed"
- alert: HighRollbackRate
expr: sum(rate(deployment_rollback_total[1h])) / sum(rate(deployment_total[1h])) > 0.1
for: 1h
labels:
severity: warning
annotations:
summary: "High rollback rate detected"
description: "Rollback rate is above 10% in the last hour"
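Before shipping rules like these, it helps to unit-test the threshold arithmetic they encode. The two headline expressions reduce to simple ratios; a sketch that mirrors the rules (this is plain Python, not a Prometheus API):

```python
def high_error_rate(errors_5xx_per_s, total_per_s, threshold=0.05):
    """Mirrors the HighErrorRate rule: rate(5xx) / rate(all) > 5%."""
    return total_per_s > 0 and errors_5xx_per_s / total_per_s > threshold

def cpu_busy_fraction(idle_fraction):
    """Mirrors HighCPUUsage: the rule alerts when idle < 0.2,
    i.e. when busy = 1 - idle exceeds 80%."""
    return 1.0 - idle_fraction
```

The YAML itself can be syntax-checked with `promtool check rules monitoring/config/prometheus-rules.yml` before it reaches production.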
3.4 Log Aggregation Configuration¶
File: monitoring/config/fluentd-config.yml
# Fluentd Log Aggregation Configuration
<system>
log_level info
</system>
<source>
@type tail
path /var/log/application/*.log
pos_file /var/log/fluentd/application.log.pos
tag application.*
<parse>
@type json
</parse>
</source>
<source>
@type tail
path /var/log/audit/*.log
pos_file /var/log/fluentd/audit.log.pos
tag audit.*
<parse>
@type json
</parse>
</source>
<match application.**>
@type elasticsearch
host elasticsearch.monitoring.svc
port 9200
index_name application-logs
<buffer>
@type file
path /var/log/fluentd/buffer/application
flush_interval 5s
</buffer>
</match>
<match audit.**>
@type elasticsearch
host elasticsearch.monitoring.svc
port 9200
index_name audit-logs
<buffer>
@type file
path /var/log/fluentd/buffer/audit
flush_interval 5s
</buffer>
</match>
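The tail sources above use the `@type json` parser, which expects one JSON object per line. An application can satisfy that contract with a small stdlib formatter; a sketch, where the field names (`timestamp`, `level`, `logger`, `message`) are illustrative assumptions:

```python
import json
import logging
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line,
    matching Fluentd's '@type json' parser."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonLineFormatter())
log = logging.getLogger("application")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("user signed up")  # emits a single JSON line
```

Structured logs pay off twice: Fluentd parses them without fragile regexes, and downstream queries in Elasticsearch can filter on fields instead of grepping strings.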
3.5 Distributed Tracing Configuration¶
File: monitoring/config/jaeger-config.yml
# Jaeger Distributed Tracing Configuration
service_name: my-application
sampler:
type: probabilistic
param: 0.1 # Sample 10% of traces
reporter:
log_spans: true
local_agent:
reporting_host: jaeger.monitoring.svc
reporting_port: 6831
tags:
environment: production
version: ${APP_VERSION}
service: ${SERVICE_NAME}
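The probabilistic sampler keeps each trace with probability `param`. The head-sampling decision it makes per trace amounts to the following (a sketch of the idea, not Jaeger's actual client code):

```python
import random

def should_sample(param=0.1, rng=random.random):
    """Head-based probabilistic sampling: keep a trace with
    probability `param`. The decision is made once at the root
    span and propagated to all downstream spans."""
    return rng() < param

# With param=0.1, roughly 10% of traces are kept.
random.seed(7)
kept = sum(should_sample(0.1) for _ in range(10_000))
```

10% is a common starting point: it bounds storage cost while still surfacing latency patterns. High-traffic services often sample less; low-traffic or debugging scenarios sample more.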
4. Part 3: Alerting Strategy – When to Alert and Who to Notify ¶
4.1 Alert Severity Levels¶
┌─────────────────────────────────────────────────────────────┐
│ ALERT SEVERITY LEVELS │
├─────────────────────────────────────────────────────────────┤
│ │
│ [P1: Critical] │
│ • Impact: Production down, customers affected │
│ • Response Time: <15 minutes │
│ • Notification: PagerDuty + Slack + Phone │
│ • On-Call: Primary + Secondary │
│ • Examples: Complete outage, security breach, data loss │
│ │
│ [P2: High] │
│ • Impact: Major functionality impaired │
│ • Response Time: <30 minutes │
│ • Notification: PagerDuty + Slack │
│ • On-Call: Primary │
│ • Examples: Partial outage, performance degradation │
│ │
│ [P3: Medium] │
│ • Impact: Minor functionality impaired │
│ • Response Time: <2 hours │
│ • Notification: Slack │
│ • On-Call: Primary (during business hours) │
│ • Examples: Non-critical bug, UI issues │
│ │
│ [P4: Low] │
│ • Impact: Minimal, workaround available │
│ • Response Time: <24 hours │
│ • Notification: Email │
│ • On-Call: No on-call, ticket created │
│ • Examples: Cosmetic issues, documentation gaps │
│ │
└─────────────────────────────────────────────────────────────┘
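Keeping the severity matrix in code (or versioned config) prevents drift between what the table says and what actually pages someone. A minimal sketch; the channel names and on-call labels are assumptions matching the table above:

```python
SEVERITY_POLICY = {
    "P1": {"response_minutes": 15,   "channels": ["pagerduty", "slack", "phone"],
           "on_call": "primary+secondary"},
    "P2": {"response_minutes": 30,   "channels": ["pagerduty", "slack"],
           "on_call": "primary"},
    "P3": {"response_minutes": 120,  "channels": ["slack"],
           "on_call": "primary (business hours)"},
    "P4": {"response_minutes": 1440, "channels": ["email"],
           "on_call": "none (ticket created)"},
}

def notify_channels(severity):
    """Return the notification channels for a severity level (P1-P4)."""
    return SEVERITY_POLICY[severity]["channels"]
```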
4.2 Alert Routing Configuration¶
File: monitoring/config/alertmanager-routes.yml
# Alertmanager Routing Configuration
route:
receiver: default
group_by: ['alertname', 'severity', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
  # Child routes are evaluated top-down; the first match wins unless
  # `continue: true` is set. Team routes come before severity routes
  # so they are not shadowed by them.
  routes:
    - match:
        team: security
      receiver: slack-security
      continue: true
    - match:
        team: infrastructure
      receiver: slack-infra
      continue: true
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    - match:
        severity: critical
      receiver: slack-critical
    - match:
        severity: warning
      receiver: slack-warning
    - match:
        severity: info
      receiver: email-info
receivers:
- name: default
email_configs:
- to: team@example.com
- name: pagerduty-critical
pagerduty_configs:
- service_key: ${PAGERDUTY_SERVICE_KEY}
severity: critical
- name: slack-critical
slack_configs:
- api_url: ${SLACK_WEBHOOK_CRITICAL}
channel: '#incidents-critical'
title: '🚨 CRITICAL ALERT'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: slack-warning
slack_configs:
- api_url: ${SLACK_WEBHOOK_WARNING}
channel: '#incidents-warning'
title: '⚠️ WARNING ALERT'
- name: email-info
email_configs:
- to: team@example.com
send_resolved: true
- name: slack-security
slack_configs:
- api_url: ${SLACK_WEBHOOK_SECURITY}
channel: '#security-alerts'
title: '🔒 SECURITY ALERT'
- name: slack-infra
slack_configs:
- api_url: ${SLACK_WEBHOOK_INFRA}
channel: '#infrastructure-alerts'
title: '🖥️ INFRASTRUCTURE ALERT'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'service']
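Alertmanager's routing semantics are easy to get subtly wrong: child routes are evaluated in order, and evaluation stops at the first match unless that route sets `continue: true`. The core behavior can be sketched in a few lines (the route shape here is illustrative, not Alertmanager's internal representation):

```python
def route_alert(labels, routes, default="default"):
    """First-match routing with Alertmanager-style `continue` semantics.

    Each route is a dict (illustrative shape): {'match': {label: value},
    'receiver': str, 'continue': bool}. Evaluation stops at the first
    matching route unless that route sets continue=True.
    """
    receivers = []
    for r in routes:
        if all(labels.get(k) == v for k, v in r["match"].items()):
            receivers.append(r["receiver"])
            if not r.get("continue", False):
                return receivers
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-critical"},
    {"match": {"team": "security"}, "receiver": "slack-security"},
]
```

Note that with this ordering a critical security alert never reaches `slack-security`, because the severity routes match first and stop evaluation. That is why team-scoped routes belong before severity routes (or need `continue: true` on the routes above them).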
4.3 Alert Fatigue Prevention¶
┌─────────────────────────────────────────────────────────────┐
│ ALERT FATIGUE PREVENTION │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Problem] │
│ • Too many alerts │
│ • Team ignores alerts │
│ • Real incidents missed │
│ • On-call burnout │
│ │
│ [Solutions] │
│ • Alert on symptoms, not causes │
│ • Use multi-condition alerts │
│ • Implement alert deduplication │
│ • Regular alert review (monthly) │
│ • Auto-resolve stale alerts │
│ • Require runbook for every alert │
│ │
│ [Alert Quality Checklist] │
│ □ Is this alert actionable? │
│ □ Does it have a runbook? │
│ □ Is the threshold appropriate? │
│ □ Is the severity correct? │
│ □ Is the right team notified? │
│ □ Has this alert fired in the last 30 days? │
│  □ If it hasn't fired in 30 days, should it be removed?     │
│ │
│ [Monthly Alert Review] │
│ • Review all alerts that fired │
│ • Remove alerts that never fire │
│ • Adjust thresholds based on data │
│ • Update runbooks │
│ • Document lessons learned │
│ │
└─────────────────────────────────────────────────────────────┘
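The "has this fired in the last 30 days?" check from the checklist is easy to automate against alert history; a sketch, assuming a simple name-to-last-fired mapping:

```python
from datetime import datetime, timedelta

def stale_alerts(last_fired, now, max_age_days=30):
    """Return alert names that have not fired within `max_age_days`.

    `last_fired` maps alert name -> datetime of the last firing
    (None means the alert has never fired). Candidates returned here
    feed the monthly alert review; they are flagged, not auto-deleted."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, fired in last_fired.items()
                  if fired is None or fired < cutoff)
```

Running this monthly against your alerting history (e.g., from the Alertmanager API) turns the review from a judgment call into a short, data-driven agenda.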
4.4 Alert Runbook Template¶
# Alert Runbook Template
## Alert Name: [Alert Name]
## Severity: [P1/P2/P3/P4]
## Description:
[What this alert means]
## Trigger Conditions:
[When this alert fires]
## Impact:
[What is affected when this alert fires]
## Immediate Actions:
1. [Step 1]
2. [Step 2]
3. [Step 3]
## Investigation:
1. [Check metric X]
2. [Check log Y]
3. [Check trace Z]
## Resolution:
1. [Fix step 1]
2. [Fix step 2]
3. [Verify fix]
## Rollback:
[If fix makes things worse, how to rollback]
## Escalation:
- If not resolved in 30 minutes: Escalate to [role]
- If not resolved in 1 hour: Escalate to [role]
## Related Alerts:
- [Related alert 1]
- [Related alert 2]
## Related Runbooks:
- [Related runbook 1]
- [Related runbook 2]
## Last Updated: [DATE]
## Owner: [NAME/ROLE]
5. Part 4: Dashboards & Visualization – Making Data Actionable ¶
5.1 Dashboard Types¶
┌─────────────────────────────────────────────────────────────┐
│ DASHBOARD TYPES │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Executive Dashboard] │
│ • Audience: Leadership, non-technical │
│ • Metrics: Business KPIs, uptime, incidents │
│ • Refresh: Hourly │
│ • Example: System health, customer impact │
│ │
│ [Operations Dashboard] │
│ • Audience: On-call, operations team │
│ • Metrics: All technical metrics, alerts │
│ • Refresh: Real-time │
│ • Example: Service health, active incidents │
│ │
│ [Development Dashboard] │
│ • Audience: Developers │
│ • Metrics: Deployment metrics, test results │
│ • Refresh: Real-time │
│ • Example: Pipeline status, code coverage │
│ │
│ [AI Agent Dashboard] (Chapter 10) │
│ • Audience: Engineering, AI team │
│ • Metrics: AI Agent decisions, accuracy, escalations │
│ • Refresh: Real-time │
│ • Example: AI Agent performance, human overrides │
│ │
└─────────────────────────────────────────────────────────────┘
5.2 Dashboard Best Practices¶
┌─────────────────────────────────────────────────────────────┐
│ DASHBOARD BEST PRACTICES │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Design Principles] │
│ • Start with questions, not metrics │
│ • One dashboard, one purpose │
│ • Use appropriate visualizations │
│ • Include context (baselines, thresholds) │
│ • Make it actionable │
│ │
│ [What to Include] │
│ • Current status (green/yellow/red) │
│ • Trends over time │
│ • Key metrics (limited to 5-10) │
│ • Links to related dashboards │
│ • Links to runbooks │
│ │
│ [What to Avoid] │
│ • Too many metrics (dashboard overload) │
│ • Metrics without context │
│ • Static dashboards (no time range selection) │
│ • Dashboards without owners │
│ • Dashboards that no one looks at │
│ │
│ [Maintenance] │
│ • Review dashboards quarterly │
│ • Remove unused dashboards │
│ • Update as services change │
│ • Document dashboard purpose │
│ │
└─────────────────────────────────────────────────────────────┘
5.3 Grafana Dashboard Template¶
File: monitoring/dashboards/production-overview.json
{
"dashboard": {
"title": "Production Overview",
"tags": ["production", "overview"],
"timezone": "browser",
"panels": [
{
"title": "System Health",
"type": "stat",
"targets": [
{
"expr": "up{environment=\"production\"}",
"legendFormat": "{{service}}"
}
],
"thresholds": [
{"value": 0, "color": "red"},
{"value": 1, "color": "green"}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\",environment=\"production\"}[5m])) / sum(rate(http_requests_total{environment=\"production\"}[5m])) * 100",
"legendFormat": "Error Rate %"
}
],
"thresholds": [
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
},
{
"title": "Latency (P99)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{environment=\"production\"}[5m])) by (le))",
"legendFormat": "P99 Latency"
}
],
"thresholds": [
{"value": 0.5, "color": "yellow"},
{"value": 1, "color": "red"}
]
},
{
"title": "Deployment Status",
"type": "table",
"targets": [
{
"expr": "deployment_info{environment=\"production\"}",
"format": "table"
}
]
},
{
"title": "Active Incidents",
"type": "alertlist",
"alerts": {
"state": ["alerting"],
"tags": ["production"]
}
}
],
"refresh": "30s",
"time": {
"from": "now-1h",
"to": "now"
}
}
}
6. Part 5: AI Agent Monitoring – Special Considerations for Chapter 10 ¶
6.1 AI Agent Metrics to Track¶
┌─────────────────────────────────────────────────────────────┐
│ AI AGENT METRICS (Chapter 10 Preview) │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Decision Metrics] │
│ • Total decisions made │
│ • Decisions by type (deploy/rollback/escalate) │
│ • Decision confidence scores │
│ • Decision accuracy (vs. human decisions) │
│ │
│ [Performance Metrics] │
│ • Decision latency (time to decide) │
│ • Action execution time │
│ • API call success rate │
│ • Rate limit hits │
│ │
│ [Safety Metrics] │
│ • Escalation rate (to humans) │
│ • Human override rate │
│ • Boundary violations │
│ • Emergency stop activations │
│ │
│ [Learning Metrics] │
│ • Model accuracy over time │
│ • False positive rate │
│ • False negative rate │
│ • Learning implementation rate │
│ │
└─────────────────────────────────────────────────────────────┘
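The safety metrics above all reduce to ratios over a decision log. A minimal sketch, assuming an illustrative record shape (the keys `escalated`, `overridden`, and `correct` are not a standard schema):

```python
def agent_safety_metrics(decisions):
    """Compute core AI Agent safety ratios from a decision log.

    Each record is a dict with illustrative boolean keys:
    'escalated' (handed to a human), 'overridden' (human reversed it),
    and 'correct' (matched the right call in hindsight)."""
    n = len(decisions)
    return {
        "escalation_rate": sum(d["escalated"] for d in decisions) / n,
        "override_rate": sum(d["overridden"] for d in decisions) / n,
        "accuracy": sum(d["correct"] for d in decisions) / n,
    }
```

These three numbers feed the alert rules and dashboard panels in the next two sections: escalation rate and accuracy get thresholds, and the override rate is the early-warning signal that humans no longer trust the agent.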
6.2 AI Agent Monitoring Configuration¶
File: monitoring/config/ai-agent-rules.yml
# AI Agent Monitoring Rules (Chapter 10)
groups:
  - name: ai-agent
    interval: 30s
    rules:
      # Note: Prometheus alert names may not contain spaces.
      - alert: AIAgentLowConfidence
        # avg() cannot take a range vector; use avg_over_time per series,
        # then average across agents.
        expr: avg(avg_over_time(ai_agent_confidence_score[5m])) < 0.7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI Agent confidence is low"
          description: "AI Agent average confidence is below 70%"
      - alert: AIAgentHighEscalationRate
        expr: sum(rate(ai_agent_escalations_total[1h])) / sum(rate(ai_agent_decisions_total[1h])) > 0.3
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "AI Agent escalation rate is high"
          description: "AI Agent is escalating more than 30% of decisions"
      - alert: AIAgentBoundaryViolation
        expr: increase(ai_agent_boundary_violations_total[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "AI Agent boundary violation detected"
          description: "AI Agent attempted to violate boundaries"
      - alert: AIAgentDecisionAccuracyDrop
        expr: avg(avg_over_time(ai_agent_decision_accuracy[24h])) < 0.85
        for: 24h
        labels:
          severity: warning
        annotations:
          summary: "AI Agent decision accuracy has dropped"
          description: "AI Agent accuracy is below 85% over 24 hours"
6.3 AI Agent Dashboard Template¶
File: monitoring/dashboards/ai-agent-overview.json
{
"dashboard": {
"title": "AI Agent Overview",
"tags": ["ai-agent", "automation"],
"panels": [
{
"title": "AI Agent Decisions",
"type": "stat",
"targets": [
{
"expr": "sum(ai_agent_decisions_total)",
"legendFormat": "Total Decisions"
}
]
},
{
"title": "Decision Confidence",
"type": "gauge",
"targets": [
{
"expr": "avg(ai_agent_confidence_score)",
"legendFormat": "Avg Confidence"
}
],
"thresholds": [
{"value": 0.5, "color": "red"},
{"value": 0.7, "color": "yellow"},
{"value": 0.85, "color": "green"}
]
},
{
"title": "Escalation Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(ai_agent_escalations_total[1h])) / sum(rate(ai_agent_decisions_total[1h])) * 100",
"legendFormat": "Escalation Rate %"
}
],
"thresholds": [
{"value": 20, "color": "yellow"},
{"value": 30, "color": "red"}
]
},
{
"title": "Decision Accuracy",
"type": "graph",
"targets": [
{
"expr": "avg(ai_agent_decision_accuracy)",
"legendFormat": "Accuracy %"
}
],
"thresholds": [
{"value": 85, "color": "yellow"},
{"value": 95, "color": "green"}
]
},
{
"title": "Boundary Violations",
"type": "stat",
"targets": [
{
"expr": "sum(ai_agent_boundary_violations_total)",
"legendFormat": "Violations"
}
],
"thresholds": [
{"value": 0, "color": "green"},
{"value": 1, "color": "red"}
]
},
{
"title": "Recent Decisions",
"type": "table",
"targets": [
{
"expr": "ai_agent_decisions_total",
"format": "table"
}
]
}
],
"refresh": "30s"
}
}
6.4 AI Agent Audit Trail¶
# AI Agent Audit Trail Requirements
## What to Log:
- All AI Agent decisions (with rationale)
- All AI Agent actions (deploy, rollback, escalate, block)
- All human approvals/rejections of AI recommendations
- All AI Agent boundary violations
- All AI Agent emergency stop activations
- All AI Agent rule changes
## Log Format:
```json
{
"timestamp": "2024-01-15T10:30:00Z",
"agent_id": "deployment-agent-01",
"decision": "deploy",
"version": "v1.0.1",
"environment": "staging",
"risk_level": "low",
"confidence_score": 0.92,
"approval_required": false,
"approver": null,
"outcome": "success",
"duration": "45s",
"rationale": "PATCH version, tests passed, security scan passed"
}
```
## Retention:
- AI Agent decisions: 7 years
- AI Agent boundary violations: 7 years
- AI Agent rule changes: 7 years
- AI Agent learning updates: 2 years
## Access:
- Engineers: Read own AI Agent decisions
- Team leads: Read team AI Agent decisions
- Security: Read all AI Agent logs
- Compliance: Read all AI Agent logs
- Auditors: Read all AI Agent logs (time-limited)
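An audit record like the example above should be validated before it is written, so malformed entries never reach the 7-year trail. A sketch; the required-field set follows the example record and is an assumption, not a fixed schema:

```python
import json

# Required fields follow the example audit record above;
# the exact set is an illustrative assumption.
REQUIRED_FIELDS = {"timestamp", "agent_id", "decision", "environment",
                   "confidence_score", "outcome", "rationale"}

def write_audit_record(record, sink):
    """Validate and append one audit record as a JSON line.

    Rejecting malformed records at write time keeps the audit
    trail complete and queryable for compliance reviews."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"audit record missing fields: {sorted(missing)}")
    sink.write(json.dumps(record, sort_keys=True) + "\n")
```

The JSON-lines format matches the Fluentd audit source configured in Section 3.4, so validated records flow into the `audit-logs` index without extra parsing.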
7. Part 6: VSCode Integration for Monitoring Workflows ¶
7.1 Continue.dev Configuration for Monitoring¶
File: ~/.continue/config.json
{
  "models": [
    {
      "title": "🔵 Qwen-2.5-Coder (Monitoring Code)",
      "provider": "openai",
      "model": "qwen-2.5-coder",
      "apiKey": "${QWEN_API_KEY}",
      "apiBase": "https://dashscope.aliyuncs.com/compatible-mode/v1",
      "default": true
    },
    {
      "title": "🟢 DeepSeek-V3 (Monitoring Logic)",
      "provider": "openai",
      "model": "deepseek-chat",
      "apiKey": "${DEEPSEEK_API_KEY}",
      "apiBase": "https://api.deepseek.com/v1"
    },
    {
      "title": "🟠 Claude-3.5-Sonnet (Alert Review)",
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022",
      "apiKey": "${ANTHROPIC_API_KEY}"
    }
  ],
  "customCommands": [
    {
      "name": "monitoring-metric",
      "prompt": "Generate monitoring metric configuration for {{{ input }}}. CRITICAL: 1) Follow monitoring architecture from Chapter 8, 2) Include appropriate thresholds, 3) Include alert routing, 4) Include runbook reference. Follow Chapter 8 templates.",
      "description": "Generate monitoring metric configuration"
    },
    {
      "name": "alert-rule",
      "prompt": "Generate alert rule for {{{ input }}}. Include: 1) Alert expression, 2) Severity level, 3) Notification channels, 4) Runbook reference. Follow Chapter 8 alerting strategy.",
      "description": "Generate alert rule"
    },
    {
      "name": "alert-runbook",
      "prompt": "Generate alert runbook for {{{ input }}}. Include: 1) Alert description, 2) Trigger conditions, 3) Immediate actions, 4) Investigation steps, 5) Resolution steps, 6) Escalation procedure. Follow Chapter 8 runbook template.",
      "description": "Generate alert runbook"
    },
    {
      "name": "dashboard-panel",
      "prompt": "Generate Grafana dashboard panel for {{{ input }}}. Include: 1) Panel type, 2) Query expression, 3) Thresholds, 4) Visualization options. Follow Chapter 8 dashboard best practices.",
      "description": "Generate Grafana dashboard panel"
    },
    {
      "name": "ai-agent-metric",
      "prompt": "Generate AI Agent monitoring metric for {{{ input }}}. Include: 1) Metric definition, 2) Alert thresholds, 3) Dashboard panel, 4) Audit trail requirements. Follow Chapter 8 AI Agent monitoring (Chapter 10 preparation).",
      "description": "Generate AI Agent monitoring metric"
    }
  ]
}
7.2 VSCode Snippets for Monitoring¶
File: ~/.vscode/snippets/monitoring.json
{
"Prometheus Alert Rule": {
"prefix": "prom-alert",
"body": [
"- alert: ${1:AlertName}",
" expr: ${2:expression}",
" for: ${3:5m}",
" labels:",
" severity: ${4:warning}",
" annotations:",
" summary: \"${5:Alert summary}\"",
" description: \"${6:Alert description}\"",
" runbook: \"${7:URL to runbook}\""
],
"description": "Prometheus alert rule template"
},
"Alert Runbook": {
"prefix": "alert-runbook",
"body": [
"# Alert Runbook: ${1:Alert Name}",
"",
"## Severity: ${2:P1/P2/P3/P4}",
"",
"## Description:",
"${3:What this alert means}",
"",
"## Trigger Conditions:",
"${4:When this alert fires}",
"",
"## Immediate Actions:",
"1. ${5:Step 1}",
"2. ${6:Step 2}",
"3. ${7:Step 3}",
"",
"## Investigation:",
"1. ${8:Check metric X}",
"2. ${9:Check log Y}",
"3. ${10:Check trace Z}",
"",
"## Resolution:",
"1. ${11:Fix step 1}",
"2. ${12:Fix step 2}",
"3. ${13:Verify fix}",
"",
"## Escalation:",
"- If not resolved in 30 minutes: Escalate to ${14:role}",
"- If not resolved in 1 hour: Escalate to ${15:role}",
"",
"## Last Updated: ${16:DATE}",
"## Owner: ${17:NAME/ROLE}"
],
"description": "Alert runbook template"
},
"Grafana Panel": {
"prefix": "grafana-panel",
"body": [
"{",
" \"title\": \"${1:Panel Title}\",",
" \"type\": \"${2:graph}\",",
" \"targets\": [",
" {",
" \"expr\": \"${3:prometheus_expression}\",",
" \"legendFormat\": \"${4:Legend}\"",
" }",
" ],",
" \"thresholds\": [",
" {\"value\": ${5:0}, \"color\": \"${6:red}\"},",
" {\"value\": ${7:1}, \"color\": \"${8:green}\"}",
" ]",
"}"
],
"description": "Grafana panel template"
},
"AI Agent Metric": {
"prefix": "ai-agent-metric",
"body": [
"# AI Agent Metric: ${1:Metric Name}",
"",
"## Definition:",
"${2:What this metric measures}",
"",
"## Expression:",
"```promql",
"${3:prometheus_expression}",
"```",
"",
"## Thresholds:",
"- Warning: ${4:threshold}",
"- Critical: ${5:threshold}",
"",
"## Alert:",
"- Name: ${6:alert_name}",
"- Severity: ${7:P1/P2/P3/P4}",
"- Notification: ${8:channels}",
"",
"## Dashboard:",
"- Panel Type: ${9:type}",
"- Refresh: ${10:30s}",
"",
"## Audit Trail:",
"- Log: ${11:YES/NO}",
"- Retention: ${12:7 years}"
],
"description": "AI Agent monitoring metric template"
}
}
8. Part 7: Iteration Points – Your Feedback Needed ¶
8.1 This Chapter's Core Message¶
"You can't automate what you can't observe. This chapter provides the monitoring, observability, and alerting foundation that Chapters 3-7 operate within, and that Chapter 10 AI Agents need to make informed decisions."
8.2 Questions for Your Feedback¶
□ Question 1: Does the monitoring vs. observability distinction come through clearly?
- Is this the right framing for your experience?
- What would make it clearer?
□ Question 2: Are the monitoring layers comprehensive?
- Do you monitor infrastructure, application, and pipeline?
- What's missing?
□ Question 3: Is the alerting strategy practical?
- Do you have alert severity levels?
- What would you change?
□ Question 4: Are the dashboard best practices useful?
- Do your dashboards follow these principles?
- What would you add?
□ Question 5: Is the AI Agent monitoring section helpful?
- Does this prepare you for Chapter 10?
- What metrics are missing?
□ Question 6: Is the VSCode integration practical?
- Do the custom commands make sense?
- What workflows would save you time?
□ Question 7: What's missing?
- What topics should be added?
- What should be removed or condensed?
9. Appendix: Monitoring Templates & Configurations ¶
9.1 Monitoring Checklist¶
# Monitoring Implementation Checklist
## Infrastructure Monitoring:
□ CPU, memory, disk metrics collected
□ Network metrics collected
□ Load balancer health monitored
□ Database connections monitored
□ Alerts configured for critical thresholds
## Application Monitoring:
□ Request rate monitored
□ Error rate monitored
□ Latency (p50, p95, p99) monitored
□ Business metrics tracked
□ Distributed tracing enabled
## Pipeline Monitoring:
□ CI/CD pipeline status monitored
□ Deployment frequency tracked
□ Deployment success rate tracked
□ Rollback frequency tracked
□ DORA metrics calculated
## Alerting:
□ Alert severity levels defined
□ Alert routing configured
□ Alert runbooks created
□ Alert fatigue prevention implemented
□ Monthly alert review scheduled
## Dashboards:
□ Executive dashboard created
□ Operations dashboard created
□ Development dashboard created
□ AI Agent dashboard prepared (Chapter 10)
□ Dashboard review scheduled quarterly
## AI Agent Monitoring (Chapter 10):
□ AI Agent decision metrics defined
□ AI Agent confidence tracking enabled
□ AI Agent escalation rate monitored
□ AI Agent boundary violations logged
□ AI Agent audit trail configured
## Sign-Off:
□ Engineering Lead: ________________ Date: ________
□ Operations Lead: ________________ Date: ________
□ Security Lead: ________________ Date: ________
9.2 The Chapter 8 Checklist¶
# Chapter 8: Monitoring, Observability & Alerting - Checklist
## Monitoring Architecture:
□ Infrastructure monitoring enabled (Section 3.1)
□ Application monitoring enabled (Section 3.1)
□ Pipeline monitoring enabled (Section 3.1)
□ AI Agent monitoring prepared (Section 6)
## Alerting:
□ Alert severity levels defined (Section 4.1)
□ Alert routing configured (Section 4.2)
□ Alert runbooks created (Section 4.4)
□ Alert fatigue prevention implemented (Section 4.3)
## Dashboards:
□ Executive dashboard created (Section 5.1)
□ Operations dashboard created (Section 5.1)
□ Development dashboard created (Section 5.1)
□ AI Agent dashboard prepared (Section 6.3)
## AI Agent Preparation (Chapter 10):
□ AI Agent metrics defined (Section 6.1)
□ AI Agent monitoring configured (Section 6.2)
□ AI Agent audit trail prepared (Section 6.4)
## Key Principle:
"You can't automate what you can't observe. Monitoring is the foundation for AI Agents."
Chapter Summary¶
The Core Message¶
┌─────────────────────────────────────────────────────────────┐
│ CHAPTER 8 IN ONE SENTENCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ "You can't automate what you can't observe. This chapter │
│ provides the monitoring, observability, and alerting │
│ foundation that Chapters 3-7 operate within, and that │
│ Chapter 10 AI Agents need to make informed decisions." │
│ │
└─────────────────────────────────────────────────────────────┘
Key Takeaways¶
✅ Monitoring vs. observability – Understand the difference
✅ Monitoring architecture: Infrastructure, application, pipeline, AI Agent
✅ Alerting strategy: Severity levels, routing, runbooks
✅ Dashboards: Executive, operations, development, AI Agent
✅ AI Agent monitoring: Special considerations for Chapter 10
✅ VSCode integration: Monitoring templates and workflows
✅ Chapter 10: AI Agents need this monitoring data to decide
Connection to Other Chapters¶
| Chapter | Connection |
|---|---|
| Chapter 3 | InfraCtl structure → Monitoring validates structure |
| Chapter 4 | Ansible structure → Monitoring validates deployment |
| Chapter 5 | CI/CD structure → Monitoring validates pipelines |
| Chapter 6 | Production deployment → Monitoring validates production |
| Chapter 7 | Governance → Monitoring enforces governance |
| Chapter 8 | Monitoring, Observability & Alerting |
| Chapter 9 | Continuous Improvement → Monitoring provides data |
| Chapter 10 | AI Agents → USE this monitoring data to decide |
Book Progress¶
✅ Chapter 1: AI Foundations (Symbolic + Data-Driven)
✅ Chapter 2: VSCode AI Integration
✅ Chapter 3: Structured IaC (InfraCtl)
✅ Chapter 4: Structured Deployment (Ansible)
✅ Chapter 5: Structured CI/CD (Pipelines + Runners)
✅ Chapter 6: Production Deployment & Release Management
✅ Chapter 7: Governance, Safety & Compliance
✅ Chapter 8: Monitoring, Observability & Alerting
Next:
□ Chapter 9: Continuous Improvement & Learning
□ Chapter 10: AI Agents (Culmination)
□ Index: Quick Reference & Publishing
Document Version: 0.1 (Draft for Iteration)
Part of: The DevOps Engineer's Guide to Effective AI Usage
Last Updated: [Current Date]
Prepared By: [Your Name]
This is a DRAFT for iteration. Please provide feedback on Section 8.2 questions. After your review, I'll proceed to Chapter 9 (Continuous Improvement & Learning). The core message is: You can't automate what you can't observe. AI Agents (Chapter 10) need this monitoring data to make informed decisions.