Service Health & Deployment Diagnostics

github

datadog

kubernetes

slack

Automate health checks, resource diagnostics, and deployment visibility for your services. Get real-time insights and on-call notifications via Slack.

TL;DR

This runbook automates a full diagnostic sweep for your services — including health checks, deployment visibility, resource monitoring, latency, and error rate analysis. After execution, it posts a detailed summary to Slack for on-call awareness.

Who is this for?

Site Reliability Engineers (SREs), platform engineers, and DevOps professionals responsible for infrastructure uptime, performance monitoring, and incident response.

What problem does this solve?

Without automation, monitoring health, deployment activity, and performance metrics across multiple platforms (Datadog, Kubernetes, GitHub) can be slow and error-prone.

This runbook solves:

Manual health verification across tools
Lag in detecting performance issues
Delays in team notification during failures or anomalies

What this workflow accomplishes

Verifies API health of your services
Checks for new deployments within the past 20 minutes
Audits CPU and memory usage using Datadog and Kubernetes
Identifies latency and error rate anomalies
Scans recent logs for application errors
Sends structured summary reports to a Slack channel with on-call alerts

Integrations

This runbook uses the following integrations:

GitHub Agent: Checks for recent deployments
Datadog Agent: Collects resource usage, latency, and error metrics
Kubernetes Agent: Retrieves pod-level resource usage
Slack Agent: Sends detailed runbook results and alerts

Setup

GitHub:
- Repo access token with deployment read permissions
Datadog:
- Valid API key and App key
- Required scopes: Metrics read, Logs read
Kubernetes:
- Bearer Token - A long-lived ServiceAccount Token
- Cluster CA Certificate - The cluster’s root certificate for TLS verification
- API Server URL - The Kubernetes cluster endpoint URL
- Access to query pod metrics in the workspaces namespace
Slack:
- Bot token with chat:write permission
- Channel: #random (the runbook template below posts to #on-call; use the same channel in both places)
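
Before the first run, it can help to confirm that every credential listed above is actually present. The following is a minimal sketch, assuming the values are supplied as environment variables. The variable names (`GITHUB_TOKEN`, `DD_API_KEY`, and so on) are illustrative only; this page does not say how credentials are stored.

```python
import os

# Hypothetical environment variable names, grouped by integration.
# The setup list names the credentials but not how they are stored.
REQUIRED_SETTINGS = {
    "GitHub": ["GITHUB_TOKEN"],
    "Datadog": ["DD_API_KEY", "DD_APP_KEY"],
    "Kubernetes": ["K8S_BEARER_TOKEN", "K8S_CA_CERT", "K8S_API_SERVER_URL"],
    "Slack": ["SLACK_BOT_TOKEN", "SLACK_CHANNEL_ID"],
}


def missing_settings(env=None):
    """Return {integration: [missing names]} for unset or empty values."""
    env = os.environ if env is None else env
    missing = {}
    for integration, names in REQUIRED_SETTINGS.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[integration] = absent
    return missing
```

Calling `missing_settings()` with no argument inspects the current process environment; an empty dictionary means nothing is missing.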

Runbook Template

📚 runbook.mdx

Runbook

Objective: Automate deployment verification for {{service}} (can be all services) and send detailed summary reports to Slack.

Steps:

(1) Verify Service Health

Execute a GET request using the HTTP Agent to:

{{service_url}}/healthy

Confirm that the service returns a healthy status.
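
For readers who want to see what this step amounts to outside the HTTP Agent, here is a rough sketch using only the Python standard library. It assumes that any 2xx response counts as healthy; the runbook does not say what a "healthy status" looks like, so adjust the rule if your service reports health in the response body. The function names are illustrative.

```python
import urllib.error
import urllib.request


def health_url(service_url):
    """Build the health endpoint from the runbook: {{service_url}}/healthy."""
    return service_url.rstrip("/") + "/healthy"


def fetch_status(url, timeout=5.0):
    """Issue a GET and return the HTTP status code, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def is_healthy(service_url, fetch=fetch_status):
    """Treat any 2xx response as healthy (an assumption, not a runbook rule)."""
    status = fetch(health_url(service_url))
    return status is not None and 200 <= status < 300
```

The `fetch` parameter exists so the decision logic can be exercised without a network call, for example `is_healthy("https://example.test", fetch=lambda url: 503)`.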

(2) Fetch Recent Deployments (GitHub)

Query for deployments of the {{service}} service using the GitHub Agent within the last 20 minutes.
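
The 20-minute window can be illustrated with a small filter. This sketch assumes the input is a list of deployment objects shaped like those returned by the GitHub REST API, where each item carries a `created_at` timestamp in UTC. How the GitHub Agent fetches them is not specified here, so the fetch itself is left out.

```python
from datetime import datetime, timedelta, timezone


def parse_github_time(value):
    """Parse a timestamp such as '2025-05-30T12:00:00Z' as an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def recent_deployments(deployments, now=None, window=timedelta(minutes=20)):
    """Keep deployments whose created_at falls inside the look-back window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [
        d for d in deployments
        if cutoff <= parse_github_time(d["created_at"]) <= now
    ]
```

Passing `now` explicitly makes the filter deterministic, which is convenient when replaying an incident timeline.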

(3) Analyze Resource Usage (Datadog)

Check CPU and Memory usage of the deployed application using the Datadog Agent over the past 20 minutes. Ensure values are below 80% for both metrics.
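
The 80% rule can be sketched as a small check over time-series points. The code assumes each series is a list of `[timestamp, value]` pairs with values already expressed as percentages, and it compares the peak value against the threshold. Using the peak rather than the average is a choice made for this example, not something the runbook prescribes.

```python
THRESHOLD_PERCENT = 80.0  # from the runbook: both metrics must stay below 80%


def peak(points):
    """Return the highest non-null value from [timestamp, value] pairs."""
    values = [value for _, value in points if value is not None]
    return max(values) if values else None


def check_usage(cpu_points, memory_points, threshold=THRESHOLD_PERCENT):
    """Return a verdict per metric: 'ok', 'over', or 'unknown' when no data."""
    report = {}
    for name, points in (("cpu", cpu_points), ("memory", memory_points)):
        highest = peak(points)
        if highest is None:
            report[name] = "unknown"
        elif highest < threshold:
            report[name] = "ok"
        else:
            report[name] = "over"
    return report
```

Reporting `unknown` separately keeps a missing metric from being mistaken for a healthy one.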

(4) Kubernetes Resource Usage (Kubernetes)

Retrieve memory and CPU usage of all pods in the {{service}} namespace using the Kubernetes Agent. Highlight any pods exceeding resource limits.
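
Comparing usage against limits requires converting Kubernetes quantity strings such as `250m` (CPU) or `128Mi` (memory) into plain numbers. The sketch below handles a common subset of suffixes. The pod dictionary shape (`name`, `usage`, `limits`) is hypothetical and stands in for whatever the Kubernetes Agent returns; usage at or above the limit is flagged.

```python
# Multipliers for a common subset of Kubernetes quantity suffixes.
_SUFFIXES = {
    "n": 1e-9, "u": 1e-6, "m": 1e-3,
    "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(text):
    """Convert a quantity such as '250m' or '128Mi' to a float in base units."""
    # Try two-letter suffixes first so 'Mi' is never confused with a shorter one.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return float(text)


def pods_over_limit(pods):
    """Return names of pods whose CPU or memory usage is at or above its limit.

    Each pod is a dict with 'name', 'usage', and 'limits' keys (hypothetical shape).
    Resources that have no limit set are skipped.
    """
    flagged = []
    for pod in pods:
        for resource in ("cpu", "memory"):
            limit = pod["limits"].get(resource)
            usage = pod["usage"].get(resource)
            if limit is None or usage is None:
                continue
            if parse_quantity(usage) >= parse_quantity(limit):
                flagged.append(pod["name"])
                break  # one offending resource is enough to flag the pod
    return flagged
```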

(5) Latency Check (Datadog)

Run the following query using the Datadog Agent:

trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server{service:{{service}}}

Check for latency anomalies or spikes over the last 20 minutes.
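
The runbook does not define what counts as a spike. One simple heuristic, shown here purely as an illustration, is to flag any sample that exceeds a multiple of the window's median. The factor of 3 is an arbitrary assumption and should be tuned per service.

```python
from statistics import median


def find_spikes(latencies_ms, factor=3.0):
    """Return (index, value) pairs whose value exceeds factor times the median."""
    if not latencies_ms:
        return []
    baseline = median(latencies_ms)
    return [
        (index, value)
        for index, value in enumerate(latencies_ms)
        if value > factor * baseline
    ]
```

The median is used as the baseline because a single large outlier shifts it far less than it would shift the mean.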

(6) Error Rate Analysis (Datadog)

Use this query to retrieve the top error logs from the last 20 minutes using the Datadog Agent:

status:error service:{{service}}

Monitor error rates and identify any recent spikes.

(7) Log Analysis (Datadog)

Check all error logs with the following query:

service:{{service}} status:error

Look for critical or recurring error messages.
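
Finding recurring messages can be approximated by counting identical log lines. This sketch assumes the log messages have already been retrieved as plain strings. It matches on exact text after trimming whitespace, so messages that embed request IDs or timestamps would need normalizing first.

```python
from collections import Counter


def recurring_errors(messages, min_count=2, top=5):
    """Return the most frequent messages that occur at least min_count times."""
    counts = Counter(message.strip() for message in messages)
    return [
        (message, count)
        for message, count in counts.most_common(top)
        if count >= min_count
    ]
```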

(8) Slack Notification

Send a message to Slack channel #on-call, tagging @team-oncall with the following format:

🚨 Runbook Executed: {{service}} Service Diagnostics

📅 Date & Time: {{execution_time}}
🔧 Service: {{service}}
🌐 Environment: {{environment_name}}
📁 Repository: {{service}}

Summary:
All diagnostic checks have been completed.

Detailed Steps:
- Health Check - ✅ Passed
- Deployment Activity - ✅ Found recent deployment
- Datadog Resource Usage - ✅ CPU/Memory below threshold
- Kubernetes Resource Usage - ✅ All pods within safe range
- Latency Metrics - ✅ No spikes
- Error Rate - ✅ Normal
- Logs - ✅ No recent critical errors

Result: ✅ SUCCESS

If failures or anomalies are detected, adjust Result to:
- ❌ FAILURE
- ⚠️ CRITICAL

Add footer:

:alert: @team-oncall please review the runbook execution results.
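
The message layout above can be assembled from per-step outcomes. The sketch below is one possible rendering. The rule for choosing between FAILURE and CRITICAL is an assumption (a failed health check is treated as CRITICAL, any other failure as FAILURE), because the runbook lists both values without saying when each applies. Sending the text to Slack is left to the Slack Agent.

```python
def overall_result(steps):
    """Map step outcomes to the Result line.

    Assumption: a failed health check is CRITICAL, any other failure is FAILURE.
    The runbook names both values but not when to use which.
    """
    failed = [name for name, (passed, _) in steps.items() if not passed]
    if not failed:
        return "✅ SUCCESS"
    if "Health Check" in failed:
        return "⚠️ CRITICAL"
    return "❌ FAILURE"


def build_message(service, environment, execution_time, steps):
    """Render the Slack summary; steps maps step name to (passed, detail)."""
    lines = [
        f"🚨 Runbook Executed: {service} Service Diagnostics",
        "",
        f"📅 Date & Time: {execution_time}",
        f"🔧 Service: {service}",
        f"🌐 Environment: {environment}",
        f"📁 Repository: {service}",
        "",
        "Summary:",
        "All diagnostic checks have been completed.",
        "",
        "Detailed Steps:",
    ]
    for name, (passed, detail) in steps.items():
        mark = "✅" if passed else "❌"
        lines.append(f"- {name} - {mark} {detail}")
    lines += [
        "",
        f"Result: {overall_result(steps)}",
        "",
        ":alert: @team-oncall please review the runbook execution results.",
    ]
    return "\n".join(lines)
```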

Alexis Warner

Marketing

May 30, 2025


Agents Used

HTTP Agent
GitHub Agent
Datadog Agent
Kubernetes Agent
Slack Agent
