Investigate your PagerDuty alerts


Run automated health checks, deployment analysis, and performance diagnostics for your services, then summarize the results and alert on-call teams via Slack.

TL;DR

This runbook performs a complete diagnostic of a service, including health checks, recent deployment verification, resource monitoring, latency tracking, and error log analysis. After execution, results are shared with the on-call team in Slack.

Who is this for?

SREs, DevOps teams, and platform engineers responsible for maintaining the reliability and performance of backend services.

What problem does this solve?

Manually checking service health, recent deployments, and performance indicators across multiple systems is time-consuming and error-prone.

This runbook solves:

  • Lack of centralized diagnostic automation
  • Delays in identifying critical issues
  • Inconsistent on-call communication during incidents

What this workflow accomplishes

  • Confirms service health via HTTP check
  • Checks for deployments in the past 20 minutes
  • Audits CPU and memory usage via Datadog and Kubernetes
  • Identifies latency issues and error log patterns
  • Summarizes results and alerts via Slack

Integrations

This runbook uses the following integrations:

  • PagerDuty Agent: Retrieves incident and alert details
  • GitHub Agent: Verifies recent deployments
  • Datadog Agent: Gathers metrics and logs
  • Kubernetes Agent: Monitors pod resource consumption
  • Slack Agent: Sends structured reports to the on-call team

Setup

  • PagerDuty:
    • API token with read access to incidents and services
    • Required permissions: Read incidents, services, and escalation policies
  • GitHub:
    • Access token with deployment read permissions
  • Datadog:
    • API key and App key
    • Required scopes: Metrics and Logs read access
  • Kubernetes:
    • Bearer Token - A long-lived ServiceAccount Token
    • Cluster CA Certificate - The cluster’s root certificate for TLS verification
    • API Server URL - The Kubernetes cluster endpoint URL
    • Access to the specific namespace
  • Slack:
    • Bot token with chat:write scope
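
If you wire these credentials into your own scripts rather than into the agent configuration, here is a minimal sketch of the environment variables the reference snippets below assume (all variable names are illustrative, not product requirements):

    import os

    # Illustrative environment variable names; adjust to however you store secrets.
    PAGERDUTY_TOKEN  = os.environ["PAGERDUTY_API_TOKEN"]   # read-only REST API token
    GITHUB_TOKEN     = os.environ["GITHUB_TOKEN"]          # token with deployment read access
    DD_API_KEY       = os.environ["DD_API_KEY"]            # Datadog API key
    DD_APP_KEY       = os.environ["DD_APP_KEY"]            # Datadog application key
    K8S_API_SERVER   = os.environ["K8S_API_SERVER"]        # Kubernetes API server URL
    K8S_BEARER_TOKEN = os.environ["K8S_BEARER_TOKEN"]      # long-lived ServiceAccount token
    K8S_CA_CERT_PATH = os.environ["K8S_CA_CERT_PATH"]      # path to the cluster CA certificate
    SLACK_BOT_TOKEN  = os.environ["SLACK_BOT_TOKEN"]       # bot token with chat:write scope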

Runbook Template

📚 runbook.mdx
Runbook

Objective: Run health and performance diagnostics for service {{service}} (or for all services at once).

Steps:

(1) Retrieve PagerDuty Incident Details

  • Use the PagerDuty Agent to fetch incident details for service {{service}}.
  • Gather information about:
    • Current incident status and priority
    • Time since incident creation
    • Assigned escalation policy
    • Recent incident history for the service
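
The PagerDuty Agent gathers this automatically; for reference, a minimal sketch of a roughly equivalent REST call (the function and variable names are illustrative):

    import requests

    PAGERDUTY_TOKEN = "..."  # read-only REST API token from the Setup section

    def fetch_open_incidents(service_id: str) -> list[dict]:
        """Return triggered/acknowledged incidents for one PagerDuty service, newest first."""
        resp = requests.get(
            "https://api.pagerduty.com/incidents",
            headers={
                "Authorization": f"Token token={PAGERDUTY_TOKEN}",
                "Accept": "application/vnd.pagerduty+json;version=2",
            },
            params={
                "service_ids[]": service_id,
                "statuses[]": ["triggered", "acknowledged"],
                "sort_by": "created_at:desc",
            },
        )
        resp.raise_for_status()
        return resp.json()["incidents"]

Each incident record carries status, priority, created_at, and the assigned escalation policy, which covers the fields listed above.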

(2) Verify Service Health

  • Execute a GET request using the HTTP Agent to:
{{service_url}}/healthy

Confirm that the service returns a healthy status.
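
The HTTP Agent issues a plain GET; a roughly equivalent sketch (the /healthy path comes from the template above, and treating any 2xx response as healthy is an assumption):

    import requests

    def check_health(service_url: str) -> bool:
        """GET <service_url>/healthy and treat any 2xx response as healthy."""
        resp = requests.get(f"{service_url}/healthy", timeout=5)
        return resp.ok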

(3) Fetch Recent Deployments (GitHub)

  • Use the GitHub Agent to query for deployments of the {{service}} service within the last 20 minutes.
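
For reference, a sketch of one way to do this against the GitHub REST API, assuming the repository is named after the service and lives under a hypothetical my-org owner:

    from datetime import datetime, timedelta, timezone

    import requests

    GITHUB_TOKEN = "..."  # token with deployment read access
    ORG = "my-org"        # hypothetical owner; adjust to your organization

    def recent_deployments(service: str, minutes: int = 20) -> list[dict]:
        """Return deployments of <ORG>/<service> created within the last `minutes` minutes."""
        resp = requests.get(
            f"https://api.github.com/repos/{ORG}/{service}/deployments",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github+json",
            },
        )
        resp.raise_for_status()
        cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes)
        return [
            d for d in resp.json()
            if datetime.fromisoformat(d["created_at"].replace("Z", "+00:00")) >= cutoff
        ]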

(4) Analyze Resource Usage (Datadog)

  • Check CPU and memory usage of the deployed application using the Datadog Agent over the past 15 minutes. Ensure values are below 80% for both metrics.
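
A sketch of the underlying timeseries query; the kubernetes.* metric names and the "payments" scope below are placeholders, and comparing the series against the 80% threshold is left to the summarization step:

    import time

    import requests

    DD_API_KEY = "..."  # Datadog API key
    DD_APP_KEY = "..."  # Datadog application key

    def query_metric(query: str, window_s: int = 15 * 60) -> dict:
        """Run a Datadog timeseries query over the last `window_s` seconds."""
        now = int(time.time())
        resp = requests.get(
            "https://api.datadoghq.com/api/v1/query",
            headers={"DD-API-KEY": DD_API_KEY, "DD-APPLICATION-KEY": DD_APP_KEY},
            params={"from": now - window_s, "to": now, "query": query},
        )
        resp.raise_for_status()
        return resp.json()

    # "payments" stands in for {{service}}; metric and tag names depend on your setup.
    cpu = query_metric("avg:kubernetes.cpu.usage.total{kube_namespace:payments}")
    mem = query_metric("avg:kubernetes.memory.usage{kube_namespace:payments}")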

(5) Kubernetes Resource Usage (Kubernetes)

  • Retrieve memory and CPU usage of all pods in the {{service}} namespace using the Kubernetes Agent. Highlight any pods nearing resource limits.
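
For reference, a sketch of a direct call to the metrics.k8s.io API that this step roughly corresponds to (it assumes metrics-server is installed in the cluster; the endpoint, token, and CA path are placeholders from the Setup section):

    import requests

    K8S_API_SERVER = "https://my-cluster.example.com:6443"  # placeholder API server URL
    K8S_BEARER_TOKEN = "..."                                # long-lived ServiceAccount token
    K8S_CA_CERT_PATH = "/etc/runbook/ca.crt"                # placeholder path to the cluster CA

    def pod_usage(namespace: str) -> list[dict]:
        """Return current CPU/memory usage per pod from the metrics.k8s.io API."""
        resp = requests.get(
            f"{K8S_API_SERVER}/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods",
            headers={"Authorization": f"Bearer {K8S_BEARER_TOKEN}"},
            verify=K8S_CA_CERT_PATH,
        )
        resp.raise_for_status()
        return [
            {"pod": item["metadata"]["name"], "containers": item["containers"]}
            for item in resp.json()["items"]
        ]

Each container entry reports usage.cpu and usage.memory; flagging pods near their limits additionally requires reading the limits from the pod specs, which is omitted here.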

(6) Latency Check (Datadog)

  • Run the following query to detect latency spikes in the last 30 minutes using the Datadog Agent:
trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server{service:{{service}}}

Evaluate average and peak latency values.
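
The same timeseries endpoint from step (4) can run this query; a sketch that reuses the hypothetical query_metric helper and a placeholder "payments" service:

    # Reuses query_metric from the step (4) sketch; "payments" stands in for {{service}}.
    latency = query_metric(
        "trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server"
        "{service:payments}",
        window_s=30 * 60,
    )
    points = [p[1] for s in latency.get("series", []) for p in s["pointlist"] if p[1] is not None]
    if points:
        print(f"avg={sum(points) / len(points):.4f}  peak={max(points):.4f}")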

(7) Top Error Logs (Datadog)

  • Use this query to retrieve top error logs from the last 15 minutes using the Datadog Agent: status:error service:{{service}}
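
A sketch of a roughly equivalent call to the Datadog Logs Search API (the 15-minute window comes from this step; the page size and function name are illustrative):

    import requests

    DD_API_KEY = "..."  # Datadog API key
    DD_APP_KEY = "..."  # Datadog application key

    def search_error_logs(service: str, window: str = "now-15m", limit: int = 25) -> list[dict]:
        """Return the most recent error logs for a service from Datadog Logs Search."""
        resp = requests.post(
            "https://api.datadoghq.com/api/v2/logs/events/search",
            headers={"DD-API-KEY": DD_API_KEY, "DD-APPLICATION-KEY": DD_APP_KEY},
            json={
                "filter": {"query": f"status:error service:{service}", "from": window, "to": "now"},
                "sort": "-timestamp",
                "page": {"limit": limit},
            },
        )
        resp.raise_for_status()
        return resp.json()["data"]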

(8) Full Error Log Analysis (Datadog)

  • Check all error logs with the following query: service:{{service}} status:error

  • Review for any recurring or critical error patterns.
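
One way to surface recurring patterns is to group the returned messages by their first line, as in this sketch (it reuses the hypothetical search_error_logs helper from step (7); the wider window and larger page size are assumptions):

    from collections import Counter

    # Reuses search_error_logs from the step (7) sketch over a wider window and page.
    logs = search_error_logs("payments", window="now-1h", limit=1000)
    patterns = Counter(
        (log["attributes"].get("message") or "<no message>").splitlines()[0][:120]
        for log in logs
    )
    for message, count in patterns.most_common(5):
        print(f"{count:>5}  {message}")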

(9) Slack Notification

  • Post a message to #on-call tagging @team-oncall with the following structure:
📘 Runbook Executed: {{service}} Service Diagnostics

📅 Date & Time: {{execution_time}}
🔧 Service: {{service}}
🌐 Environment: {{environment_name}}
📁 Repository: {{service}}
🚨 PagerDuty Incident: {{incident_id}} ({{incident_status}})

Summary:
All diagnostic steps completed.

Detailed Steps:
- PagerDuty Analysis - ✅ Incident details retrieved
- Health Check - ✅ Healthy response from endpoint
- Deployment Check - ✅ Deployment found within last 20 min
- Datadog Resource Usage - ✅ Below threshold
- Kubernetes Pod Usage - ✅ No anomalies
- Latency - ✅ No spikes detected
- Top Error Logs - ✅ No critical issues
- Log Analysis - ✅ Clean

Result: ✅ SUCCESS
  • For FAILURE or CRITICAL, update the result line accordingly and include this footer:
:alert: @team-oncall please review the results of this runbook execution.
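
The Slack Agent posts this via chat.postMessage; a minimal sketch (the channel is the one named in this step, and rendering the template above into text happens beforehand). Note that tagging a user group such as @team-oncall through the API requires the <!subteam^GROUP_ID> mention syntax rather than the literal handle:

    import requests

    SLACK_BOT_TOKEN = "..."  # bot token with chat:write scope

    def post_runbook_summary(text: str, channel: str = "#on-call") -> None:
        """Post the rendered runbook summary to Slack."""
        resp = requests.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {SLACK_BOT_TOKEN}"},
            json={"channel": channel, "text": text},
        )
        resp.raise_for_status()
        body = resp.json()
        if not body.get("ok"):
            raise RuntimeError(f"Slack API error: {body.get('error')}")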

Alexis Warner

Marketing

May 30, 2025

5 min read

Categories

    devops, observability, monitoring, alerts, slack, github, datadog
