Investigate your PagerDuty alerts


Run automated health checks, deployment analysis, and performance diagnostics for your services, then summarize the results and alert on-call teams via Slack.

TL;DR

This runbook performs a complete diagnostic of a service, including health checks, recent deployment verification, resource monitoring, latency tracking, and error log analysis. After execution, results are shared with the on-call team in Slack.

Who is this for?

SREs, DevOps teams, and platform engineers responsible for maintaining the reliability and performance of backend services.

What problem does this solve?

Manually checking service health, recent deployments, and performance indicators across multiple systems is time-consuming and error-prone.

This runbook solves:

  • Lack of centralized diagnostic automation
  • Delays in identifying critical issues
  • Inconsistent on-call communication during incidents

What this workflow accomplishes

  • Confirms service health via HTTP check
  • Checks for deployments in the past 20 minutes
  • Audits CPU and memory usage via Datadog and Kubernetes
  • Identifies latency issues and error log patterns
  • Summarizes results and alerts via Slack

Integrations

This runbook uses the following integrations:

  • PagerDuty Agent: Retrieves incident and alert details
  • GitHub Agent: Verifies recent deployments
  • Datadog Agent: Gathers metrics and logs
  • Kubernetes Agent: Monitors pod resource consumption
  • Slack Agent: Sends structured reports to the on-call team

Setup

  • PagerDuty:
    • API token with read access to incidents and services
    • Required permissions: Read incidents, services, and escalation policies
  • GitHub:
    • Access token with deployment read permissions
  • Datadog:
    • API key and App key
    • Required scopes: Metrics and Logs read access
  • Kubernetes:
    • Bearer Token - A long-lived ServiceAccount Token
    • Cluster CA Certificate - The cluster’s root certificate for TLS verification
    • API Server URL - The Kubernetes cluster endpoint URL
    • Access to the specific namespace
  • Slack:
    • Bot token with chat:write scope
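
If you wire these credentials into your own scripts rather than into the agent configuration, here is a minimal sketch of the environment variables the reference snippets below assume (all variable names are illustrative, not product requirements):

    import os

    # Illustrative environment variable names; adjust to however you store secrets.
    PAGERDUTY_TOKEN  = os.environ["PAGERDUTY_API_TOKEN"]   # read-only REST API token
    GITHUB_TOKEN     = os.environ["GITHUB_TOKEN"]          # token with deployment read access
    DD_API_KEY       = os.environ["DD_API_KEY"]            # Datadog API key
    DD_APP_KEY       = os.environ["DD_APP_KEY"]            # Datadog application key
    K8S_API_SERVER   = os.environ["K8S_API_SERVER"]        # Kubernetes API server URL
    K8S_BEARER_TOKEN = os.environ["K8S_BEARER_TOKEN"]      # long-lived ServiceAccount token
    K8S_CA_CERT_PATH = os.environ["K8S_CA_CERT_PATH"]      # path to the cluster CA certificate
    SLACK_BOT_TOKEN  = os.environ["SLACK_BOT_TOKEN"]       # bot token with chat:write scope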

Runbook Template

📚 runbook.mdx
Runbook

Objective: Run health and performance diagnostics for service {{service}} (or for all services at once).

Steps:

(1) Retrieve PagerDuty Incident Details

  • Use the PagerDuty Agent to fetch incident details for service {{service}}.
  • Gather information about:
    • Current incident status and priority
    • Time since incident creation
    • Assigned escalation policy
    • Recent incident history for the service
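
The PagerDuty Agent gathers this automatically; for reference, a minimal sketch of a roughly equivalent REST call (the function and variable names are illustrative):

    import requests

    PAGERDUTY_TOKEN = "..."  # read-only REST API token from the Setup section

    def fetch_open_incidents(service_id: str) -> list[dict]:
        """Return triggered/acknowledged incidents for one PagerDuty service, newest first."""
        resp = requests.get(
            "https://api.pagerduty.com/incidents",
            headers={
                "Authorization": f"Token token={PAGERDUTY_TOKEN}",
                "Accept": "application/vnd.pagerduty+json;version=2",
            },
            params={
                "service_ids[]": service_id,
                "statuses[]": ["triggered", "acknowledged"],
                "sort_by": "created_at:desc",
            },
        )
        resp.raise_for_status()
        return resp.json()["incidents"]

Each incident record carries status, priority, created_at, and the assigned escalation policy, which covers the fields listed above.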

(2) Verify Service Health

  • Execute a GET request using the HTTP Agent to:
{{service_url}}/healthy

Confirm that the service returns a healthy status.
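
The HTTP Agent issues a plain GET; a roughly equivalent sketch (the /healthy path comes from the template above, and treating any 2xx response as healthy is an assumption):

    import requests

    def check_health(service_url: str) -> bool:
        """GET <service_url>/healthy and treat any 2xx response as healthy."""
        resp = requests.get(f"{service_url}/healthy", timeout=5)
        return resp.ok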

(3) Fetch Recent Deployments (GitHub)

  • Use the GitHub Agent to query for deployments of the {{service}} service within the last 20 minutes.
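
For reference, a sketch of one way to do this against the GitHub REST API, assuming the repository is named after the service and lives under a hypothetical my-org owner:

    from datetime import datetime, timedelta, timezone

    import requests

    GITHUB_TOKEN = "..."  # token with deployment read access
    ORG = "my-org"        # hypothetical owner; adjust to your organization

    def recent_deployments(service: str, minutes: int = 20) -> list[dict]:
        """Return deployments of <ORG>/<service> created within the last `minutes` minutes."""
        resp = requests.get(
            f"https://api.github.com/repos/{ORG}/{service}/deployments",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github+json",
            },
        )
        resp.raise_for_status()
        cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes)
        return [
            d for d in resp.json()
            if datetime.fromisoformat(d["created_at"].replace("Z", "+00:00")) >= cutoff
        ]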

(4) Analyze Resource Usage (Datadog)

  • Check CPU and memory usage of the deployed application using the Datadog Agent over the past 15 minutes. Ensure values are below 80% for both metrics.
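
A sketch of the underlying timeseries query; the kubernetes.* metric names and the "payments" scope below are placeholders, and comparing the series against the 80% threshold is left to the summarization step:

    import time

    import requests

    DD_API_KEY = "..."  # Datadog API key
    DD_APP_KEY = "..."  # Datadog application key

    def query_metric(query: str, window_s: int = 15 * 60) -> dict:
        """Run a Datadog timeseries query over the last `window_s` seconds."""
        now = int(time.time())
        resp = requests.get(
            "https://api.datadoghq.com/api/v1/query",
            headers={"DD-API-KEY": DD_API_KEY, "DD-APPLICATION-KEY": DD_APP_KEY},
            params={"from": now - window_s, "to": now, "query": query},
        )
        resp.raise_for_status()
        return resp.json()

    # "payments" stands in for {{service}}; metric and tag names depend on your setup.
    cpu = query_metric("avg:kubernetes.cpu.usage.total{kube_namespace:payments}")
    mem = query_metric("avg:kubernetes.memory.usage{kube_namespace:payments}")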

(5) Kubernetes Resource Usage (Kubernetes)

  • Retrieve memory and CPU usage of all pods in the {{service}} namespace using the Kubernetes Agent. Highlight any pods nearing resource limits.
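
For reference, a sketch of a direct call to the metrics.k8s.io API that this step roughly corresponds to (it assumes metrics-server is installed in the cluster; the endpoint, token, and CA path are placeholders from the Setup section):

    import requests

    K8S_API_SERVER = "https://my-cluster.example.com:6443"  # placeholder API server URL
    K8S_BEARER_TOKEN = "..."                                # long-lived ServiceAccount token
    K8S_CA_CERT_PATH = "/etc/runbook/ca.crt"                # placeholder path to the cluster CA

    def pod_usage(namespace: str) -> list[dict]:
        """Return current CPU/memory usage per pod from the metrics.k8s.io API."""
        resp = requests.get(
            f"{K8S_API_SERVER}/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods",
            headers={"Authorization": f"Bearer {K8S_BEARER_TOKEN}"},
            verify=K8S_CA_CERT_PATH,
        )
        resp.raise_for_status()
        return [
            {"pod": item["metadata"]["name"], "containers": item["containers"]}
            for item in resp.json()["items"]
        ]

Each container entry reports usage.cpu and usage.memory; flagging pods near their limits additionally requires reading the limits from the pod specs, which is omitted here.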

(6) Latency Check (Datadog)

  • Run the following query to detect latency spikes in the last 30 minutes using the Datadog Agent:
trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server{service:{{service}}}

Evaluate average and peak latency values.
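
The same timeseries endpoint from step (4) can run this query; a sketch that reuses the hypothetical query_metric helper and a placeholder "payments" service:

    # Reuses query_metric from the step (4) sketch; "payments" stands in for {{service}}.
    latency = query_metric(
        "trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server"
        "{service:payments}",
        window_s=30 * 60,
    )
    points = [p[1] for s in latency.get("series", []) for p in s["pointlist"] if p[1] is not None]
    if points:
        print(f"avg={sum(points) / len(points):.4f}  peak={max(points):.4f}")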

(7) Top Error Logs (Datadog)

  • Use this query to retrieve top error logs from the last 15 minutes using the Datadog Agent: status:error service:{{service}}
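
A sketch of a roughly equivalent call to the Datadog Logs Search API (the 15-minute window comes from this step; the page size and function name are illustrative):

    import requests

    DD_API_KEY = "..."  # Datadog API key
    DD_APP_KEY = "..."  # Datadog application key

    def search_error_logs(service: str, window: str = "now-15m", limit: int = 25) -> list[dict]:
        """Return the most recent error logs for a service from Datadog Logs Search."""
        resp = requests.post(
            "https://api.datadoghq.com/api/v2/logs/events/search",
            headers={"DD-API-KEY": DD_API_KEY, "DD-APPLICATION-KEY": DD_APP_KEY},
            json={
                "filter": {"query": f"status:error service:{service}", "from": window, "to": "now"},
                "sort": "-timestamp",
                "page": {"limit": limit},
            },
        )
        resp.raise_for_status()
        return resp.json()["data"]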

(8) Full Error Log Analysis (Datadog)

  • Check all error logs with the following query: service:{{service}} status:error

  • Review for any recurring or critical error patterns.
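
One way to surface recurring patterns is to group the returned messages by their first line, as in this sketch (it reuses the hypothetical search_error_logs helper from step (7); the wider window and larger page size are assumptions):

    from collections import Counter

    # Reuses search_error_logs from the step (7) sketch over a wider window and page.
    logs = search_error_logs("payments", window="now-1h", limit=1000)
    patterns = Counter(
        (log["attributes"].get("message") or "<no message>").splitlines()[0][:120]
        for log in logs
    )
    for message, count in patterns.most_common(5):
        print(f"{count:>5}  {message}")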

(9) Slack Notification

  • Post a message to #on-call tagging @team-oncall with the following structure:
📘 Runbook Executed: {{service}} Service Diagnostics

📅 Date & Time: {{execution_time}}
🔧 Service: {{service}}
🌐 Environment: {{environment_name}}
📁 Repository: {{service}}
🚨 PagerDuty Incident: {{incident_id}} ({{incident_status}})

Summary:
All diagnostic steps completed.

Detailed Steps:
- PagerDuty Analysis - ✅ Incident details retrieved
- Health Check - ✅ Healthy response from endpoint
- Deployment Check - ✅ Deployment found within last 20 min
- Datadog Resource Usage - ✅ Below threshold
- Kubernetes Pod Usage - ✅ No anomalies
- Latency - ✅ No spikes detected
- Top Error Logs - ✅ No critical issues
- Log Analysis - ✅ Clean

Result: ✅ SUCCESS
  • For FAILURE or CRITICAL, update the result line accordingly and include this footer:
:alert: @team-oncall please review the results of this runbook execution.
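
The Slack Agent posts this via chat.postMessage; a minimal sketch (the channel is the one named in this step, and rendering the template above into text happens beforehand). Note that tagging a user group such as @team-oncall through the API requires the <!subteam^GROUP_ID> mention syntax rather than the literal handle:

    import requests

    SLACK_BOT_TOKEN = "..."  # bot token with chat:write scope

    def post_runbook_summary(text: str, channel: str = "#on-call") -> None:
        """Post the rendered runbook summary to Slack."""
        resp = requests.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {SLACK_BOT_TOKEN}"},
            json={"channel": channel, "text": text},
        )
        resp.raise_for_status()
        body = resp.json()
        if not body.get("ok"):
            raise RuntimeError(f"Slack API error: {body.get('error')}")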

Alexis Warner

Marketing

May 30, 2025

5 min read

Categories

    devops, observability, monitoring, alerts, slack, github, datadog
