bearify

Service Health & Deployment Diagnostics

GitHub, Datadog, Kubernetes, Slack

Automate health checks, resource diagnostics, and deployment visibility for your services. Get real-time insights and on-call notifications via Slack.

TL;DR

This runbook automates a full diagnostic sweep for your services: health checks, deployment visibility, resource monitoring, and latency and error rate analysis. After execution, it posts a detailed summary to Slack for on-call awareness.

Who is this for?

Site Reliability Engineers (SREs), platform engineers, and DevOps professionals responsible for infrastructure uptime, performance monitoring, and incident response.

What problem does this solve?

Without automation, monitoring health, deployment activity, and performance metrics across multiple platforms (Datadog, Kubernetes, GitHub) can be slow and error-prone.

This runbook addresses:

  • Manual health verification across tools
  • Lag in detecting performance issues
  • Delays in team notification during failures or anomalies

What this workflow accomplishes

  • Verifies API health of your services
  • Checks for new deployments within the past 20 minutes
  • Audits CPU and memory usage using Datadog and Kubernetes
  • Identifies latency and error rate anomalies
  • Scans recent logs for application errors
  • Sends structured summary reports to a Slack channel with on-call alerts

Integrations

This runbook uses the following integrations:

  • GitHub Agent: Checks for recent deployments
  • Datadog Agent: Collects resource usage, latency, and error metrics
  • Kubernetes Agent: Retrieves pod-level resource usage
  • Slack Agent: Sends detailed runbook results and alerts

Setup

  • GitHub:
    • Repo access token with deployment read permissions
  • Datadog:
    • Valid API key and App key
    • Required scopes: Metrics read, Logs read
  • Kubernetes:
    • Bearer Token - A long-lived ServiceAccount Token
    • Cluster CA Certificate - The cluster’s root certificate for TLS verification
    • API Server URL - The Kubernetes cluster endpoint URL
    • Access to query pod metrics in the workspaces namespace
  • Slack:
    • Bot token with chat:write permission
    • Channel ID: #random

Runbook Template

📚 runbook.mdx
Runbook

Objective: Automate deployment verification for {{service}} (can be all services) and send detailed summary reports to Slack.

Steps:

(1) Verify Service Health

  • Execute a GET request using the HTTP Agent to:
{{service_url}}/healthy

Confirm that the service returns a healthy status.
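
The health check in step 1 can be sketched in Python as follows. This is a minimal illustration, not how the HTTP Agent works internally: the function name `check_health`, the `opener` parameter (there only so the call can be stubbed), and the rule that any 2xx response means "healthy" are all assumptions.

```python
from urllib import request


def check_health(service_url: str, timeout: float = 5.0, opener=request.urlopen) -> bool:
    """Return True when GET {service_url}/healthy answers with a 2xx status."""
    url = service_url.rstrip("/") + "/healthy"
    try:
        with opener(url, timeout=timeout) as response:
            # Assumption: any 2xx status counts as healthy.
            return 200 <= response.status < 300
    except OSError:
        # urlopen raises URLError/HTTPError (both OSError subclasses) for
        # connection failures, timeouts, and 4xx/5xx responses.
        return False
```

Calling `check_health("https://example.internal")` would request `https://example.internal/healthy`; the host here is a placeholder.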

(2) Fetch Recent Deployments (GitHub)

  • Query for deployments of the {{service}} service within the last 20 minutes using the GitHub Agent.
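
The 20-minute window in step 2 could be applied roughly like this. The sketch assumes each deployment record carries an ISO 8601 `created_at` timestamp, as records from GitHub's deployments listing do; the function name `recent_deployments` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_deployments(deployments, now=None, window_minutes=20):
    """Keep deployments whose created_at falls within the last window_minutes."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=window_minutes)
    recent = []
    for deployment in deployments:
        # Timestamps look like 2025-05-30T11:45:00Z; replace the trailing Z with
        # an explicit offset so fromisoformat accepts it on older Python versions.
        created = datetime.fromisoformat(deployment["created_at"].replace("Z", "+00:00"))
        if cutoff <= created <= now:
            recent.append(deployment)
    return recent
```

An empty result means no deployment landed in the window, which is itself worth reporting in the summary.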

(3) Analyze Resource Usage (Datadog)

  • Check CPU and memory usage of the deployed application over the past 20 minutes using the Datadog Agent. Ensure values are below 80% for both metrics.
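
As a rough illustration of the 80% rule, the check below takes percentage samples per metric and reports any metric whose peak reaches the threshold. The input shape (a dict of metric name to samples) and the name `over_threshold` are assumptions; the samples would come from whatever the Datadog Agent returns.

```python
def over_threshold(samples_by_metric, threshold=80.0):
    """Return {metric: peak} for every metric whose peak reaches the threshold."""
    breaches = {}
    for metric, samples in samples_by_metric.items():
        values = [v for v in samples if v is not None]  # skip gaps in the series
        if values and max(values) >= threshold:
            # The runbook asks for values below 80%, so 80.0 itself is a breach.
            breaches[metric] = max(values)
    return breaches
```

An empty dict means both metrics stayed below the threshold for the whole window.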

(4) Kubernetes Resource Usage (Kubernetes)

  • Retrieve memory and CPU usage of all pods in the {{service}} namespace using the Kubernetes Agent. Highlight any pods exceeding resource limits.
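
Comparing usage against limits means converting Kubernetes quantity strings such as `250m` or `256Mi` into plain numbers first. The sketch below handles the common suffixes only; the flattened pod records (`name`, `usage`, `limits`) and both function names are hypothetical, not the Kubernetes Agent's actual output.

```python
# Multipliers for common Kubernetes quantity suffixes (not the full grammar).
_SUFFIXES = {
    "n": 1e-9, "u": 1e-6, "m": 1e-3,
    "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def parse_quantity(text):
    """Convert a quantity such as '250m' or '256Mi' to a float (cores or bytes)."""
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return float(text)


def pods_over_limit(pods):
    """Return names of pods whose CPU or memory usage exceeds the configured limit."""
    flagged = []
    for pod in pods:
        for resource in ("cpu", "memory"):
            limit = pod["limits"].get(resource)
            if limit is None:
                continue  # no limit set, nothing to compare against
            if parse_quantity(pod["usage"][resource]) > parse_quantity(limit):
                flagged.append(pod["name"])
                break  # one breach is enough to flag the pod
    return flagged
```

Pods without limits are skipped rather than flagged; whether that is the right call depends on how strictly limits are enforced in the cluster.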

(5) Latency Check (Datadog)

  • Run the following query to detect latency spikes in the last 20 minutes using the Datadog Agent:
trace.go.opentelemetry.io_contrib_instrumentation_github.com_gin_gonic_gin_otelgin.server{service:{{service}}}

Check for latency anomalies or spikes over the last 20 minutes.
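
The runbook does not define what counts as a spike. One simple option, shown here purely as an assumption, is to flag any sample more than a fixed multiple of the window's median; the name `find_spikes` and the default factor of 3 are illustrative choices.

```python
from statistics import median


def find_spikes(latencies_ms, factor=3.0):
    """Return (index, value) pairs for samples above factor times the window median."""
    if not latencies_ms:
        return []
    baseline = median(latencies_ms)
    return [(i, v) for i, v in enumerate(latencies_ms) if v > factor * baseline]
```

The median is used as the baseline because a single outlier barely moves it, unlike the mean.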


(6) Error Rate Analysis (Datadog)

  • Use this query to retrieve top error logs from the last 20 minutes using the Datadog Agent: status:error service:{{service}}

Monitor error rates and identify any recent spikes.


(7) Log Analysis (Datadog)

  • Check all error logs with the following query: service:{{service}} status:error

Look for critical or recurring error messages.
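
Spotting recurring messages amounts to counting duplicates. A minimal sketch, assuming the error logs have already been reduced to a list of message strings; `recurring_errors`, `min_count`, and `top` are hypothetical names.

```python
from collections import Counter


def recurring_errors(messages, min_count=2, top=5):
    """Return up to `top` (message, count) pairs seen at least min_count times."""
    counts = Counter(message.strip() for message in messages)
    return [(msg, n) for msg, n in counts.most_common(top) if n >= min_count]
```

Messages that embed request IDs or timestamps will not group this way unless those parts are stripped first.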


(8) Slack Notification

  • Send a message to Slack channel #on-call, tagging @team-oncall with the following format:
🚨 Runbook Executed: {{service}} Service Diagnostics

📅 Date & Time: {{execution_time}}
🔧 Service: {{service}}
🌐 Environment: {{environment_name}}
📁 Repository: {{service}}

Summary:
All diagnostic checks have been completed.

Detailed Steps:
- Health Check - ✅ Passed
- Deployment Activity - ✅ Found recent deployment
- Datadog Resource Usage - ✅ CPU/Memory below threshold
- Kubernetes Resource Usage - ✅ All pods within safe range
- Latency Metrics - ✅ No spikes
- Error Rate - ✅ Normal
- Logs - ✅ No recent critical errors

Result: ✅ SUCCESS
  • If failures or anomalies are detected, adjust Result to:
    • ❌ FAILURE
    • ⚠️ CRITICAL

Add footer:

:alert: @team-oncall please review the runbook execution results.
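
The message format above can be assembled from the individual check results. The sketch below is one way to do it, under the assumption that each check is reduced to a `(name, passed, detail)` tuple; `render_summary` is a hypothetical helper, and it maps any failed check to ❌ FAILURE, leaving the ⚠️ CRITICAL case to the caller because the runbook does not say when it applies.

```python
def render_summary(service, environment, executed_at, checks):
    """Build the Slack message text from (check name, passed, detail) tuples."""
    lines = [
        f"🚨 Runbook Executed: {service} Service Diagnostics",
        "",
        f"📅 Date & Time: {executed_at}",
        f"🔧 Service: {service}",
        f"🌐 Environment: {environment}",
        f"📁 Repository: {service}",
        "",
        "Summary:",
        "All diagnostic checks have been completed.",
        "",
        "Detailed Steps:",
    ]
    for name, passed, detail in checks:
        lines.append(f"- {name} - {'✅' if passed else '❌'} {detail}")
    all_passed = all(passed for _, passed, _ in checks)
    lines += ["", f"Result: {'✅ SUCCESS' if all_passed else '❌ FAILURE'}"]
    lines += ["", ":alert: @team-oncall please review the runbook execution results."]
    return "\n".join(lines)
```

The returned string is plain text, so it can be passed as the message body to whichever Slack client posts to the channel.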

Alexis Warner

Marketing

May 30, 2025

5 min read

Categories

    devops

    observability

    diagnostics

    alerts

    slack

    github

    datadog

Agents Used

HTTP Agent, GitHub Agent, Datadog Agent, Kubernetes Agent, Slack Agent

2025 © Bearify All Rights Reserved
