Observability as Code

The Observability as Code (OaC) Suite generates reliability dashboards, manages performance alerts, and handles incident response. It is configured through the abc.manifest.yml file and uses Datadog, OpsGenie, and Slack.

Key benefits:

  • Seamless Integration: Implement in 30 minutes or less, with support available.
  • Real-Time Monitoring: Automatically generate dashboards for key metrics, with the ability to drill into performance logs.
  • Alerting: Receive notifications when Service Level Objectives (SLOs) are breached, among other conditions.
  • On-Call Scheduling: Set up team member support rotations for incident response, managed automatically and dynamically via Slack channels.
  • Canary Dashboards: Monitoring for canary deployments (10% traffic tests) is auto-created as separate dashboards.

Components

Currently, the OaC Suite consists of the following components and will continue to evolve over time:

  • Manifest-driven Pipeline: A versioned and validated file serves as the single source of truth and will be embedded in your team’s GitHub repository. The Manifest file drives the Observability-as-Code (OaC) pipeline, which integrates with the PLATSVCS stack. This replaces app-config.json and gives you a central file to manage all key config variables and update all things observability.

  • Extended Dashboards: Customized views of your entire web of interdependent resources and services (via Datadog). Gain access to live performance visualizations for Lambdas, Fargates, Kinesis streams, OpenSearch, Stripe, Contentful, and more. Extended Dashboards require you to define your dependencies and apply service tags following the guidelines established in ADR 67: Required Tags for AWS Resources.

  • Alerts & Escalations: This component centralizes management of all alerts, manages escalations based on severity, and will be extended over time to allow more customization. Notifications in a variety of forms help you track issues in real time and over extended periods. The initial set includes the following:

    • Four default monitors
    • Two default service-level objectives (SLOs)
    • A rotating on-call schedule for incident response, managed dynamically in Slack channels (via OpsGenie)

  • Canary Dashboards: Canned dashboards to monitor and troubleshoot canary deployments (10% traffic deploys). Using top-level drop-downs for environment, region, and version, you can observe the performance of your deployments, compare them to previous versions, keep an eye out for alarms, and monitor dependencies for upstream issues. These have the same requirements as Extended Dashboards; the remaining requirements are configured by app and observability.type.
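
As a rough illustration of the manifest-driven approach described above, the fragment below sketches what an abc.manifest.yml might look like. Only the app and observability.type keys are named on this page. Their values, the shape of app, and the dependencies key are assumptions made purely for illustration, so treat this as a sketch and check the manifest schema for the real field names.

```yaml
# abc.manifest.yml (illustrative sketch only, NOT the actual OaC schema)
# `app` and `observability.type` are the only keys named in this document.
# Every value below, and the `dependencies` key, is a hypothetical placeholder.
app: my-service            # hypothetical application name
observability:
  type: extended           # hypothetical value selecting the dashboard type
  dependencies:            # hypothetical key: resources to show on Extended Dashboards
    - lambda
    - kinesis
```

Keeping these settings in one versioned file in your repository means that changes to dashboards and alerts can go through the same review process as code changes.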