Support¶
Platform Engineering includes five engineering teams, each responsible for the health and maintenance of several systems. These systems range in criticality, and each team provides support accordingly.
Business Hours: Monday-Friday, 9:00 AM to 8:00 PM ET
Business hours support:
During business hours, or for issues that are Moderate severity or lower, create a support request. Platform Engineering teams handle these types of requests during normal business hours.
Non-business hours support:
When an incident occurs outside of business hours, page the appropriate team in Opsgenie. To determine which Platform Engineering schedule to page, see Teams, schedules, and supported systems.
Note
We understand that situations may occur that require exceptions. If a lower severity issue or system is blocking incident resolution, don't hesitate to page the appropriate team.
To create an alert in Opsgenie:
- Click the Alerts tab in Opsgenie.
- Click Create alert.
- Provide a detailed message and set the Responder to the appropriate Opsgenie team.
- Click Create.
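For teams that prefer to script this, the same alert can in principle be created through the Opsgenie Alert API instead of the UI. The following is a minimal sketch, not an endorsed Platform Engineering workflow. It assumes you have an API key from an Opsgenie API integration (the `YOUR_API_KEY` placeholder is hypothetical) and that your account uses the default US endpoint; the helper name `build_alert_request` is illustrative. The request is only built here, not sent.

```python
import json
import urllib.request

# Default (US) Opsgenie Alert API endpoint. EU accounts use api.eu.opsgenie.com.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_alert_request(api_key: str, message: str, team: str,
                        description: str = "") -> urllib.request.Request:
    """Build (but do not send) a Create Alert request routed to one Opsgenie team."""
    payload = {
        "message": message,          # short summary shown as the alert title
        "description": description,  # detailed context for the on-call engineer
        # Equivalent of setting the Responder field in the UI.
        "responders": [{"name": team, "type": "team"}],
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_alert_request(
        api_key="YOUR_API_KEY",  # hypothetical placeholder, never commit real keys
        message="Datadog agents not reporting",
        team="Platform Engineering Observability",
        description="What is failing, since when, and what has been tried so far.",
    )
    print(req.get_method(), req.full_url)
    # To actually page the team, send it with: urllib.request.urlopen(req)
```

Building the request separately from sending it makes the payload easy to inspect before anyone gets paged.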
Incident severity levels and support process¶
Severity | Description | Support Window | Contact Method |
---|---|---|---|
Critical | Widespread, complete loss of functionality, or severe or persistent performance degradation. | 24/7 | Business hours: Open a support request in your cloud-platform-<workload> Slack channel. Non-business hours: Create an alert in Opsgenie to page the on-call engineer. |
High | Substantial reduction in performance, or an issue that impacts multiple users. | 24/7 | Business hours: Open a support request in your cloud-platform-<workload> Slack channel. Non-business hours: Create an alert in Opsgenie to page the on-call engineer. |
Moderate | Potential instability or moderate reduction in performance. | Business hours | Open a support request in your cloud-platform-<workload> Slack channel. |
Low | Inconvenient, but doesn't prevent users from continuing to work. | Business hours | Open a support request in your cloud-platform-<workload> Slack channel. |
Feature Requests | Suggestions, requests for improvements, or feedback on existing features. Also use this for general inquiries, such as workload access. Has no impact on performance or existing functionality. | Business hours | Open a feature request. |
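The routing in the table above can be sketched as a small decision function. This is an illustration only, under a few assumptions: the timestamp passed in is already expressed in Eastern Time, 8:00 PM exactly is treated as outside business hours, and the function names and return labels are hypothetical. It does not model the exception for lower severity issues that block incident resolution.

```python
from datetime import datetime, time

BUSINESS_OPEN = time(9, 0)    # 9:00 AM ET
BUSINESS_CLOSE = time(20, 0)  # 8:00 PM ET (assumed exclusive)

# Contact method per severity: (during business hours, outside business hours).
ROUTING = {
    "critical": ("support request", "opsgenie alert"),
    "high": ("support request", "opsgenie alert"),
    "moderate": ("support request", "support request"),
    "low": ("support request", "support request"),
    "feature requests": ("feature request", "feature request"),
}


def is_business_hours(now_et: datetime) -> bool:
    """True when now_et (already in Eastern Time) is Monday-Friday, 9:00 AM to 8:00 PM."""
    return now_et.weekday() < 5 and BUSINESS_OPEN <= now_et.time() < BUSINESS_CLOSE


def contact_method(severity: str, now_et: datetime) -> str:
    """Return the contact method from the severity table for the given time."""
    key = severity.strip().lower()
    if key not in ROUTING:
        raise ValueError(f"Unknown severity: {severity!r}")
    during, outside = ROUTING[key]
    return during if is_business_hours(now_et) else outside


if __name__ == "__main__":
    # A Wednesday at 10:00 PM ET: a Critical issue is routed to Opsgenie.
    print(contact_method("Critical", datetime(2024, 1, 3, 22, 0)))
```

Moderate and Low issues always resolve to a support request, which is then handled during the next business-hours window.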
Teams, schedules, and supported systems¶
Escalate directly to our partners
In case of emergencies where critical (internal or external) partner services are in a degraded or failing state:
- AWS: please open a support case with AWS and escalate to the TAMs through #ext-abc-aws-collab.
- Datadog: please escalate directly to #ext-company-datadog.
Otherwise, page the schedule(s) below as indicated in these instructions.
Cloud Foundation¶
- Opsgenie Team: Platform Engineering Foundations
System | Severity | Impact |
---|---|---|
IAM Permissions | High | Inability to deploy, roll back, or alter production-impacting features and infrastructure |
AWS Network | High | Core network-as-service infrastructure stability issues (DNS, ENI Counts, VPC-related issues) that are production-impacting |
IAM Permissions | Low | Inability to build, deploy, or provision resources for non-production-impacting changes and features |
Delivery¶
- Opsgenie Team: Platform Engineering Delivery
System | Severity | Impact |
---|---|---|
GitHub Actions | High | Inability to perform builds or deployments |
Artifactory | High | Inability to build or deploy new versions |
Base Images | Low | Build or deployment failures |
DevEx¶
- Opsgenie Team: Platform Eng DevEx
System | Severity | Impact |
---|---|---|
SDE | High | Inability to log in to AWS from workstation |
CP Docs | Moderate | Inability to access Cloud Platform documentation, including troubleshooting guides |
SonarQube | Low | Build failures for services running quality checks |
Sourcegraph | Low | Inability for some (internal) services and engineers to execute code search queries |
ghe-team-sync | Low | Users lacking desired permissions in GitHub |
Rancher Desktop | Low | Inability to run containers on workstations |
Enablement¶
- Opsgenie Team: Platform Engineering Enablement
System | Severity | Impact |
---|---|---|
Region Evacuation | High | PSO is unable to execute region evacuation |
JWT Key Management | High | Many core services depend on this system to retrieve public and private JWTs. Failure will manifest as issues downloading JWTs or deployment failures on some user services, core services and commerce-services. |
IaC libraries | Low | Deployment failures related to outdated or incorrect use of abc-workloads, abc-cdk-constructs, or abc-cdk-lib libraries |
Slack support bot | Low | Engineering teams unable to open Cloud Platform support requests |
Observability¶
- Opsgenie Team: Platform Engineering Observability
System | Severity | Impact |
---|---|---|
Datadog | High | Loss of observability, monitoring, and alerting capabilities |
GitHub¶
Note
This is for system failures of the type described in Impact below. When necessary, page Delivery or DevEx for other GitHub-related topics.
- Opsgenie Team: PE-GitHub-Support
System | Severity | Impact |
---|---|---|
GitHub | Critical | Engineers unable to pull, push, or review code. Engineers unable to perform builds or deployments. |
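For tooling that needs to pick the right Opsgenie team automatically, the ownership tables above can be captured as a simple lookup. This is a minimal sketch: the team and system names are copied from the tables, while the `OPSGENIE_TEAMS` structure and the `team_for_system` helper are hypothetical names introduced for illustration.

```python
# Opsgenie team -> systems it supports, transcribed from the tables above.
OPSGENIE_TEAMS = {
    "Platform Engineering Foundations": ["IAM Permissions", "AWS Network"],
    "Platform Engineering Delivery": ["GitHub Actions", "Artifactory", "Base Images"],
    "Platform Eng DevEx": [
        "SDE", "CP Docs", "SonarQube", "Sourcegraph",
        "ghe-team-sync", "Rancher Desktop",
    ],
    "Platform Engineering Enablement": [
        "Region Evacuation", "JWT Key Management",
        "IaC libraries", "Slack support bot",
    ],
    "Platform Engineering Observability": ["Datadog"],
    "PE-GitHub-Support": ["GitHub"],
}


def team_for_system(system: str) -> str:
    """Return the Opsgenie team that supports the named system (case-insensitive, exact match)."""
    wanted = system.strip().lower()
    for team, systems in OPSGENIE_TEAMS.items():
        if wanted in (name.lower() for name in systems):
            return team
    raise KeyError(f"No Opsgenie team listed for system: {system!r}")


if __name__ == "__main__":
    print(team_for_system("Artifactory"))
```

Matching is exact on the system name, so "GitHub" and "GitHub Actions" resolve to different teams, as in the tables.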
Additional resources