Cardinality Explosion

In Observability (O11y), a Cardinality Explosion occurs when a tag or combination of tags on a metric generates an excessive number of unique metric and tag-value combinations, each of which counts as a custom metric. Excessive custom metrics increase cost, strain resources, and can degrade system responsiveness.

Key terms:

O11y - Shorthand for observability
Data Dimension - An attribute tracked on a metric for an application, such as server, status code, or user
Cardinality - The number of tag values associated with a tag key for a metric
Tag - A key-value pair attached to a metric (for example, region:emea); sometimes called a label in O11y products

Datadog custom metrics and billing

If a metric is not submitted from a Datadog integration, it’s considered a custom metric. A custom metric is identified by a unique combination of a metric’s name and tag values (including the host tag). In general, metrics sent using DogStatsD or through a custom Agent Check are considered custom metrics.

Warning

Your count of custom metrics scales with the most granular or detailed tag. Extremely granular custom metrics increase Datadog billing charges.

Example

You have a number_of_employees metric that counts the number of employees in an organization. This metric has department and region tags. These tags are considered low cardinality because of the broad nature of their classification. If you add an extremely granular tag, such as employee_id, to that same metric, the metric cardinality becomes unbounded and results in a Cardinality Explosion. Because of the high cardinality of the employee_id tag, Datadog billing for custom metrics increases.
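
The effect can be sketched in a few lines of Python. This is an illustration only, not Datadog's billing logic: it assumes a custom metric is one distinct combination of metric name and tag values, and the department names, region names, and the count of 50 employees are made up for the example.

```python
from itertools import product


def count_custom_metrics(points):
    """Count distinct (metric name, tag set) combinations in the submitted points."""
    return len({(name, tuple(sorted(tags.items()))) for name, tags in points})


# Hypothetical tag values, chosen only for illustration.
departments = ["engineering", "sales", "support"]
regions = ["emea", "apac"]

# Low cardinality: one series per department/region pair.
low_cardinality_points = [
    ("number_of_employees", {"department": d, "region": r})
    for d, r in product(departments, regions)
]

# High cardinality: adding employee_id creates one series per employee.
high_cardinality_points = [
    (
        "number_of_employees",
        {
            "department": departments[i % len(departments)],
            "region": regions[i % len(regions)],
            "employee_id": str(i),
        },
    )
    for i in range(50)
]

low_count = count_custom_metrics(low_cardinality_points)
high_count = count_custom_metrics(high_cardinality_points)
print(low_count, high_count)
```

Without employee_id the count is capped by the number of department and region pairs. With it, the count grows with every new employee.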

For more examples, see the Custom metrics billing topic from Datadog.

Best practices for custom metrics within Datadog

Mitigate or avoid high cardinality metrics:

Roll-up attributes - Narrow high cardinality metrics to a broader category.
Remove unnecessary metrics - Remove metrics that do not add value.

Tip

If your dashboard tracks so many metrics that it's hard to use, you probably have too much cardinality.

Monitor - Have a monitoring system in place that scans for metrics showing a significant increase in total labels.
Avoid metrics that can grow without bounds - Don't use metrics if you can't strictly limit the unique values, such as user IDs, IP addresses, or user input. Instead, record these by logging.
Be mindful when adding new metrics - Multiply your existing number of label combinations by the number of known values for the new label to determine whether it will significantly increase the label count.
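
Two of the practices above, rolling up attributes and estimating the impact of a new label before adding it, can be sketched as follows. The helper names `roll_up_status` and `estimate_after_adding_label` are hypothetical, and rolling HTTP status codes up to their class is just one possible broader category.

```python
def roll_up_status(code: int) -> str:
    """Roll a specific HTTP status code up to its class, for example 404 -> '4xx'."""
    return f"{code // 100}xx"


def estimate_after_adding_label(current_combinations: int, new_label_values: int) -> int:
    """Estimate the label combinations that result from adding one more label."""
    return current_combinations * new_label_values


# Eight distinct raw status codes collapse into four rolled-up values.
raw_codes = [200, 201, 204, 301, 404, 409, 500, 503]
rolled_up = sorted({roll_up_status(code) for code in raw_codes})
print(rolled_up)

# 16 existing combinations multiplied by a label with 100 known values.
estimated = estimate_after_adding_label(16, 100)
print(estimated)
```

Running the estimate before a change ships makes the multiplication explicit, which is easier than discovering it later on an invoice.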

Datadog's Metrics without Limits

Metrics without Limits is a Datadog feature for managing which tags are tracked on your metrics. It is available on the Metrics Summary page.

Tagging metrics

This functions as a roll-up mitigation strategy: it lets you group metrics and treat them as a single item. The default behavior is GROUP BY SUM and ROLLUP BY SUM.

Retrieval of previously tagged metrics

If you later need to view a metric that was previously modified by tagging, you can reindex it in the Manage Tags window.

Limit metrics using the Datadog Agent

You can supply the Datadog Agent with a configuration file that modifies tracked metrics before they are submitted to Datadog. See the Datadog documentation for more information.

Be mindful of "Generate New Metric"

The Logs view and Traces view each have a Generate New Metric button. Using it has the same implications as adding a new custom metric to an application, based on your current search parameters, including being billed as a custom metric. As an alternative, you can create Custom Analytics within the Analytics view, which is useful for temporary issues that you'll track only for a short term, such as days to weeks.

Cardinality explosion scenarios

The number of possible tag combinations is the product of the number of values in each data dimension, so every added dimension multiplies the total. Two scenarios commonly cause a Cardinality Explosion.

Scenario 1: Single High Cardinality Data Dimension

2 servers running 2 applications with 4 different status codes (low cardinality metrics)

Multiply together for possible combinations:

2 (servers) x 2 (applications) x 4 (status codes) = 16 possible tags for your O11y application to track

Now track each user's metrics individually, with a user base of 100 unique individuals (high cardinality metric)

2 (servers) x 2 (applications) x 4 (status codes) x 100 (users) = 1,600 possible tags

One high cardinality metric added 1,584 new tags to the O11y application.

Now increase the user base from 100 to 10,000 because our application launched in a new region

2 (servers) x 2 (applications) x 4 (status codes) x 10,000 (users) = 160,000 possible tags

Then increase the number of servers from 2 to 8 to accommodate the load from new users

8 (servers) x 2 (applications) x 4 (status codes) x 10,000 (users) = 640,000 possible tags

The O11y application must now track 640,000 possible tags
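
The arithmetic in this scenario can be reproduced with a short Python sketch. The function name `possible_combinations` and the dictionary keys are illustrative choices. The dimension sizes come from the steps above.

```python
from math import prod
from typing import Mapping


def possible_combinations(dimension_sizes: Mapping[str, int]) -> int:
    """Multiply the number of values in each dimension to get the possible tag count."""
    return prod(dimension_sizes.values())


baseline = {"servers": 2, "applications": 2, "status_codes": 4}
with_users = {**baseline, "users": 100}
new_region = {**with_users, "users": 10_000}
scaled_out = {**new_region, "servers": 8}

steps = [
    ("baseline", baseline),
    ("per-user tracking", with_users),
    ("new region", new_region),
    ("more servers", scaled_out),
]
for label, sizes in steps:
    print(f"{label}: {possible_combinations(sizes):,}")
```

Only the users dimension changed by orders of magnitude, yet it dominates the total at every later step.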

Scenario 2: Excessive Number of Low Cardinality Data Dimensions

2 servers running 2 applications with 4 status codes

2 (servers) x 2 (applications) x 4 (status codes) = 16 possible tags for your O11y application to track

Now add a metric that tracks which endpoints are being hit (5 endpoints) and which of 5 feature toggles are active. (Toggles a, b, c, d, and e, which are boolean values.)

2 x 2 x 4 x 5 (endpoints) x 2 (toggle a) x 2 (toggle b) x 2 (toggle c) x 2 (toggle d) x 2 (toggle e) = 2,560 tags

The boolean values alone added a multiplier of 2^5 = 32 to the total number of tags
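
To see that these are real, distinct combinations rather than just a product of numbers, the sketch below enumerates them with `itertools.product`. The specific server names, status codes, and endpoint paths are placeholders. Only the number of values per dimension comes from the scenario.

```python
from itertools import product

# Placeholder values; only the size of each list matters.
dimensions = {
    "server": ["server-1", "server-2"],
    "application": ["app-1", "app-2"],
    "status_code": ["200", "404", "500", "503"],
    "endpoint": [f"/endpoint-{i}" for i in range(1, 6)],
}

# Five boolean feature toggles, a through e.
toggles = {f"toggle_{name}": [True, False] for name in "abcde"}

all_dimensions = {**dimensions, **toggles}

# Every element is one distinct combination of tag values.
combinations = list(product(*all_dimensions.values()))
total = len(combinations)

# The toggles alone contribute 2 * 2 * 2 * 2 * 2 combinations.
toggle_multiplier = len(list(product(*toggles.values())))

print(total, toggle_multiplier)
```

Each additional boolean toggle doubles the total, so a handful of seemingly harmless low cardinality dimensions adds up quickly.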

Additional resources