Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric Math Alarms do not adhere to minMetricSamplesToAlarm #403

Open
akiesler opened this issue Aug 3, 2023 · 3 comments
Open

Metric Math Alarms do not adhere to minMetricSamplesToAlarm #403

akiesler opened this issue Aug 3, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@akiesler
Copy link

akiesler commented Aug 3, 2023

Version

5.3.4

Steps and/or minimal code example to reproduce

  1. Create a monitoring facade
  2. Create a MathExpression using MetricFactory.createMetricMath()
  3. Create an alarm using AlarmFactory.addAlarm()

Expected behavior

I would expect the corresponding "NoSamples" alarm to either check each metrics has the necessary number of samples or wrap the entire expression in DATAPOINT_COUNT to check for the correct sample count.

Actual behavior

The same metric alarm is created because there is no statistic prop to update.

Other details

No response

@akiesler akiesler added the bug Something isn't working label Aug 3, 2023
@akiesler
Copy link
Author

akiesler commented Aug 4, 2023

For anyone else facing this same issue I was able to hack together this solution.

import { Alarm, CfnAlarm, MathExpression } from 'aws-cdk-lib/aws-cloudwatch';
import { MonitoringFacade } from 'cdk-monitoring-constructs';

// Hack the underlying CFN types to be mutable so we can modify them.
type MetricDataQueryProperty = {
    -readonly [K in keyof CfnAlarm.MetricDataQueryProperty]: CfnAlarm.MetricDataQueryProperty[K];
};

type MetricStatProperty = {
    -readonly [K in keyof CfnAlarm.MetricStatProperty]: CfnAlarm.MetricStatProperty[K];
};

export function fixMetricMathMinSamples(monitoringFacade: MonitoringFacade) {
    // Find all minimum samples metric math alarms
    const alarms = monitoringFacade.node.children
        .filter((c) => c instanceof Alarm)
        .map((alarm) => alarm as Alarm)
        .filter((alarm) => alarm.node.id.includes('NoSamples') && alarm.metric instanceof MathExpression);

    alarms.forEach((alarm) => {
        const metrics = (alarm.node.defaultChild! as CfnAlarm).metrics! as MetricDataQueryProperty[];
        // Convert Metric Math to use the MAX Sample Count
        metrics.filter((m) => m.expression).forEach((m) => (m.expression = 'MAX(METRICS())'));
        // Convert Metrics to using SampleCount
        metrics.filter((m) => m.metricStat).forEach((m) => ((m.metricStat as MetricStatProperty).stat = 'SampleCount'));
    });
}

Would love to see the logic fixed in the library itself.

@echeung-amzn
Copy link
Member

This is where minMetricSamplesToAlarm is currently handled: https://github.com/cdklabs/cdk-monitoring-constructs/blob/main/lib/common/alarm/AlarmFactory.ts#L575-L607

We create an additional alarm based on a sample count variation of the provided metric that's then used in a composite alarm.

I'm not entirely sure how to best solve for this. Potentially the most straightforward naive solution would be to create a noSamplesAlarm equivalent for every underlying metric (e.g., Object.values(adjustedMetric.usingMetrics)), then use all of those in the composite alarm. There's a risk that we'd hit the limit for a composite alarm though.

On the other hand, DATAPOINT_COUNT is not recommend for use in alarms as per the docs:

We recommend that you do not use this function in CloudWatch alarms. Whenever an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than the number specified as Evaluation Periods. This function acts differently when extra data is requested.

Considering you provided the above hacky solution, did you have any opinion @akiesler ?

@akiesler
Copy link
Author

akiesler commented Sep 7, 2023

I think the devil is in the details. Do we know how the function behaves differently and went what extra data is requested?

This function acts differently when extra data is requested.

My own experience has been that using the DATAPOINT_COUNT hasn't resulted in any significant issues but I wouldn't want to say it will work for all users.
I think that it would be a viable stop-gap to ensure customers are getting at least some protection while we investigate the long-term solution.

echeung-amzn pushed a commit that referenced this issue Nov 10, 2023
Fixes #452

Currently, when using `minMetricSamplesToAlarm` the number of samples is
evaluated for a different period than the main alarm. This makes
monitoring sensitive to false positives as not every breaching datapoint
must have sufficient number of samples (see #452 for more details).

Moreover, the current approach for adjusting alarms to respect
`minMetricSamplesToAlarm` is to create 2 extra alarms - one for
`NoSamples` and one for a top-level composite. Each of these monitors
incurs extra costs ($0.10 for `NoSamples` monitor and $0.50 for the
Composite, see https://aws.amazon.com/cloudwatch/pricing/ for
reference). This means that using `minMetricSamplesToAlarm` increases
the cost from $0.10 per alarm to $0.70 per alarm ($0.60 of overhead!).

It's possible to use Math Expression instead. Instead of adding separate
alarm for `NoSamples`, we can model it a Sample Count metric, and
instead of the Composite, we can use the MathExpression that
conditionally emits the data based on the number of samples. The charge
for Math Expression-based alarms is per metric in the Math Expression,
so that comes down to $0.20 per alarm. That's a 70% cost improvement.
Additionally, it reduces the overall number of alarms, effectively
making it easier to fit your alarming in the CloudWatch quota and
decluttering the UI.

To avoid breaking any customers that rely on `minMetricSamplesToAlarm`
generating alarms (e.g.
#403),
deprecating it and adding `minSampleCountToEvaluateDatapoint` with
updated behaviour next to it.

---

_By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache-2.0 license_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants