investigate flaky test-crypto-timing-safe-equal-benchmarks #38226

Closed
Trott opened this issue Apr 13, 2021 · 22 comments
Labels
crypto Issues and PRs related to the crypto subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI.


@Trott (Member) commented Apr 13, 2021

pummel/test-crypto-timing-safe-equal-benchmarks seems to fail frequently on LinuxONE. That's often a sign that a test (or some Node.js internal code) is not prepared to run on very fast CPUs. (In CI, our LinuxONE host is very fast and often finds race conditions of the "oh, I didn't expect it to run that quickly" variety).

Is this a bug in crypto? A bug in the test? Something else?

@nodejs/crypto @nodejs/testing

There doesn't seem to be a LinuxONE platform team, so I'm pinging a few IBM/Red Hat folks who are active in the Build team: @richardlau @AshCripps @mhdawson. Probably not exactly the right people, but they likely know who to loop in.

  • Test: test-crypto-timing-safe-equal-benchmarks
  • Platform: linuxone
  • Console Output:
00:09:15 not ok 2944 pummel/test-crypto-timing-safe-equal-benchmarks
00:09:18   ---
00:09:18   duration_ms: 2.136
00:09:18   severity: fail
00:09:18   exitcode: 1
00:09:18   stack: |-
00:09:18     node:assert:402
00:09:18         throw err;
00:09:18         ^
00:09:18     
00:09:18     AssertionError [ERR_ASSERTION]: timingSafeEqual should not leak information from its execution time (t=5.715867808741522)
00:09:18         at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel7-s390x/test/pummel/test-crypto-timing-safe-equal-benchmarks.js:109:1)
00:09:18         at Module._compile (node:internal/modules/cjs/loader:1108:14)
00:09:18         at Object.Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
00:09:18         at Module.load (node:internal/modules/cjs/loader:988:32)
00:09:18         at Function.Module._load (node:internal/modules/cjs/loader:828:14)
00:09:18         at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
00:09:18         at node:internal/main/run_main_module:17:47 {
00:09:18       generatedMessage: false,
00:09:18       code: 'ERR_ASSERTION',
00:09:18       actual: false,
00:09:18       expected: true,
00:09:18       operator: '=='
00:09:18     }
00:09:18   ...
@Trott Trott added crypto Issues and PRs related to the crypto subsystem. flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Apr 13, 2021
@Trott (Member Author) commented Apr 29, 2021

This has now started failing somewhat consistently on LinuxONE in CI. I don't know if that means the test just isn't reliable or if it means there's a timing issue with faster processors or what. @nodejs/crypto

not ok 2961 pummel/test-crypto-timing-safe-equal-benchmarks
  ---
  duration_ms: 2.12
  severity: fail
  exitcode: 1
  stack: |-
    node:assert:412
        throw err;
        ^
    
    AssertionError [ERR_ASSERTION]: timingSafeEqual should not leak information from its execution time (t=7.219867083599476)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/nodes/rhel7-s390x/test/pummel/test-crypto-timing-safe-equal-benchmarks.js:109:1)
        at Module._compile (node:internal/modules/cjs/loader:1109:14)
        at Object.Module._extensions..js (node:internal/modules/cjs/loader:1138:10)
        at Module.load (node:internal/modules/cjs/loader:989:32)
        at Function.Module._load (node:internal/modules/cjs/loader:829:14)
        at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
        at node:internal/main/run_main_module:17:47 {
      generatedMessage: false,
      code: 'ERR_ASSERTION',
      actual: false,
      expected: true,
      operator: '=='
    }
  ...

https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel7-s390x/26963/consoleText

@Trott (Member Author) commented Apr 29, 2021

Stress tests for bisecting:

Bisecting: 70 revisions left to test after this (roughly 6 steps)
[4f11b8b] doc: fix typo in buffer.md

Bisecting: 35 revisions left to test after this (roughly 5 steps)
[f27b7cf] fs: aggregate errors in fsPromises to avoid error swallowing

Bisecting: 17 revisions left to test after this (roughly 4 steps)
[87aca07] doc: clarify that fs.Dir async iterator closes automatically

Bisecting: 8 revisions left to test after this (roughly 3 steps)
[896e5af] tools: remove fixer for non-ascii-character ESLint custom rule

Bisecting: 4 revisions left to test after this (roughly 2 steps)
[1adcae9] doc: add try/catch in http2 respondWithFile example

Bisecting: 1 revision left to test after this (roughly 1 step)
[7919ced] lib: harden lint checks for globals

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[4af15df] src: fix validation of negative offset to avoid abort

7919ced is the first bad commit
commit 7919ced
Author: Antoine du Hamel duhamelantoine1995@gmail.com
Date: Mon Apr 26 18:03:54 2021 +0200

lib: harden lint checks for globals

PR-URL: https://github.com/nodejs/node/pull/38419
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>

:040000 040000 8080a76b162a40209ec5b9397db3945fa36ee143 247e5c50613229df3f9776ce9637237146682dd0 M

@Trott (Member Author) commented Apr 29, 2021

I know they're not active in this repo anymore, but ping @not-an-aardvark: Hey, does this look like an issue with the crypto timing-safe test? Or does it look like a bug? The machine that it's failing on is very fast, if that helps. Would increasing numTrials from 1e5 to a larger value help, or does that defeat the purpose of the test?

@not-an-aardvark (Contributor) commented Apr 29, 2021

From looking at a few of the failing cases, it seems like the error message is always "timingSafeEqual should not leak information from its execution time (t=<some positive number>)". Assuming that the test is working properly, this would be a signal that timingSafeEqual is consistently running faster for equal inputs than for unequal inputs (at least on some hardware/builds), which would be bad and might be worth investigating. That said, in the past this test has had various false positives due to artifacts of the test setup itself.

The test fails if it determines that a timing difference is statistically significant. In general, increasing numTrials would make the test more sensitive to small timing differences (i.e. maybe even more likely to fail), since it would have a larger dataset to do statistical analysis on.

In this case, it seems like the t-values are routinely 9 and above, which wouldn't plausibly happen by chance. So we can be pretty confident that it's either (a) a legitimate bug in crypto.timingSafeEqual, or (b) a bug in the test setup, which wouldn't be fixed by changing numTrials.
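For context, the test's statistic is a pooled two-sample t-value, as seen in its helpers. A minimal standalone sketch (ours, not the test's exact code; the helper names `mean`, `pooledStd`, and `tValue` are illustrative):

```javascript
// Pooled two-sample t statistic: (mean1 - mean2) / standard error,
// where the standard error uses the combined (pooled) standard deviation.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function pooledStd(xs, ys) {
  const mx = mean(xs), my = mean(ys);
  const ssx = xs.reduce((a, x) => a + (x - mx) ** 2, 0);
  const ssy = ys.reduce((a, y) => a + (y - my) ** 2, 0);
  return Math.sqrt((ssx + ssy) / (xs.length + ys.length - 2));
}

function tValue(xs, ys) {
  const se = pooledStd(xs, ys) * Math.sqrt(1 / xs.length + 1 / ys.length);
  return (mean(xs) - mean(ys)) / se;
}

console.log(tValue([1, 2, 3, 4], [1, 2, 3, 4])); // → 0 (identical samples)
console.log(tValue([10, 11, 12, 13], [1, 2, 3, 4])); // large |t|: clearly different means
```

With larger sample sizes (bigger numTrials), the standard error shrinks, so even a tiny real timing difference produces a large |t|, which is why increasing numTrials makes the test more sensitive, not less.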


Incidentally, when we were originally implementing crypto.timingSafeEqual in #8040, we considered using a "double HMAC" implementation rather than CRYPTO_memcmp, but decided on CRYPTO_memcmp because we didn't want to leak the boolean condition of whether the comparison was successful.*

Since then, there have been significant advances in side-channel attacks (Meltdown/Spectre and others), which has led me to believe that double HMAC could be a good defense-in-depth measure against increasingly powerful compilers and CPU branch prediction that could optimize away CRYPTO_memcmp.

It might be worth updating crypto.timingSafeEqual to use both double HMAC and CRYPTO_memcmp (i.e. given buffers a and b, return the result of CRYPTO_memcmp(sha256hmac(randomKey, a), sha256hmac(randomKey, b))).

*I also have some doubts about whether it's really possible to avoid leaking the boolean result, due to branch prediction. This might also be why this test keeps failing. In any case, combining CRYPTO_memcmp and double HMAC wouldn't make anything worse in that regard.

@Trott (Member Author) commented Apr 29, 2021

Bisect results:

7919ced0c97e9a5b17e6042e0b57bc911d23583d is the first bad commit
commit 7919ced0c97e9a5b17e6042e0b57bc911d23583d
Author: Antoine du Hamel <duhamelantoine1995@gmail.com>
Date:   Mon Apr 26 18:03:54 2021 +0200

    lib: harden lint checks for globals
    
    PR-URL: https://github.com/nodejs/node/pull/38419
    Reviewed-By: James M Snell <jasnell@gmail.com>
    Reviewed-By: Darshan Sen <raisinten@gmail.com>

:040000 040000 8080a76b162a40209ec5b9397db3945fa36ee143 247e5c50613229df3f9776ce9637237146682dd0 M

@Trott (Member Author) commented Apr 29, 2021

The bisect results are certainly surprising. Going to run a bunch more tests to confirm them.

Stress tests against 7919ced:

Stress tests on 4af15df (the last change before the one above that is being tentatively blamed for the problem):

@Trott (Member Author) commented Apr 30, 2021

In this case, it seems like the t-values are routinely 9 and above, which wouldn't plausibly happen by chance.

When I increased numTrials to 1e6, I got a whopping t-value of 37.871320438939634, which I imagine is not moving in the right direction.

@Trott (Member Author) commented Apr 30, 2021

The bisect results are certainly surprising. Going to run a bunch more tests to confirm them.

Stress tests against 7919ced:

Stress tests on 4af15df (the last change before the one above that is being tentatively blamed for the problem):

Those sure seem consistent with 7919ced making the problem much worse, presumably as some indirect side effect. I wonder if the next step is to bisect the individual line changes in that commit (since, outside of the lint tool changes, they are mostly self-contained) to get to the bottom of which one (if it's indeed only one) is causing the issue.

@Trott (Member Author) commented Apr 30, 2021

So, I bisected on individual file changes, and the one causing the issue is Trott@845adb7, which is totally puzzling to me, but maybe @not-an-aardvark or @aduh95 would have a reasonable explanation (assuming my results hold up to further scrutiny, which I will definitely apply to confirm).

Trott added a commit to Trott/io.js that referenced this issue Apr 30, 2021
@Trott Trott reopened this Apr 30, 2021
Trott added a commit to Trott/io.js that referenced this issue Apr 30, 2021
Add HMAC to crypto.timingSafeEqual(). This makes things slower but also
makes them far more timing safe.

Refs: nodejs#38226 (comment)
Fixes: nodejs#38226
targos pushed a commit that referenced this issue May 3, 2021
Fixes: #38226

PR-URL: #38476
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Ruy Adorno <ruyadorno@github.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
@mhdawson (Member):
@Trott should we exclude the failing test until we resolve the issue? Sounds like you are on top of the PRs that might fix it, but seems like it may still take a little while until there is agreement on the fix?

I'd be happy to open a PR to exclude for now.

@mhdawson (Member):
What I meant is to mark it as flaky.

@mhdawson (Member) commented Sep 9, 2021

Failing consistently when testing this PR: #39977 (comment)

@mhdawson (Member) commented Sep 9, 2021

@Trott I'm happy to submit a PR to mark it as flaky if the fix you mention is not going to land soon. Any objections to that?

@richardlau (Member):
Opened a PR to mark the test as flaky: #40065

nodejs-github-bot pushed a commit that referenced this issue Sep 12, 2021
PR-URL: #40065
Refs: #38226
Reviewed-By: Ash Cripps <acripps@redhat.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Michael Dawson <midawson@redhat.com>
Reviewed-By: Tobias Nießen <tniessen@tnie.de>
Trott added a commit to Trott/io.js that referenced this issue May 27, 2022
Add HMAC to crypto.timingSafeEqual(). This makes things slower but also
makes them far more timing safe.

Refs: nodejs#38226 (comment)
Fixes: nodejs#38226
@tniessen (Member):
I tried rewriting the test to avoid branching within the benchmark entirely, to get rid of V8 optimizations etc. It only benchmarks calls that return false because there is no guarantee that returning true is going to take exactly the same time. (Also, this is usually what matters for security: the timing should give no information about what part of the buffer was incorrect.)

diff --git a/test/pummel/test-crypto-timing-safe-equal-benchmarks.js b/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
index 3c6e31e829..db8f85169e 100644
--- a/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
+++ b/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
@@ -9,61 +9,60 @@ if (!common.enoughTestMem)
 const assert = require('assert');
 const crypto = require('crypto');
 
-function runOneBenchmark(compareFunc, firstBufFill, secondBufFill, bufSize) {
-  return eval(`
-      const firstBuffer = Buffer.alloc(bufSize, firstBufFill);
-      const secondBuffer = Buffer.alloc(bufSize, secondBufFill);
-
-      const startTime = process.hrtime();
-      const result = compareFunc(firstBuffer, secondBuffer);
-      const endTime = process.hrtime(startTime);
-
-      // Ensure that the result of the function call gets used, so it doesn't
-      // get discarded due to engine optimizations.
-      assert.strictEqual(result, firstBufFill === secondBufFill);
-
-      endTime[0] * 1e9 + endTime[1];
-    `);
-}
+const numTrials = 1e5;
+const bufSize = 16384;
+const readOnlyTestBuffer = crypto.randomBytes(bufSize);
 
 function getTValue(compareFunc) {
-  const numTrials = 1e5;
-  const bufSize = 10000;
   // Perform benchmarks to verify that timingSafeEqual is actually timing-safe.
 
-  const rawEqualBenches = Array(numTrials);
-  const rawUnequalBenches = Array(numTrials);
-
-  for (let i = 0; i < numTrials; i++) {
-    if (Math.random() < 0.5) {
-      // First benchmark: comparing two equal buffers
-      rawEqualBenches[i] = runOneBenchmark(compareFunc, 'A', 'A', bufSize);
-      // Second benchmark: comparing two unequal buffers
-      rawUnequalBenches[i] = runOneBenchmark(compareFunc, 'B', 'C', bufSize);
-    } else {
-      // Flip the order of the benchmarks half of the time.
-      rawUnequalBenches[i] = runOneBenchmark(compareFunc, 'B', 'C', bufSize);
-      rawEqualBenches[i] = runOneBenchmark(compareFunc, 'A', 'A', bufSize);
-    }
+  // Store all results in a single array and separate it later to avoid
+  // branching in the benchmark loop.
+  const measurements = Array(2 * numTrials).fill(0);
+  const testBuffer = Buffer.from(readOnlyTestBuffer);
+
+  // Run the actual benchmark. Avoid all branching (except the loop condition)
+  // to avoid conditional V8 optimizations.
+  let n = 0;
+  for (let i = 0; i < 2 * numTrials; i++) {
+    // Modify either the first or last byte of the copy of the test buffer.
+    const j = (bufSize - 1) * (i % 2);
+    testBuffer[j] ^= 1 | (i & 0xff);
+    // Call the comparison function and coerce the result into a number.
+    const startTime = process.hrtime.bigint();
+    n += compareFunc(testBuffer, readOnlyTestBuffer);
+    const endTime = process.hrtime.bigint();
+    measurements[i] = Number(endTime - startTime);
+    // Restore the original byte.
+    testBuffer[j] = readOnlyTestBuffer[j];
   }
 
-  const equalBenches = filterOutliers(rawEqualBenches);
-  const unequalBenches = filterOutliers(rawUnequalBenches);
+  // The comparison function should have returned false in every iteration, but
+  // we only check that here to avoid explicit branching above.
+  assert.strictEqual(n, 0);
+
+  // A simple comparison would be fast for even i and slow for odd i.
+  const rawFastBenches = measurements.filter((_, i) => i % 2 === 0);
+  const rawSlowBenches = measurements.filter((_, i) => i % 2 !== 0);
+  const fastBenches = filterOutliers(rawFastBenches);
+  const slowBenches = filterOutliers(rawSlowBenches);
 
   // Use a two-sample t-test to determine whether the timing difference between
   // the benchmarks is statistically significant.
   // https://wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test
 
-  const equalMean = mean(equalBenches);
-  const unequalMean = mean(unequalBenches);
+  const fastMean = mean(fastBenches);
+  const slowMean = mean(slowBenches);
+
+  const fastLen = fastBenches.length;
+  const slowLen = slowBenches.length;
 
-  const equalLen = equalBenches.length;
-  const unequalLen = unequalBenches.length;
+  const combinedStd = combinedStandardDeviation(fastBenches, slowBenches);
+  const standardErr = combinedStd * Math.sqrt(1 / fastLen + 1 / slowLen);
 
-  const combinedStd = combinedStandardDeviation(equalBenches, unequalBenches);
-  const standardErr = combinedStd * Math.sqrt(1 / equalLen + 1 / unequalLen);
+  console.log(JSON.stringify({ rawFastBenches, rawSlowBenches, fastMean, slowMean, t: (fastMean - slowMean) / standardErr }));
 
-  return (equalMean - unequalMean) / standardErr;
+  return (fastMean - slowMean) / standardErr;
 }
 
 // Returns the mean of an array

The variable names fast and slow refer to a simple comparison function that would be fast when the start of the buffer does not match and slow when the end of the buffer does not match. We expect timingSafeEqual to take the same time in both cases, so we expect to find no statistically significant difference between fast and slow when the comparison function is timingSafeEqual.
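That early-exit behavior can be made concrete with a small illustration (ours, not code from the test): a naive byte-by-byte comparison returns at the first mismatch, so its running time reveals the mismatch position.

```javascript
// Naive comparison with a step counter to make the timing leak visible:
// the number of loop iterations equals the position of the first mismatch.
function naiveEqual(a, b) {
  let steps = 0;
  for (let i = 0; i < a.length; i++) {
    steps++;
    if (a[i] !== b[i]) return { equal: false, steps }; // early exit leaks position
  }
  return { equal: true, steps };
}

const base = Buffer.from([0, 1, 2, 3, 4, 5, 6, 7]);
const firstByteDiffers = Buffer.from([9, 1, 2, 3, 4, 5, 6, 7]);
const lastByteDiffers = Buffer.from([0, 1, 2, 3, 4, 5, 6, 9]);

console.log(naiveEqual(base, firstByteDiffers).steps); // 1 — the "fast" case
console.log(naiveEqual(base, lastByteDiffers).steps);  // 8 — the "slow" case
```

A timing-safe comparison must do the same amount of work in both cases, which is exactly what the rewritten benchmark checks.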


Stress test on master: 44.9 % failed (build)
Stress test with this patch: 2.3 % failed (build)

Second stress test on master: 29.0 % failed (build)
Second stress test with this patch: 1.3 % failed (build)


So this seems to help, but the failure rate doesn't go down to 0. I added some logging and re-ran the stress test with the above patch (build). I copied the raw measurements (rawFastBenches and rawSlowBenches) from the logging data produced during the first failure (line 6517 in the console output of the build), here it is as a CSV file. (This is just one randomly chosen failure, so all of my findings below might be coincidental or wrong.)

The test script had reported t=15.965637751534858. R produces the same value:

d <- read.csv('fast-slow.csv')
t.test(d$fast, d$slow)
	Welch Two Sample t-test

data:  d$fast and d$slow
t = 15.966, df = 196380, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 55.84268 71.47212
sample estimates:
mean of x mean of y
 9680.232  9616.575

The test does include some code to remove "large outliers" from the measurements, that is, values that are at least 50 times larger than the mean.

max(d$fast) / mean(d$fast)
5.823827
max(d$slow) / mean(d$slow)
8.201985

So none of the measurements are close to being considered "large outliers" by the test.
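A sketch of the "large outlier" filter as described above (a reconstruction under the stated 50×-mean rule; the test's actual implementation may differ in detail):

```javascript
// Drop measurements that are at least 50 times larger than the mean of
// the sample. Because the mean itself includes the outliers, only truly
// enormous values in large samples ever get removed.
function filterOutliers(measurements) {
  const mean = measurements.reduce((a, b) => a + b, 0) / measurements.length;
  return measurements.filter((v) => v < mean * 50);
}

// 99 small values plus one huge one: the huge value exceeds 50x the mean
// and is dropped.
const data = Array(99).fill(1).concat([1e6]);
console.log(filterOutliers(data).length); // 99
```

This also shows why the filter is so permissive here: with a maximum only 5.8–8.2 times the mean, as measured above, nothing comes close to the 50× cutoff.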


But there certainly are outliers (which are not removed by the test), and there are certainly also large differences even between successive measurements.

# Relative difference between successive "fast" measurements, as percentages.
rel.diff <- 100 * abs(diff(d$fast)) / d$fast[1:99999]

quantile(rel.diff, probs = c(0.90, 0.99))
     90%      99%
11.68831 41.46492

This means that 1 in 10 measurements differed by at least 11.6 % from the previous measurement, and 1 in 100 measurements differed by at least 41.5 % from the previous measurement.


Plotting all 200,000 data points:

library(ggplot2)
library(reshape2)
library(dplyr)

d$index <- 1:nrow(d)
plottable <- melt(d, id.vars = "index", measure.vars = c("fast", "slow"))

ggplot(plottable, aes(index, value, colour = variable)) + geom_point(alpha = 0.4, size = 1)

plot-all-data-fast-slow

It doesn't look that bad. (Most of the fast data points are hidden behind the slow data points, there are 100,000 each.)

There are certainly more outliers in the first part. So let's try the t test again, using the second half of the graph only:

t.test(d$fast[50000:100000], d$slow[50000:100000])
	Welch Two Sample t-test

data:  d$fast[50000:1e+05] and d$slow[50000:1e+05]
t = 3.6058, df = 99338, p-value = 0.0003113
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.657665 25.896144
sample estimates:
mean of x mean of y 
 9589.558  9572.781 

The test uses T_THRESHOLD = 3.892, so t = 3.6058 is below the threshold. In other words, the second half of the benchmark would have passed the test, but barely. This might indicate some relation to just-in-time optimization (but note that the patched test seems to have a failure rate as low as 2 %).

100 * (mean(d$fast[1:50000]) / mean(d$fast[50000:100000]) - 1)
1.89088
100 * (mean(d$slow[1:50000]) / mean(d$slow[50000:100000]) - 1)
0.9149779

So the fast benchmark was about 2 % slower during the first half than during the second half, and the slow benchmark was about 1 % slower during the first half. This difference is statistically significant:

t.test(d$fast[1:50000], d$fast[50000:100000])
	Welch Two Sample t-test

data:  d$fast[1:50000] and d$fast[50000:1e+05]
t = 30.313, df = 89327, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 169.6027 193.0513
sample estimates:
mean of x mean of y 
 9770.885  9589.558 

Given that there is a statistically significant difference between the first and second half of the same benchmark, it is not surprising that there is also a statistically significant difference between the two benchmarks.


The graph also hints at another issue. Let's only consider the fast benchmark, and only consider measurements that are at most one standard deviation away from the mean. In other words, the distribution of measurements around the mean:

ggplot(d, aes(index, fast)) +
  geom_point(alpha=0.1, size=0.5) +
  ylim(mean(d$fast) - sd(d$fast), mean(d$fast) + sd(d$fast)) +
  geom_hline(yintercept = mean(d$fast), color = "red")

fast-distribution-around-mean

This does not look like it's close to being a normal distribution; the measurements below the mean are much closer together than those above the mean.

This becomes more apparent when considering how many samples fall within x times the standard deviation. This is what we should be seeing:

library(purrr)
samples <- rnorm(length(d$fast), mean = mean(d$fast), sd = sd(d$fast))
f <- function(x) length(samples[abs(samples - mean(samples)) <= x * sd(samples)]) / length(samples)
map_dbl(c(0.5, 1, 2, 3), f)
0.38253 0.68167 0.95490 0.99747

But this is what we see:

samples <- d$fast
map_dbl(c(0.5, 1, 2, 3), f)
0.85619 0.89095 0.96773 0.97943

The t test assumes that both sets of samples are drawn from normal distributions, and since that is not true here, the t test is not necessarily meaningful.
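The same coverage check can be sketched in JavaScript (a standalone illustration, ours rather than the test's; `coverage` and `randn` are hypothetical helper names):

```javascript
// Fraction of samples within k standard deviations of the mean. For a
// normal distribution this should be close to 0.383, 0.683, 0.954, 0.997
// for k = 0.5, 1, 2, 3.
function coverage(samples, k) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(samples.reduce((a, x) => a + (x - mean) ** 2, 0) / n);
  return samples.filter((x) => Math.abs(x - mean) <= k * sd).length / n;
}

// Box-Muller standard-normal generator for a sanity check.
function randn() {
  const u = 1 - Math.random(); // in (0, 1], so Math.log(u) is finite
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

const normal = Array.from({ length: 1e5 }, randn);
console.log([0.5, 1, 2, 3].map((k) => coverage(normal, k).toFixed(3)));
```

Running `coverage` over the benchmark measurements instead of `normal` would reproduce the skewed 0.856/0.891/0.968/0.979 pattern observed above.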


My main takeaway from this (at least for the moment) is that timingSafeEqual is not constant-time. It guarantees that the time it takes does not significantly depend on the input data because the underlying algorithm is a constant-time algorithm (and this is what matters to users), but the function itself does not always take the same amount of time, and the deviations from the expected time do not approximate a normal distribution.

This does not explain why the failures are much more frequent on rhel72-ppc64 than on other machines though. We can probably fine-tune this test to reduce the failure probability enough to re-enable the test, but I am not sure we are addressing the real problem. (cc @Trott)

@anonrig (Member) commented Oct 18, 2023
nodejs-github-bot pushed a commit that referenced this issue Oct 20, 2023
PR-URL: #50232
Refs: #38226
Reviewed-By: Ruy Adorno <ruyadorno@google.com>
Reviewed-By: Filip Skokan <panva.ip@gmail.com>
@tniessen (Member) commented Apr 4, 2024

Refs: #52341

Not sure if this change might make it better or worse.

@mhdawson (Member):
Seems to be failing a lot on macOS now. An example:

https://github.com/nodejs/node/actions/runs/8822600805/job/24221172295?pr=52671

=== release test-crypto-timing-safe-equal-benchmarks ===
Path: pummel/test-crypto-timing-safe-equal-benchmarks
Error: --- stderr ---
node:assert:408
    throw err;
    ^

AssertionError [ERR_ASSERTION]: timingSafeEqual should not leak information from its execution time (t=8.54045763079109)
    at Object.<anonymous> (/Users/runner/work/node/node/test/pummel/test-crypto-timing-safe-equal-benchmarks.js:109:1)
    at Module._compile (node:internal/modules/cjs/loader:1476:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1555:10)
    at Module.load (node:internal/modules/cjs/loader:1288:32)
    at Module._load (node:internal/modules/cjs/loader:1104:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:191:14)
    at node:internal/main/run_main_module:30:49 {
  generatedMessage: false,
  code: 'ERR_ASSERTION',
  actual: false,
  expected: true,
  operator: '=='
}

nodejs-github-bot pushed a commit that referenced this issue May 1, 2024
PR-URL: #52751
Refs: #38226
Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com>
Reviewed-By: Filip Skokan <panva.ip@gmail.com>
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Richard Lau <rlau@redhat.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
@lpinca (Member) commented May 1, 2024

b876e00 landed, I'm closing this.

@lpinca lpinca closed this as completed May 1, 2024