Better unsafe regex detector #28

davisjam · 2018-03-29T18:27:55Z

Hi all,

I'm a systems/security researcher at Virginia Tech and have been studying the incidence of vulnerable regexes in the wild.

This plugin's unsafe regex detector relies on safe-regex, which uses star height (nested quantifiers) to identify unsafe regexes.

Pros:

safe-regex is fast.
safe-regex is an npm module which makes it easy to work with.
safe-regex has no non-JS dependencies.

As a result, safe-regex is great for CI use cases.

Cons:

safe-regex is incorrectly implemented and substack is not maintaining it.
safe-regex has lots of false positives (e.g. (ab+)+).
safe-regex will only identify one type of exponential-time vulnerability, and ignores all polynomial-time vulnerabilities. In my research I found that, in the wild, polynomial-time vulnerabilities are far more common than exp-time vulnerabilities.

There are some alternatives to safe-regex that report exploit strings so you can tell if they're correct or not.

Rathnayake's rxxr2. Like safe-regex, this only checks for star height-style vulnerabilities. But it doesn't have false positives as far as I can tell.
Wustholz's REXPLOITER. This tests star height and other exp-time vulnerabilities, plus poly-time vulnerabilities.
Weideman's RegexStaticAnalysis. Like Wustholz's REXPLOITER, but open-source and it works better.

Unfortunately:

These alternatives all have non-JS dependencies (e.g. OCaml or Java) and have inconsistent interfaces.
Some (especially Weideman) can take minutes to test a single regex.

My project vuln-regex-detector provides a convenient wrapper for these alternatives, and enforces time and memory limits to get results or fail relatively quickly.

However, I'd be surprised if developers were willing to wait even 30 seconds for linting. To address that, I'm nearly done implementing a server side so queries can be answered by hitting the server for a pre-computed answer instead of doing the expensive computation locally. The server processes not-seen-before queries in the background so subsequent queries will get a real answer.

Once that's done, would you folks be interested in hitting my server first and falling back to safe-regex if my server hasn't seen the query before? I've got a sample client that can be used with a one-line tweak for this use case.

The text was updated successfully, but these errors were encountered:

davisjam · 2018-03-31T13:02:59Z

@evilpacket I'm happy to submit a PR with a demonstration if desired.

stephanschubert · 2018-10-13T19:23:58Z

@evilpacket Sounds good?

evilpacket · 2019-04-02T19:54:55Z

@davisjam You've updated safe-regex since submitting this and I plan to bring in that change.

For something that hits an external source I'd like it to be configurable or a different rule and clear as to what is sent to the remote server etc. A lot of people get nervous when code they are scanning starts to get sent to a remote destination, but having this capability would be nice. Would the server side be something that others could also run themselves?

lirantal · 2019-04-02T20:55:46Z

@evilpacket the gist of it is probably all at https://github.com/davisjam/vuln-regex-detector/

devinrhode2 · 2019-09-11T17:54:55Z

Yeah I think best bet here is to not be sending anything to a remote server, and instead implement the caching stuff locally....
At least ask the person running this tool for permission when a new regex is being sent out

davisjam · 2019-09-11T18:01:29Z

The server-side code is available, but would have to be run on the user's side.

Unfortunately the existing advanced analyses are written in Java and OCaml, not JS.

I have half a million labeled regexes, so I suppose another option is to ship this database in safe-regex in compressed form as a "cache" of sorts.

MatthewHerbst · 2019-09-12T01:51:33Z

@davisjam is there a technical reason preventing the tools from being written in JS, or has that work just not been done by anyone?

davisjam · 2019-09-12T12:16:16Z

The analysis could be written in JavaScript, and I would happily incorporate it into safe-regex. I can point anyone interested in this in the right direction.

…

On Wed, Sep 11, 2019, 9:51 PM Matthew Herbst ***@***.***> wrote: @davisjam <https://github.com/davisjam> is there a technical reason preventing the tools from being written in JS, or has that work just not been done by anyone? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#28>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFOD3L2EY7V6EJKQCHBZAO3QJGOCNANCNFSM4EYDDY7A> .

makenowjust · 2020-11-24T10:27:04Z

@davisjam I will introduce my research MakeNowJust-Labo/redos.

This is a hybrid ReDoS detection library. "Hybrid" means, it combines automaton based detector and fuzzing based detector. When the RegExp pattern is hard to be checked by the automaton based, it switches to use the fuzzing based. This allows checking many patterns in the real world.

This library is written in Scala, but it is compiled to JavaScript by Scala.js. So, you can use this from JavaScript easily. In addition, I published eslint-plugin-redos for easy use. Feel free to try it!

davisjam · 2020-11-24T15:49:35Z

@makenowjust Thanks for letting me know about it! Looks cool.

I looked over the documentation (not the implementation) but I'm not sure exactly what you've made. Are you building on the academic research on the topic (e.g. Shen for fuzzing, Weideman/Wustholz/Rathnayake for NFA analysis)? What guarantees does the library offer?

makenowjust · 2020-11-24T17:02:57Z

@davisjam I'm not sure the below may be an answer to you. Sorry that my English is poor.

Automaton based detector: It checks whether the pattern potentially causes catastrophic backtracking (exponential, and 2nd or higher degree polynomial matching time) or not. This detector is built on automata theory, like Weideman's RegexStaticAnalysis.
Fuzzing based detector: It finds a string in a limited size whose VM execution steps exceeds the specified limit by using the fuzzing technique. The VM is based on the QuickJS implementation, which is the simple and naive implementation of JS standard (ECMA-262).

More discussions are needed to conclude "what they detect are same.", but I believe both of them can find a vulnerability. And, each of them has pros and cons. For example, automaton based detector can find vulnerability precisely, however it is very slow on several patterns, and hard to support some syntax. On the other hand, fuzzing based detector can support all syntax, and we can control analyzing time easily, however it is hard to avoid false-negative. So, I think it is better to combine both.

Fuzzing based detector refers Shen's ReScue, but my implementation is very extended. In fact, ReScue cannot find a polynomial order case, but my implementation can in many cases.

I'm preparing a paper about this now. And, I'm going to talk about this at the 62nd programming symposium in Japan (I'll talk in Japanese, sorry). I'm planning to explain detailed ideas (detector selection algorithm, and fuzzing improvements) here.

davisjam · 2020-12-02T01:26:35Z

@makenowjust Happy to discuss more over email -- I sent you a note a few days ago.

thernstig · 2023-01-30T09:11:35Z

@nzakas would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch.

davisjam · 2023-01-30T13:04:30Z

(I own safe-regex. It's very simple, not much to maintain). ReCheck is employing a more sophisticated algorithm that, based on the docs, should have more true positives and fewer false positives compared to safe-regex. ReCheck will be more time consuming to run. One potential check is how ReCheck would perform on a regex heavy codebase (say, the marked project for markdown). Based on my experience with similar tools, a local cache might be helpful for on-workstation use. I'm not sure if ReCheck has one. Get Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Tobias Hernstig ***@***.***> Sent: Monday, January 30, 2023 4:11:47 AM To: eslint-community/eslint-plugin-security ***@***.***> Cc: Davis, James C ***@***.***>; Mention ***@***.***> Subject: Re: [eslint-community/eslint-plugin-security] Better unsafe regex detector (#28)

---- External Email: Use caution with attachments, links, or sharing data ---- @nzakas<https://github.com/nzakas> would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch. — Reply to this email directly, view it on GitHub<#28 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AFOD3L5BIMOYH2KTYM7KN7TWU6AVFANCNFSM4EYDDY7A>. You are receiving this because you were mentioned.Message ID: ***@***.***>

thernstig · 2023-01-30T14:19:32Z

@davisjam your original post indicated that https://github.com/davisjam/vuln-regex-detector would be a better choice than https://github.com/davisjam/safe-regex, that you are both maintainer of.

Your last post in this thread indicated you reached out to @makenowjust but that seem to never have concluded something in this thread at least.

The question is then, are both projects worthy replacements and the option should be on the user, or what would be the best fit for eslint-plugin-security?

I think most code bases uses very few regexes that the extra time needed should be low enough so that the one with the most true positives and fewest false positives would be the best one to use, but that is my subjective opinion. We benchmarked safe-regex vs recheck and we are only talking milliseconds for some of the more complex regexes in our tests, but they were not exhaustive.

nzakas · 2023-01-31T18:43:08Z

The goal of the plugin is to try to catch as many problems as possible with as few false positives and false negatives as possible. Anything that moves us further in that direction is worthwhile to investigate.

That said, I'm not a security researcher, so I'd lean on the recommendations of folks like @davisjam to decide how to move forward here.

thernstig · 2023-02-01T10:11:12Z

@nzakas neither are we. We tested all of them and found the one by @makenowjust to have the least false positives and most true positives, but granted we just tested a few regexes. So that is very anecdotal and should not hold any ground for choosing one.

nzakas · 2023-02-02T18:37:33Z

If you could share your findings in this thread, maybe we can get some folks to review them and verify.

thernstig · 2023-02-07T14:38:38Z

@nzakas here are our simple results:

regex	Expected result	recheck async	recheck sync	safe-regex
`^test`	safe	safe (0.575 ms)	safe (30.778 ms)	safe (0.722 ms)
`^((a)+)+z$`	vulnerable	vulnerable (6.740 ms)	vulnerable (95.845 ms)	vulnerable (0.571 ms)
`^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$`	safe	safe (0.986 ms)	safe (4.159 ms)	vulnerable (1.422 ms)
`(\w+)\_(\w+)$`	vulnerable	vulnerable (2.950 ms)	vulnerable (43.759 ms)	safe (0.692 ms)
`^[0-9a-f]{4}$`	safe	safe (0.418 ms)	safe (1.655 ms)	safe (0.720 ms)
`^[0-9a-f]{4,30}$`	safe	safe (9.319 ms)	safe (228.656 ms)	safe (1.250 ms)
`^(?<foo>(?:Foo\|Bar))$`	safe	safe (25.991 ms)	safe (51.447 ms)	safe (0.758 ms)
`^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$`	safe	safe (2.964 ms)	safe (3.086 ms)	safe (0.862 ms)

Using this script (not the prettiest):

import { performance } from 'node:perf_hooks';
import { check, checkSync } from 'recheck';
import safeRegex from 'safe-regex';

// See https://github.com/davisjam/vuln-regex-detector/issues/81
// 'vuln-regex-detector' is not tested as it does not work properly

const regexes = [
  { regex: /^test/, vulnerable: false },
  { regex: /^((a)+)+z$/, vulnerable: true },
  { regex: /^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/, vulnerable: false },
  { regex: /(\w+)\_(\w+)$/, vulnerable: true },
  { regex: /^[0-9a-f]{4}$/, vulnerable: false },
  { regex: /^[0-9a-f]{4,30}$/, vulnerable: false },
  { regex: /^(?<foo>(?:Foo|Bar))$/, vulnerable: false },
  {
    regex: /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/,
    vulnerable: false,
  },
];

const resultPrinter = (result) => (result ? 'safe' : 'vulnerable');

// Call each function once in case they have code just running once on first call
safeRegex('foo');
await check('foo', '');

for (const regex of regexes) {
  const pattern = regex.regex.source; // String (RegExp.prototype.source)
  let name; // npm package name
  let result;

  // recheck async
  name = 'recheck-async';
  performance.mark(`start-${name}`);
  result = await check(pattern, '');
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: result.status },
    start: `start-${name}`,
    end: `end-${name}`,
  });

  // recheck sync
  name = 'recheck-sync';
  performance.mark(`start-${name}`);
  result = checkSync(pattern, '');
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: result.status },
    start: `start-${name}`,
    end: `end-${name}`,
  });

  // safe-regex
  name = 'safe-regex';
  performance.mark(`start-${name}`);
  result = safeRegex(pattern);
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: resultPrinter(result) },
    start: `start-${name}`,
    end: `end-${name}`,
  });
}

const map = new Map();

performance.getEntriesByType('measure').forEach((measure) => {
  if (map.has(measure.detail.pattern)) {
    map.get(measure.detail.pattern).push(measure);
  } else {
    map.set(measure.detail.pattern, [measure]);
  }
});

map.forEach((measures, pattern) => {
  console.log(`regex: /${pattern}/`);
  console.log(
    `  expected result: ${
      regexes.filter((regex) => regex.regex.source === pattern)[0].vulnerable
        ? 'vulnerable'
        : 'safe'
    }`
  );
  measures.forEach((measure) => {
    console.log(
      `  ${measure.detail.name}: ${
        measure.detail.result
      } (${measure.duration.toFixed(3)} ms)`
    );
  });
  console.log();
});

nzakas · 2023-02-07T18:11:26Z

Thanks. For clarity, can you also add a column for expected result? (Useful for places where different checkers return different results.)

thernstig · 2023-02-08T14:11:29Z

Updated.

@davisjam and @makenowjust can you double-confirm the above results?

jgeurts · 2023-02-08T18:23:34Z

We've seen false positives with the unsafe regex rule and named capture groups. Would it be possible to add a row to that table for regular expressions with named capture group(s)? E.g. ^(?<foo>(?:Foo|Bar))$ or maybe ^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$

thernstig · 2023-02-13T13:41:50Z

@jgeurts I tried to update the table a few days ago, but ran into bugs with both of the packages when trying again:

See:
makenowjust-labs/recheck#758
davisjam/vuln-regex-detector#81

I expect @makenowjust to fix the recheck problem, but I am not sure about vuln-regex-detector as the repository has very low activity.

Once I get some fixes, I plan to add a script in this issue that can be used to compare them. I will then add your regexes @jgeurts.

nzakas · 2023-02-13T20:45:35Z

Thanks @thernstig! We will wait to move forward for that update.

davisjam · 2023-02-14T14:55:00Z

Looks like @makenowjust has been able to resolve the issue in recheck 🥳 . I do not think I have time to resolve the issue in vuln-regex-detector.

FWIW The table of results ("expected result") in #28 (comment) looks correct to me.

thernstig · 2023-02-24T14:30:13Z

@nzakas @makenowjust @jgeurts I have updated the table. Scrutinize the code used to test if you have time.

nzakas · 2023-02-24T20:00:20Z

Thanks so much! It does look like recheck is more accurate, but to get a true apples-to-apples comparison, I think we need to run checkSync instead of check for recheck. The check method is running asynchronously so the timing information isn't accurate.

makenowjust · 2023-02-25T02:40:01Z

check and checkSync are quite different: check forks the native recheck binary and uses it, but checkSync uses the pure JS implementation compiled by Scala.js. Thus, checkSync is often slower than check in the long term. However, since ESLint limitation, we cannot use check in ESLint plugin.

thernstig · 2023-03-03T08:43:16Z

@nzakas @makenowjust updated the table with checkSync. Definitely the results became worse. Great if ESLint in the future can allow for async calls.

So changing will suffer quite a bit of performance, but be more correct.

@nzakas I assume it is up to you to decide now what to do with this information. Let me know if there are other packages and/or regexes you want added.

ota-meshi · 2023-03-03T10:49:20Z

If asynchronous processing is a bottleneck for us, I think we might be able to use synckit.
https://github.com/un-ts/synckit

thernstig · 2023-03-03T12:31:52Z

I assume that ESLint intends to allow for asynchronous processing sooner or later? Maybe better focus on fixing that?

eslint-community/eslint-plugin-security#28 (comment)

nzakas · 2023-03-06T21:48:34Z

Async rules in ESLint are still in the future, so we can't wait for that.

Unfortunately, I don't think we can use recheck right now considering how much slower it is. If it were just a little slower, that would be one thing, but in some cases it's an order of magnitude slower, and I don't think the tradeoff is worth it.

thernstig · 2023-03-07T10:43:08Z

Let's see if @makenowjust can solve this as he is implementing synckit. If that improves the numbers using synckit is that acceptable @nzakas ?

nzakas · 2023-03-07T19:06:08Z

If that can bring the numbers closer to safe-regex, then definitely. Although it might be worth the effort to see if we can improve safe-regex to be more accurate.

eslint-community/eslint-plugin-security#28 (comment)

kurtextrem · 2023-10-07T18:11:06Z

Looks like there is now also https://github.com/tjenkinson/eslint-plugin-redos-detector, using https://github.com/tjenkinson/redos-detector in case anyone is curious (or looking for alternatives). I haven't measured how redos-detector compares to recheck though.

devinrhode2 mentioned this issue Sep 11, 2019

The no-unsafe-regex rule is annoying sindresorhus/eslint-plugin-unicorn#153

Closed

RunDevelopment mentioned this issue Apr 19, 2021

Detect super-linear worst-case runtime ota-meshi/eslint-plugin-regexp#159

Open

thernstig mentioned this issue Mar 3, 2023

Docs: Mention that the eslint plugin uses checkSync due to an ESLint limitation makenowjust-labs/recheck#790

Open

makenowjust added a commit to makenowjust-labs/recheck that referenced this issue Mar 6, 2023

Introduce synckit for speed-up checkSync

f3a8bcb

eslint-community/eslint-plugin-security#28 (comment)

makenowjust mentioned this issue Mar 6, 2023

Introduce synckit for speed-up checkSync makenowjust-labs/recheck#799

Merged

makenowjust added a commit to makenowjust-labs/recheck that referenced this issue Mar 8, 2023

Introduce synckit for speed-up checkSync

051851e

eslint-community/eslint-plugin-security#28 (comment)

kurtextrem mentioned this issue Oct 7, 2023

chore: add regexp related eslint plugins fabian-hiller/valibot#202

Merged

Better unsafe regex detector #28

Better unsafe regex detector #28

Comments

davisjam commented Mar 29, 2018 • edited

davisjam commented Mar 31, 2018 • edited

stephanschubert commented Oct 13, 2018

evilpacket commented Apr 2, 2019

lirantal commented Apr 2, 2019

devinrhode2 commented Sep 11, 2019

davisjam commented Sep 11, 2019

MatthewHerbst commented Sep 12, 2019

davisjam commented Sep 12, 2019 via email

makenowjust commented Nov 24, 2020 • edited

davisjam commented Nov 24, 2020

makenowjust commented Nov 24, 2020

davisjam commented Dec 2, 2020

thernstig commented Jan 30, 2023

davisjam commented Jan 30, 2023 via email

thernstig commented Jan 30, 2023

nzakas commented Jan 31, 2023

thernstig commented Feb 1, 2023 • edited

nzakas commented Feb 2, 2023

thernstig commented Feb 7, 2023 • edited

nzakas commented Feb 7, 2023

thernstig commented Feb 8, 2023 • edited

jgeurts commented Feb 8, 2023

thernstig commented Feb 13, 2023

nzakas commented Feb 13, 2023

davisjam commented Feb 14, 2023

thernstig commented Feb 24, 2023

nzakas commented Feb 24, 2023

makenowjust commented Feb 25, 2023

thernstig commented Mar 3, 2023

ota-meshi commented Mar 3, 2023

thernstig commented Mar 3, 2023

nzakas commented Mar 6, 2023

thernstig commented Mar 7, 2023

nzakas commented Mar 7, 2023

kurtextrem commented Oct 7, 2023

davisjam commented Mar 29, 2018 •

edited

davisjam commented Mar 31, 2018 •

edited

makenowjust commented Nov 24, 2020 •

edited

thernstig commented Feb 1, 2023 •

edited

thernstig commented Feb 7, 2023 •

edited

thernstig commented Feb 8, 2023 •

edited