Skip to content

Commit

Permalink
New: Add no-useless-backreference rule (fixes #12673) (#12690)
Browse files Browse the repository at this point in the history
* New: Add no-useless-backreference rule (fixes #12673)

* Improve opening section in the docs, fix typo in a comment

* Reword the opening paragraph
  • Loading branch information
mdjermanovic authored and kaicataldo committed Jan 9, 2020
1 parent b2c6209 commit a1d999c
Show file tree
Hide file tree
Showing 5 changed files with 838 additions and 0 deletions.
130 changes: 130 additions & 0 deletions docs/rules/no-useless-backreference.md
@@ -0,0 +1,130 @@
# Disallow useless backreferences in regular expressions (no-useless-backreference)

In JavaScript regular expressions, it's syntactically valid to define a backreference to a group that belongs to another alternative part of the pattern, a backreference to a group that appears after the backreference, a backreference to a group that contains that backreference, or a backreference to a group that is inside a negative lookaround. However, by the specification, in any of these cases the backreference always ends up matching only zero-length (the empty string), regardless of the context in which the backreference and the group appear.

Backreferences that always successfully match zero-length and cannot match anything else are useless. They are basically ignored and can be removed without changing the behavior of the regular expression.

```js
var regex = /^(?:(a)|\1b)$/;

regex.test("a"); // true
regex.test("b"); // true!
regex.test("ab"); // false

var equivalentRegex = /^(?:(a)|b)$/;

equivalentRegex.test("a"); // true
equivalentRegex.test("b"); // true
equivalentRegex.test("ab"); // false
```

Useless backreference is a possible error in the code. It usually indicates that the regular expression does not work as intended.

## Rule Details

This rule aims to detect and disallow the following backreferences in regular expression:

* Backreference to a group that is in another alternative, e.g., `/(a)|\1b/`. In such constructed regular expression, the backreference is expected to match what's been captured in, at that point, a non-participating group.
* Backreference to a group that appears later in the pattern, e.g., `/\1(a)/`. The group hasn't captured anything yet, and ECMAScript doesn't support forward references. Inside lookbehinds, which match backward, the opposite applies and this rule disallows backreference to a group that appears before in the same lookbehind, e.g., `/(?<=(a)\1)b/`.
* Backreference to a group from within the same group, e.g., `/(\1)/`. Similar to the previous, the group hasn't captured anything yet, and ECMAScript doesn't support nested references.
* Backreference to a group that is in a negative lookaround, if the backreference isn't in the same negative lookaround, e.g., `/a(?!(b)).\1/`. A negative lookaround (lookahead or lookbehind) succeeds only if its pattern cannot match, meaning that the group has failed.

By the ECMAScript specification, all backreferences listed above are valid, always succeed to match zero-length, and cannot match anything else. Consequently, they don't produce parsing or runtime errors, but also don't affect the behavior of their regular expressions. They are syntactically valid but useless.

This might be surprising to developers coming from other languages where some of these backreferences can be used in a meaningful way.

```js
// in some other languages, this pattern would successfully match "aab"

/^(?:(a)(?=a)|\1b)+$/.test("aab"); // false
```

Examples of **incorrect** code for this rule:

```js
/*eslint no-useless-backreference: "error"*/

/^(?:(a)|\1b)$/; // reference to (a) into another alternative

/^(?:(a)|b(?:c|\1))$/; // reference to (a) into another alternative

/^(?:a|b(?:(c)|\1))$/; // reference to (c) into another alternative

/\1(a)/; // forward reference to (a)

RegExp('(a)\\2(b)'); // forward reference to (b)

/(?:a)(b)\2(c)/; // forward reference to (c)

/\k<foo>(?<foo>a)/; // forward reference to (?<foo>a)

/(?<=(a)\1)b/; // backward reference to (a) from within the same lookbehind

/(?<!(a)\1)b/; // backward reference to (a) from within the same lookbehind

new RegExp('(\\1)'); // nested reference to (\1)

/^((a)\1)$/; // nested reference to ((a)\1)

/a(?<foo>(.)b\1)/; // nested reference to (?<foo>(.)b\1)

/a(?!(b)).\1/; // reference to (b) into a negative lookahead

/(?<!(a))b\1/; // reference to (a) into a negative lookbehind
```

Examples of **correct** code for this rule:

```js
/*eslint no-useless-backreference: "error"*/

/^(?:(a)|(b)\2)$/; // reference to (b)

/(a)\1/; // reference to (a)

RegExp('(a)\\1(b)'); // reference to (a)

/(a)(b)\2(c)/; // reference to (b)

/(?<foo>a)\k<foo>/; // reference to (?<foo>a)

/(?<=\1(a))b/; // reference to (a), correctly before the group as they're in the same lookbehind

/(?<=(a))b\1/; // reference to (a), correctly after the group as the backreference isn't in the lookbehind

new RegExp('(.)\\1'); // reference to (.)

/^(?:(a)\1)$/; // reference to (a)

/^((a)\2)$/; // reference to (a)

/a(?<foo>(.)b\2)/; // reference to (.)

/a(?!(b|c)\1)./; // reference to (b|c), correct as it's from within the same negative lookahead

/(?<!\1(a))b/; // reference to (a), correct as it's from within the same negative lookbehind
```

Please note that this rule does not aim to detect and disallow a potentially erroneous use of backreference syntax in regular expressions, like the use in character classes or an attempt to reference a group that doesn't exist. Depending on the context, a `\1`...`\9` sequence that is not a syntactically valid backreference may produce syntax error, or be parsed as something else (e.g., as a legacy octal escape sequence).

Examples of additional **correct** code for this rule:

```js
/*eslint no-useless-backreference: "error"*/

// comments describe behavior in a browser

/^[\1](a)$/.test("\x01a"); // true. In a character class, \1 is treated as an octal escape sequence.
/^\1$/.test("\x01"); // true. Since the group 1 doesn't exist, \1 is treated as an octal escape sequence.
/^(a)\1\2$/.test("aa\x02"); // true. In this case, \1 is a backreference, \2 is an octal escape sequence.
```

## Further Reading

* [MDN: Regular Expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions)

## Related Rules

* [no-control-regex](no-control-regex.md)
* [no-empty-character-class](no-empty-character-class.md)
* [no-invalid-regexp](no-invalid-regexp.md)
1 change: 1 addition & 0 deletions lib/rules/index.js
Expand Up @@ -216,6 +216,7 @@ module.exports = new LazyLoadingRuleMap(Object.entries({
"no-unused-labels": () => require("./no-unused-labels"),
"no-unused-vars": () => require("./no-unused-vars"),
"no-use-before-define": () => require("./no-use-before-define"),
"no-useless-backreference": () => require("./no-useless-backreference"),
"no-useless-call": () => require("./no-useless-call"),
"no-useless-catch": () => require("./no-useless-catch"),
"no-useless-computed-key": () => require("./no-useless-computed-key"),
Expand Down
193 changes: 193 additions & 0 deletions lib/rules/no-useless-backreference.js
@@ -0,0 +1,193 @@
/**
* @fileoverview Rule to disallow useless backreferences in regular expressions
* @author Milos Djermanovic
*/

"use strict";

//------------------------------------------------------------------------------
// Requirements
//------------------------------------------------------------------------------

const { CALL, CONSTRUCT, ReferenceTracker, getStringIfConstant } = require("eslint-utils");
const { RegExpParser, visitRegExpAST } = require("regexpp");
const lodash = require("lodash");

//------------------------------------------------------------------------------
// Helpers
//------------------------------------------------------------------------------

const parser = new RegExpParser();

/**
* Finds the path from the given `regexpp` AST node to the root node.
* @param {regexpp.Node} node Node.
* @returns {regexpp.Node[]} Array that starts with the given node and ends with the root node.
*/
function getPathToRoot(node) {
const path = [];
let current = node;

do {
path.push(current);
current = current.parent;
} while (current);

return path;
}

/**
* Determines whether the given `regexpp` AST node is a lookaround node.
* @param {regexpp.Node} node Node.
* @returns {boolean} `true` if it is a lookaround node.
*/
function isLookaround(node) {
return node.type === "Assertion" &&
(node.kind === "lookahead" || node.kind === "lookbehind");
}

/**
* Determines whether the given `regexpp` AST node is a negative lookaround node.
* @param {regexpp.Node} node Node.
* @returns {boolean} `true` if it is a negative lookaround node.
*/
function isNegativeLookaround(node) {
return isLookaround(node) && node.negate;
}

//------------------------------------------------------------------------------
// Rule Definition
//------------------------------------------------------------------------------

module.exports = {
meta: {
type: "problem",

docs: {
description: "disallow useless backreferences in regular expressions",
category: "Possible Errors",
recommended: false,
url: "https://eslint.org/docs/rules/no-useless-backreference"
},

schema: [],

messages: {
nested: "Backreference '{{ bref }}' will be ignored. It references group '{{ group }}' from within that group.",
forward: "Backreference '{{ bref }}' will be ignored. It references group '{{ group }}' which appears later in the pattern.",
backward: "Backreference '{{ bref }}' will be ignored. It references group '{{ group }}' which appears before in the same lookbehind.",
disjunctive: "Backreference '{{ bref }}' will be ignored. It references group '{{ group }}' which is in another alternative.",
intoNegativeLookaround: "Backreference '{{ bref }}' will be ignored. It references group '{{ group }}' which is in a negative lookaround."
}
},

create(context) {

/**
* Checks and reports useless backreferences in the given regular expression.
* @param {ASTNode} node Node that represents regular expression. A regex literal or RegExp constructor call.
* @param {string} pattern Regular expression pattern.
* @param {string} flags Regular expression flags.
* @returns {void}
*/
function checkRegex(node, pattern, flags) {
let regExpAST;

try {
regExpAST = parser.parsePattern(pattern, 0, pattern.length, flags.includes("u"));
} catch (e) {

// Ignore regular expressions with syntax errors
return;
}

visitRegExpAST(regExpAST, {
onBackreferenceEnter(bref) {
const group = bref.resolved,
brefPath = getPathToRoot(bref),
groupPath = getPathToRoot(group);
let messageId = null;

if (brefPath.includes(group)) {

// group is bref's ancestor => bref is nested ('nested reference') => group hasn't matched yet when bref starts to match.
messageId = "nested";
} else {

// Start from the root to find the lowest common ancestor.
let i = brefPath.length - 1,
j = groupPath.length - 1;

do {
i--;
j--;
} while (brefPath[i] === groupPath[j]);

const indexOfLowestCommonAncestor = j + 1,
groupCut = groupPath.slice(0, indexOfLowestCommonAncestor),
commonPath = groupPath.slice(indexOfLowestCommonAncestor),
lowestCommonLookaround = commonPath.find(isLookaround),
isMatchingBackward = lowestCommonLookaround && lowestCommonLookaround.kind === "lookbehind";

if (!isMatchingBackward && bref.end <= group.start) {

// bref is left, group is right ('forward reference') => group hasn't matched yet when bref starts to match.
messageId = "forward";
} else if (isMatchingBackward && group.end <= bref.start) {

// the opposite of the previous when the regex is matching backward in a lookbehind context.
messageId = "backward";
} else if (lodash.last(groupCut).type === "Alternative") {

// group's and bref's ancestor nodes below the lowest common ancestor are sibling alternatives => they're disjunctive.
messageId = "disjunctive";
} else if (groupCut.some(isNegativeLookaround)) {

// group is in a negative lookaround which isn't bref's ancestor => group has already failed when bref starts to match.
messageId = "intoNegativeLookaround";
}
}

if (messageId) {
context.report({
node,
messageId,
data: {
bref: bref.raw,
group: group.raw
}
});
}
}
});
}

return {
"Literal[regex]"(node) {
const { pattern, flags } = node.regex;

checkRegex(node, pattern, flags);
},
Program() {
const scope = context.getScope(),
tracker = new ReferenceTracker(scope),
traceMap = {
RegExp: {
[CALL]: true,
[CONSTRUCT]: true
}
};

for (const { node } of tracker.iterateGlobalReferences(traceMap)) {
const [patternNode, flagsNode] = node.arguments,
pattern = getStringIfConstant(patternNode, scope),
flags = getStringIfConstant(flagsNode, scope);

if (typeof pattern === "string") {
checkRegex(node, pattern, flags || "");
}
}
}
};
}
};

0 comments on commit a1d999c

Please sign in to comment.