Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve XRegExp.matchRecursive handling of unbalanced delimiters #96

Closed
farfromrefug opened this issue Aug 26, 2015 · 12 comments
Closed

Comments

@farfromrefug
Copy link

Hi,

First thanks for your great library.
I am using matchRecursive to find JSON messages on a stream.
Every time i receive a message, i push it a string stack.
Then i use matchRecursive to try and find messages. If i find some, i slice those messages from the stack.
Now sometimes some messages get lost and i end up with "unbalanced" stack, like this:
{"toto":"test","bd":{"toto":"test","bd":{"toto":"test","bd":{"toto":"test","bd":1,"asdfg":true}{"toto":"test","bd":{"toto":"test","bd":1,"asdfg":true}{"toto":"test","bd":1,"asdfg":{"toto":2}}

If i use your library like so i cant get the message in this because it's unbalanced.
And this is correct. there is no outputs if you look at this message.

Though there is a simple way to find the messages. Reverse the string and inverse the left and right delimiters.
It works and matchRecursive actually found the message. However it still reports unbalanced and throw at the end.
I easily fixed this locally by only throwing if ouptuts.length===0 like this:

if (output.length === 0) {
                    throw new Error("string contains unbalanced delimiters");
                } else {
                    return output;
                }

That way i can still get the matched strings even if at the end it is unbalanced.
This is very useful.

I was wondering if you would be willing to commit that in the master branch?
I think it would be a great addition (could become an option).

@slevithan
Copy link
Owner

Yes, I'd be totally willing to include an option that allows XRegExp.matchRecursive to skip past unbalanced delimiters. In fact, I'd prefer to make that the default. Perhaps the new option could be called requireBalance, and when set to true it would revert to the current handling where scanning past an unbalanced delimiter throws an error.

Unfortunately, I don't expect to get to this for XRegExp 3.0 final, unless it's provided as a pull request.

The challenge is supporting all of matchRecursive's existing features and providing robust tests proving the change is safe. Right now matchRecursive is under-tested. See https://github.com/slevithan/xregexp/blob/master/tests/spec/s-addons-matchrecursive.js which currently just tests the examples from the readme, rather than testing individual features, combinations, and assumptions.

Features that would need to continue working correctly after such a change is introduced include:

  • Sticky and non-sticky mode, via flag y.
  • Global and non-global mode, via flag g.
  • The escapeChar option.
  • Extended information mode, via the valueNames option. The unbalanced delimiters should appear within the "between" values that are returned. As basic examples, if your delimiters are { and }, and your subject is 'a{b{c}d', I think a{b and d should be "between" values, and c should be the match. Similarly, if your subject is 'a{b}c}d', I think a and c}d should be "between" values, and b should be the match.
  • Maintain behavior (or document reasonable behavior) for when left and right delimiters match some or all of the same strings.
  • Maintain behavior (or document reasonable behavior) for when left and or right delimiters sometimes or always match empty strings.

If this is something you're able to help with, that would be awesome and much appreciated.

@farfromrefug
Copy link
Author

Hi,

I might look at this and create a PR. But all your changes requirements wont work.
Especially handling 'a{b{c}d'. This one wont report results and cant do so because the unbalanced happens before any match is found.

I don't want to make any hard change in your code so my PR will mostly only include the code i posted before.
What i can do is make it an option though. I think i can also look at handling 'a{b}c}d' and c}d beetween value.

@farfromrefug
Copy link
Author

here s the PR #99

@slevithan
Copy link
Owner

I might look at this and create a PR. But all your changes requirements wont work.
Especially handling 'a{b{c}d'. This one wont report results and cant do so because the unbalanced happens before any match is found.

The handling I described is the handling of all regex engines that support recursive matching (Perl, PCRE, and .NET). What happens in the case of something like 'a{b{c}d' is that, when the match attempt starting at the first { reaches the end of the string and discovers that the { is unbalanced, the match attempt fails, and the engine moves on to start a new match attempt at the next character position, and so on. IMO this is also how recursive matching using XRegExp.matchRecursive can and should work, at least by default. Admittedly, this might require significant changes to the function's current implementation.

Aside: It might also make sense to add another option (named something like includeUnterminatedMatches) that would change the handling for unbalanced left delimiters, so that they're not skipped past but instead continue to the end of the subject string. I'm not sure what the use cases for this would be, though. You could get even more flexible with an additional option like unterminatedMatchesEndAtNextLeftDelimiter.

It seems I might have misunderstood your original message, though. Rather than supporting all aspects of recursive matching within unbalanced strings, it seems you're looking to make a narrower change that is fairly specific to your situation (ending a global match early and not throwing on unbalanced delimiters if you've already found at least one balanced match). If that's the case, it might be better to just make the change in your local copy. I'm not sure how common the case you're describing is, and partial handling might cause more confusion and problems than it solves.

More feedback is welcome, and I'll leave this issue open for the general feature of recursive matching within unbalanced strings.

@farfromrefug
Copy link
Author

@slevithan I want to make it work but i am not sure we can when the unbalance starts before any match found.
And in fact the problem is not that no match is found. But instead that openTokens is already >0 when the first match is found.
The problem then becomes the finding of deep delimiters like {{}}
I wrapped my head around it and i don't see a way to discover them.
Having the openTokens reach 0 seems to be the only way to find them and it will never happen in the case of 'a{b{c}d'

But dont worry if we find a way to handle that i will change my PR to support it :D
And yes i kind of stop to there because in a sense it was enough for me and because i did not see any solution. Well for me there was one is handling the reverse string as explained in my first post

Best

@slevithan slevithan changed the title unbalanced: returns output already found Improve XRegExp.matchRecursive handling of unbalanced delimiters May 15, 2016
@olavim
Copy link

olavim commented Apr 2, 2017

+1
I'm trying to find JSON objects from an arbitrary string using matchRecursive. If the string has unmatched curly brackets, I don't get any results since matchRecursive throws and doesn't return anything.

@dvargas92495
Copy link
Contributor

Is there anyone tackling this right now? if not, I'm more than willing to put up a PR for this as I would like to just ignore unbalanced delimeters for a project I'm working on. Am cool with it being default behavior or opt in

@josephfrazier
Copy link
Collaborator

I'm not aware of anyone working on it, go for it! Happy to try to help out along the way

@slevithan
Copy link
Owner

Yeah, that would be great!

Am cool with it being default behavior or opt in

It should start out as opt-in since changing the default behavior would require a major version change.

@slevithan
Copy link
Owner

Thanks to @dvargas92495 for the sweet/clean implementation in #332. It came out even better than expected, and XRegExp.matchRecursive now supports three handling options for unbalanced delimiters via the new unbalanced option: 'error' (default), 'skip', and 'skip-lazy'.

This will go out in the next minor release.

@slevithan
Copy link
Owner

slevithan commented Aug 3, 2021

Aside: There's lots of potential for additional future handling modes for unbalanced delimiters if there are strong use cases. Things like:

  • 'stop-matching-at-first-unbalanced'
  • 'skip-inner-unbalanced' (reverse the 'skip' mode's handling of skipping outer unbalanced delimiters)
  • 'include-trailing-unterminated-match' (treat end-of-string as closing any open delimiters)
  • 'unclosed-left-delimiters-end-at-next-left'

Or whatever else is actually useful for reasonably broad use cases.

@slevithan
Copy link
Owner

@dvargas92495 and others: This is now published on npm as v5.1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants