Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement array shapes for preg_match() $matches #2589

Draft
wants to merge 35 commits into
base: 1.11.x
Choose a base branch
from

Conversation

staabm
Copy link
Contributor

@staabm staabm commented Aug 26, 2023

I tried solving the shapes with Hoa Regex AST, but I couldn't find the information about capturing groups on the ast nodes.
I don't see a way on how to identify where capturing groups are based on the AST.

therefore I tried the way suggested in phpstan/phpstan#9502 (comment)

@staabm
Copy link
Contributor Author

staabm commented Aug 26, 2023

@mvorisek please provide feedback and if you could come up with a few more test-cases, that would be awesome.

whats the expected result for phpstan/phpstan#9502 ?


public function isFunctionSupported(FunctionReflection $functionReflection, FuncCall $node, TypeSpecifierContext $context): bool
{
return in_array(strtolower($functionReflection->getName()), ['preg_match'], true);
Copy link
Contributor

@mvorisek mvorisek Aug 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preg_match_all should be supported as well

also handle/test all relevant preg_ $flags params like PREG_OFFSET_CAPTURE, PREG_SET_ORDER, ... which have substantial impact on the result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets do preg_match first and see what ondrey is thinking about it

@staabm
Copy link
Contributor Author

staabm commented Aug 27, 2023

memo to me: I need to fix PREG_UNMATCHED_AS_NULL on PHP <= 7.3
https://3v4l.org/3k0mr

@PrinsFrank
Copy link
Contributor

@staabm It would be really great to have this implemented now reportPossiblyNonexistentConstantArrayOffset is available! Can I help here?

@staabm
Copy link
Contributor Author

staabm commented May 14, 2024

I have this PR on my todo list and will rebase it in the near future.

You can help by verifying existing test expectations of this PR or help explore missing tests/cases

@staabm staabm force-pushed the regexshapes branch 2 times, most recently from c08a944 to 0ce2624 Compare May 15, 2024 17:20
@staabm staabm changed the base branch from 1.10.x to 1.11.x May 15, 2024 17:20
staabm and others added 14 commits May 18, 2024 10:52
fix n modifier

support PREG_OFFSET_CAPTURE

cs

separated 8.2 tests

support PREG_OFFSET_CAPTURE|PREG_UNMATCHED_AS_NULL

fix
Co-authored-by: Gregor Harlan <mail@gh01.de>
Co-Authored-By: PrinsFrank <25006490+PrinsFrank@users.noreply.github.com>
Co-Authored-By: PrinsFrank <25006490+PrinsFrank@users.noreply.github.com>
@staabm staabm marked this pull request as ready for review May 18, 2024 09:56
@phpstan-bot
Copy link
Collaborator

This pull request has been marked as ready for review.

@staabm
Copy link
Contributor Author

staabm commented May 18, 2024

In the end I combined both approaches. regex based retrieval of the capturing groups and the AST to identify how many non-optional groups are contained in the regex.

since the type inference now requires a regex which is parsable by Hoa\Regex I commented some type-assertions which currently don't work because the regex parser cannot parse it. I think these can be fixed in the future by applying fixes to the Grammar.pp.

I am neither a regex nor a grammer expert, therefore I left these things for other people more confident in this topic.
overall this means we get already a pretty good type narrowing based on a constant regex.


In a future PR I would improve the types to be even more precise for e.g.
'/Price: (£|€)/i' -> constant string £|€ instead of just string
'/Price: (\d)/i' -> numeric-string instead of string
'/Price: (\s){2,10}/i' -> non-falsey-string instead of string
...

src/Type/Php/RegexShapeMatcher.php Outdated Show resolved Hide resolved

function doNamedSubpattern(string $s): void {
if (preg_match('/\w-(?P<num>\d+)-(\w)/', $s, $matches)) {
// could be assertType('array{0: string, num: string, 1: string, 2: string, 3: string}', $matches);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we can see one example of a regex which is not parsable with the current hoa grammar:

Fatal error: Uncaught Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "-" (range) at line 1 and column 4:
/\w-(?P<num>\d+)-(\w)/
   ↑
in /Users/staabm/workspace/phpstan-src/vendor/hoa/compiler/Llk/Parser.php at line 1.
  thrown in /Users/staabm/workspace/phpstan-src/vendor/hoa/compiler/Llk/Parser.php on line 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two issues:

a) (?P<xxx>... is equivalent of supported (?<xxx>... - spec https://www.pcre.org/original/doc/html/pcrepattern.html (search for ˙?p<˙), grammar: https://github.com/hoaproject/Regex/blob/master/Source/Grammar.pp#L73
b) - range - it is reserved character strictly if in [ character group only (otherwise it is regular chanracter and very often used)

Copy link
Contributor Author

@staabm staabm May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Please provide code suggestions for fixes.

the example work in plain php, so we either need to fix the grammar or leave the typing for these cases behind

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind forking hoaproject/Regex and saving there a php test file I can run with all the issues you found?

Copy link
Contributor Author

@staabm staabm May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look into this PR here and find all commented assertions. these are the cases which hoa cannot parse.

I have spent the last the 3 days of freetime for your feature request. take your time to do the tests yourself. Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh I see. I got the impression you are working on all these cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into more cases later. This one was the most important. What about adding proper Hoa Regex unit testing to phpstan-src? We can assert the parsed regexes using https://github.com/mvorisek/Hoa-Regex/blob/0c8a5cbcf696df8f65168704af14ad5ebd48c641/test.php#L13-L25 function.

Copy link
Contributor Author

@staabm staabm May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, why not. Please send a PR to my branch/fork

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super familiar with phpstan-src codebase - where, what class can I base this unit test on?

Copy link
Contributor Author

@staabm staabm May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would create a new file in tests/PHPStan/Type. we can later on move it somewhere else if this doesn't work for ondrej.

Comment on lines +7 to +8
-%token anchor \\(bBAZzG)|\^|\$
+%token anchor \\([bBAZzG])|\^|\$
Copy link
Contributor Author

@staabm staabm May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unreleased hoa/regex upstream fix
hoaproject/Regex@5f670af

Comment on lines 16 to 17
- ( range() | literal() )+
+ ( <class_> | range() | literal() )+ <range>?
Copy link
Contributor Author

@staabm staabm May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unreleased hoa/regex upstream fix
hoaproject/Regex@ce7fd7b
hoaproject/Regex@e770ada

Copy link
Contributor

@mvorisek mvorisek May 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd patch was reverted, see #2589 (comment)

Comment on lines +25 to +26
-capturing:
+#capturing:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Seldaek
Copy link
Contributor

Seldaek commented May 18, 2024

This looks amazing, thanks @staabm.

Now I'm just wondering how I can get it to recognize Composer\Pcre\Preg::match and co as well to get the same quality of type output there :s

If you can think of some way to make this more reusable by third party libs wrapping preg functions it'd be great.

@staabm
Copy link
Contributor Author

staabm commented May 18, 2024

Now I'm just wondering how I can get it to recognize Composer\Pcre\Preg::match and co as well to get the same quality of type output there :s

If you can think of some way to make this more reusable by third party libs wrapping preg functions it'd be great.

at first I want to see what @ondrejmirtes thinks about all this. he was not yet involved.

After that: I can think of different ways to make this re-usable. for wrapper-libs which are 100% api compatible with the pcre_* native functions we could e.g.

  • rewrite the AST at analysis time
  • create separate extensions which re-use the underlying implementation
  • make it configurable via NEON config
  • introduce a new phpdoc type
  • ...

@Seldaek
Copy link
Contributor

Seldaek commented May 18, 2024

Ok cool thanks. Yes composer/pcre is almost one to one.. Except it throws instead of returning false. For the other api returning object results there it's a bit more complex but anyway I see your point and I'll let you wrap things up here first. Just wanted to let you know in case it can influence design decisions along the way but it's not urgent at all.

Comment on lines +11 to +12
}
assertType('array<string>', $matches);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
assertType('array<string>', $matches);
} else {
assertType('array{}', $matches);
}
assertType('array{string}', $matches);

when pattern is not matched, the $matches is reset to array{} - https://3v4l.org/NE0bm

@ondrejmirtes
Copy link
Member

Some initial thoughts:

  • This should be only for bleeding edge. There is a big potential for bugs so we should limit the impact.
  • There are more preg_ functions that would benefit from the implemented logic, right?
  • I don't like bending TypeSpecifyingExtension for this. This should either be hardcoded in NodeScopeResolver, or there should be some kind of new ParameterOutTypeExtension. Probably the latter, given we want to make this accessible to userland code.
  • There are different flavours of custom PCRE wrappers. Some return matches by-ref as the original functions, some return the matches normally via returned value. The list of our options in Implement array shapes for preg_match() $matches #2589 (comment) is comprehensive, I like "create separate extensions which re-use the underlying implementation" the most. This means that the actual implementation should be in an extension-agnostic service that other extensions can simply inject. This means the service has to be marked with @api.


$remainingNonOptionalGroupCount = $this->countNonOptionalGroups($regex);
if ($remainingNonOptionalGroupCount === null) {
// regex could not be parsed by Hoa/Regex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer emitting a PHPStan error with the parse exception message - the PCRE grammar is actually not so complex, so we can fix the Hoa grammar based on issue reports.

This would effectively also validate constant regexps for syntax errors.

The https://3v4l.org/sOXbn parsing hack can be then removed in favor of Hoa parser.

@ondrejmirtes
Copy link
Member

An idea: send a separate PR with DynamicParameterOutType extensions, and once that is merged, you can base this PR on that.

@staabm staabm marked this pull request as draft May 19, 2024 07:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants