Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Regex<AnyRegexOutput> to be used in the DSL. #504

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rxwei
Copy link
Member

@rxwei rxwei commented Jun 20, 2022

Apply a workaround to allow Regex<AnyRegexOutput> to be used in the DSL. This workaround emits each nested Regex<AnyRegexOutput> as a custom matcher so that it's essentially treated as a separate compilation unit.

A proper fix for this is to introduce scoped type erasure in the matching engine so that all type erasure (including the top-level one) goes through this model. I left some stubs in (beginTypeErase, endTypeErase) which I'll implement in a follow-up.

Since this implementation is using Executor.match(...) -> Regex.Match in the regex compiler, we need to add availability annotations to the Executor and Compiler types.

Resolves rdar://94320030.

@milseman
Copy link
Collaborator

Apply a workaround to allow Regex to be used in the DSL. This workaround emits each nested Regex as a custom matcher so that it's essentially treated as a separate compilation unit.

What does this do to anchors? Do we need to fix up the search bounds?

Copy link
Collaborator

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the intended long-term design? I.e. what are the erasure instructions doing?

@@ -278,6 +278,10 @@ extension Instruction {
///
case backreference

case beginTypeErase

case endTypeErase
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these instructions do? What's their semantics?

Copy link
Member Author

@rxwei rxwei Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments. The processor will maintain a stack of capture lists. beginTypeErase will push a new list onto the capture stack, so that all captures get added to that list. endTypeErase(_: ValReg) will convert all elements of the current capture list to an AnyRegexOutput, and moves it to the given value register, and pop the stack. It's similar to how matcher works.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not fully parsing this. Could you write some XFAIL tests that illustrate the work that remains to be done after this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no XFAIL-able tests. This PR fixes the bug, but it's not as efficient as handling this in the bytecode directly. I'd be happy to chat in person also.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would begin/endTypeErase work in the context of backtracking. For that matter, how does this PR support any backtracking into ARO at all, i.e. isn't this forcing it to be atomic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think backtracking into ARO will be supported, since savePoints currently stores the capture list. It can be extended to store the capture stack.

func testTypeErasedRegexInDSL() throws {
do {
let input = "johnappleseed: 12."
let numberRegex = try! Regex(#"(\d+)\.?"#)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case where the dynamic regex begins with a ^? I think that would help clarify the behavior you're proposing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one. The idea is to fix up the search bounds so that ^ refers to the start of the input.

It happens to work today because anchors currently use the base string's bounds, not the input slice's bounds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, there is a bug here that another bug is masking, but when that other bug is fixed, how do we fix this bug?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can illustrate the difference with a substring input, where the subject bounds are the substring's bounds and the search bounds are contained within.

@rxwei rxwei force-pushed the rdar94320030 branch 2 times, most recently from aa0ca5b to 63124f5 Compare June 21, 2022 22:45
Apply a workaround to allow `Regex<AnyRegexOutput>` to be used in the DSL. This workaround emits each nested `Regex<AnyRegexOutput>` as a custom matcher so that it's essentially treated as a separate compilation unit.

A proper fix for this is to introduce scoped type erasure in the matching engine so that all type erasure (including the top-level one) goes through this model. I left some stubs in (`beginTypeErase`, `endTypeErase`) which I'll implement in a follow-up.

Since this implementation is using `Executor.match(...) -> Regex.Match` in the regex compiler, we need to add availability annotations to the `Executor` and `Compiler` types.

Resolves rdar://94320030.
@rxwei rxwei marked this pull request as ready for review June 21, 2022 22:56
@rxwei
Copy link
Member Author

rxwei commented Jun 21, 2022

What's the intended long-term design? I.e. what are the erasure instructions doing?

#504 (comment)

Hope this clarifies the intended design. Let me know if you have any feedback.

@rxwei
Copy link
Member Author

rxwei commented Jun 21, 2022

@swift-ci please test

1 similar comment
@milseman
Copy link
Collaborator

@swift-ci please test

@rxwei
Copy link
Member Author

rxwei commented Jun 23, 2022

There's a small bug in there and the fix should be relatively simple. I'll take a look tomorrow.

// endTypeErase
let program = try Compiler(tree: DSLTree(child)).emit()
let executor = Executor(program: program)
return emitMatcher { input, startIndex, range in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matchers don't get the subject bounds, so I don't see how this would support anchors that refer to the subject bounds. We could have a different matcher interface or else add subject bounds as an additional parameter for the internal code (probably not API though).

if case .typeErase = root {
return self
}
return .init(node: .typeErase(root))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this on creation instead, or somewhere else?

var root = root
while case let .typeErase(child) = root {
root = child
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this scan in here? It seems like you'd want to capture an ARO, is that not possible?

ZeroOrMore(.whitespace)
Capture { numberRegex }
}
XCTAssertNil(input.wholeMatch(of: regex))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Positive match tests with anchors?

func testTypeErasedRegexInDSL() throws {
do {
let input = "johnappleseed: 12."
let numberRegex = try! Regex(#"(\d+)\.?"#)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, there is a bug here that another bug is masking, but when that other bug is fixed, how do we fix this bug?

func testTypeErasedRegexInDSL() throws {
do {
let input = "johnappleseed: 12."
let numberRegex = try! Regex(#"(\d+)\.?"#)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can illustrate the difference with a substring input, where the subject bounds are the substring's bounds and the search bounds are contained within.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants