Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impl PartialOrd for Cursor #1196

Closed
wants to merge 1 commit into from
Closed

impl PartialOrd for Cursor #1196

wants to merge 1 commit into from

Conversation

CAD97
Copy link
Contributor

@CAD97 CAD97 commented Jul 8, 2022

This is primarily intended for use with discouraged::Speculative parsing in order to implement one of the suggested error strategies, namely to show the error from the speculative parse which consumed the most tokens.

Example use case

I am implementing a macro which allows declaring a syntax tree enum and automatically provides a PEG-style parse implementation which tries each variant in order. This is, in fact, the express reason I added advance_to back in 2019; I guess it took me a while to actually go back and try to finish the experiment.

Yes, I know that syn is deliberately biased to LL(3) parsing. In fact, my original macro was a much more complicated grammar specification to allow it to describe LL(3) lookahead, but that complexity came at a cost of being painful to work both on and with. The intent of the new vastly simplified macro just using PEG is to use declarative PEG where it is enough, but still allow (and encourage) writing and using manual syntax elements mixed in with the declarative elements. Additionally, PEG exhibits much the same performance characteristics as LL(3) when the described grammar is in fact LL(3), as each speculative parse terminates within the three leading tokens, though it does still suffer from worse errors.

The currently emitted parse implementation is

impl $crate::syn::parse::Parse for $Name {
    fn parse(input: $crate::syn::parse::ParseStream) -> $crate::syn::parse::Result<Self> {
        use $crate::syn::parse::discouraged::Speculative;
        let mut best_guess = ::std::option::Option::None;
        let mut best_guess_variant = "";
        let mut best_guess_remaining = ::std::primitive::usize::MAX;
        $({
            let fork = input.fork();
            match fork.parse::<$crate::__switch! {
                if { $($NameVariant)? }
                do { $($NameVariant)? }
                else { [<$Name $Variant>] }
            }>() {
                ::std::result::Result::Ok(v) => {
                    input.advance_to(&fork);
                    return ::std::result::Result::Ok($Name::$Variant(v));
                }
                ::std::result::Result::Err(e) => {
                    // TODO: ask syn for a Cursor::cmp to optimize this
                    let this_guess_remaining =
                        fork.cursor().token_stream().into_iter().count();
                    if this_guess_remaining < best_guess_remaining {
                        best_guess = ::std::option::Option::Some(e);
                        best_guess_variant = ::std::stringify!($Variant);
                        best_guess_remaining = this_guess_remaining;
                    }
                },
            }
        })*
        ::std::result::Result::Err(input.error(format_args!(
            "expected {} but failed to parse any variant; best attempt was {} with {}",
            ::std::stringify!($Name),
            best_guess_variant,
            best_guess.unwrap(),
        )))
    }
}

Of note is the failure case:

::std::result::Result::Err(e) => {
    // TODO: ask syn for a Cursor::len to optimize this
    let this_guess_remaining =
        fork.cursor().token_stream().into_iter().count();
    if this_guess_remaining < best_guess_remaining {
        best_guess = ::std::option::Option::Some(e);
        best_guess_variant = ::std::stringify!($Variant);
        best_guess_remaining = this_guess_remaining;
    }
},

In order to determine how far the forked ParseBuffer got, the first instinct would be to compare the .span()s. However, comparing spans doesn't work in proc macros, and does the wrong thing anyway, since the spans could have arbitrary relations due to macro expansion. So I necessarily fall back to the only other option: counting the remaining unparsed tokens. With the Cursor API, this is most easily done by creating a new .token_stream(), but could also potentially be done by counting the length of an iterator constructed from successively using .token_tree() to get an advanced cursor.

However, the Cursor implementation has the needed information to say which Cursor is more advanced already, via the buffer entry pointers. Exposing Cursor ordering in this manner allows determining which speculative parse got the furthest just by comparing the cursors:

::std::result::Result::Err(e) => {
    let this_guess_cursor = Some(fork.cursor());
    if this_guess_cursor > best_guess_cursor {
        best_guess = ::std::option::Option::Some(e);
        best_guess_variant = ::std::stringify!($Variant);
        best_guess_cursor = this_guess_cursor;
    }
},

An alternative is to add an API directly to ParseStream which allows comparing them, but it can't be part of Speculative, because we didn't seal the trait.

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 8, 2022

(I'm up to date with master, so I don't think those CI failures are me? I cannot reproduce locally.)

@dtolnay
Copy link
Owner

dtolnay commented Jul 8, 2022

I am on board with a PartialOrd impl, but I don't think this implementation works. With a None delimited group, comparing the following two cursors would give a random answer depending on how the memory allocator arranged the token buffers.

1 «2 3» 4
     ^  ^

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 8, 2022

Oh, I misunderstood how the cursor implementation works when I skimmed lightly. (Mainly because since bump only does a ptr increment I thought the buffer must be contiguous.) I'll take another look sometime next week to see if I can figure out a working implementation. And write some tests.

@CAD97 CAD97 force-pushed the cursor-cmp branch 2 times, most recently from 487c114 to 58cc1fd Compare July 25, 2022 00:15
@CAD97
Copy link
Contributor Author

CAD97 commented Jul 25, 2022

Given the tree nature of the TokenBuffer, there's no clever way to compare cursors directly. As such, the simplest and perhaps only way works out: just keep track of how many entries we've walked past, and compare that count.

If you give me a pointer on where/how to add a test, I'll add a couple.

src/buffer.rs Outdated
fn partial_cmp(&self, other: &Self) -> Option<cmp::Ordering> {
let Cursor { index, scope, .. } = self;
if *scope == other.scope {
Some(index.cmp(&other.index))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It does look odd to PartialEq over (ptr, scope) and PartialOrd over (index, scope), but this should be consistent.)

@dtolnay
Copy link
Owner

dtolnay commented Jul 25, 2022

I am not pleased that this implementation adds bookkeeping overhead to every cursor operation (which underlies all parsing) even in programs that never use Speculative. That seems like the wrong tradeoff. Would you be able to come up with any other ways to implement this using only the existing data structures?

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 25, 2022

I'm not super happy with it either. At the same time, though, this is a fairly small cost, so it might not be problematic?

I think the only solution for comparing Cursors that's zero overhead when not comparing cursors is to count the remaining tokens, i.e. call bump until we reach Cursor::End (iter::successors(Some(c), Cursor::skip)). This is at least better than what can be done externally (iter::successors(Some(c), |c| c.token_tree().map(|(_, c)| c)).count()), so perhaps that's good enough for including it (or perhaps just exposing skip) and letting speculative parsing eat the cost.

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 25, 2022

The reason I don't think anything better is possible is that cursors can be arbitrarily many None-delimited groups deep.

The best middle ground I can think of is to carry along a pointer to the base stream, compare that, and only walk if they have the same base pointer.

(This had a different idea previously but that was broken.)

@dtolnay
Copy link
Owner

dtolnay commented Jul 25, 2022

cursors can be arbitrarily many None-delimited groups deep.

In practice they are not.

You should be able to efficiently compute the depth of a cursor without using skip, by traversing the scope pointer to jump directly to the end of the current group, and then traversing the Entry::End pointer to go up one group.

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 25, 2022

Well, I've found a sufficiently clever solution, I think. Clever enough that I'm not 100% confident betting on it without some good tests. But I think this handles all cases without falling back to counting skips, and works fast in the common case that one or both pointers are not in a None-group.

cases to test
# both rooted
« ∘ ● ∘ ● ∘ »

# rooted, group before
« ∘ « ∘ ● ∘ » ∘ ● ∘ »

# rooted, group after
« ∘ ● ∘ « ∘ ● ∘ » ∘ »

# rooted, group at eof
« ∘ ● ∘ « ∘ ● ∘ » »

# different group
« ∘ « ∘ ● ∘ » ∘ « ∘ ● ∘ » ∘ »

# increased nesting
« ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ « ∘ « ∘ ● ∘ » ∘ » ∘ »

# all of the rooted cases, but in a group

...but this also lets me realize that this doesn't actually fully solve the problem that I want to solve, because if two speculative ParseStreams fail in separate positions in the same {} block, the Cursor held by both forks will still be to the same block.

Well, just call that a limitation. Even with my PEG-generating-macro, I still intend to also encourage enums to be hand-parsed in LL(3), rather than relying on PEG. (Also, Error::combine is new from the last time I was browsing the docs, I think; using that is perhaps the better approach, and offers an escape hatch if two forks do stop in the same token tree.)

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 25, 2022

If this implementation looks okay, I'll replace range.contains and find a good way to put some tests in.

Comment on lines +422 to +425
// If we don't have the same scope end, the cursors are incomparable.
if self.scope != other.scope {
return None;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, I didn't realize this is hard to handle. I wonder if this makes the whole impl fruitless for your use case. For example your "rooted, group before" or "different group" test cases, but with parens instead of None groups:

# rooted, group before
∘ ( ∘ ● ∘ ) ∘ ● ∘

# different group
∘ ( ∘ ● ∘ ) ∘ ( ∘ ● ∘ ) ∘

One would probably require the first black dot to compare less than the second black dot, but with this impl a < b and b < a would both be false.

One would need to know to write a < b && !(b < a) in order to get an actual meaningful comparison, or a.partial_cmp(b) directly.

Can we just safely keep going with both cursors regardless of scope? The very very outermost Entry::End contains ptr::null(), so we can look for that, instead of paying attention to scope at all. Basically my mental model is this impl shouldn't care at all about any distinction between non-None groups, explicitly entered None groups, and implicitly entered None groups. The only point of scope is distinguishing the implicitly entered None case from the other two (if I'm recalling correctly).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I realize the objection: that's inconsistent with PartialEq. If we ignore scope, PartialEq could say cursors are not equal while PartialOrd could say they are equal (same ptr, different scope).

I wonder what could possibly go wrong if PartialEq looked at ptr only.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay the existing PartialEq impl for Cursor is probably just wrong and bad. The only place in our codebase that uses it assumes it works the other way, and crashes with the current impl.

// [dependencies]
// syn = { version = "1", features = ["full", "extra-traits"] }
// proc-macro2 = "1"
// quote = "1"

use proc_macro2::{Delimiter, Group};
use quote::quote;

fn main() {
    // Okay. Rustc allows top-level `static` with no value syntactically, but
    // not semantically. Syn parses as Item::Verbatim.
    let tokens = quote!(pub static FOO: usize; pub static BAR: usize;);
    let file = syn::parse2::<syn::File>(tokens).unwrap();
    println!("{:#?}", file);

    // Okay.
    let inner = Group::new(Delimiter::None, quote!(static FOO: usize = 0; pub static BAR: usize = 0));
    let tokens = quote!(pub #inner;);
    let file = syn::parse2::<syn::File>(tokens).unwrap();
    println!("{:#?}", file);

    // Parser crash.
    let inner = Group::new(Delimiter::None, quote!(static FOO: usize; pub static BAR: usize));
    let tokens = quote!(pub #inner;);
    let file = syn::parse2::<syn::File>(tokens).unwrap();
    println!("{:#?}", file);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I "fix" equality to only care about ptr as well? This is safe, comparing cursor positions ignoring "partial eof" does seem generally more useful (and certainly is for the "which cursor is further along" query1).

If we do decide to do comparison ignoring "partial eof," I think the routine should be

  • If self.ptr == other.ptr, return Equal.
  • Walk self.ptr to self.scope, recording traversed ranges.
    • If other.ptr is in these ranges, return Less. (This step is purely an early-out optimization.)
  • Walk other.ptr to other.scope, recording traversed ranges (iff self.scope != other.scope).
    • If we enter self's traversed ranges, return self_range.start <=> other_range.start.
    • If self.scope == other.scope, unreachable!().
  • Walk self.scope to End(nullptr), recording traversed ranges (iff other.scope != End(nullptr)).
    • If we enter other's traversed ranges, return self_range.start <=> other_range.start.
  • Walk other.scope to End(nullptr).
    • If we enter self's traversed ranges, return self_range.start <=> other_range.start.
  • return Incomparable.

since compared cursors being in the same scope is still a common case worth optimizing for, I think.

This is IIUC $O\left( d_{self} b_{self} + d_{other} b_{other} + d_{self} d_{other} \right)$ time and $O\left( d_{self} + d_{other}' \right)$ space (where $d$ is cursor depth, $d'$ is cursor depth-to-scope (negligible), and $b$ is block length), and I think optimal big-$O$ only achieves that complexity with $d' = 0$ (by walking self all the way first).

More in the style of big-$O$, that's $O(d^2)$ time and $O(d)$ space. The polynomial is unfortunate but required without a tree restructuring2 or storing sufficient-for-comparison information on the cursors3. The tree restructuring is a big ask, and storing more on Cursor is penalizing normal LL(3) consumers for niche consumers who want to compare cursors.

Just counting tokens-to-end is $O(2n)$ time, $O(1)$ space; $n$ is the flattened length of the token stream. Since $d \approx \log n$, the comparison walk is $\approx O(\log^2 n)$ and better than the trivial tokens-to-end impl in nearly all4 cases.

The interesting thing to note, though, is that what is actually wanted is [Cursor]::sort, not Cursor::partial_ord, and comparing for sort can skip recalculating the walk-to-End(nullptr) list for each cursor on each sort. Though IIUC this won't show up in big-$O$, it could be a significant improvement in practice.

Footnotes

  1. Although it turns out my use case doesn't have that cursor; a parse stream's cursor is IIUC in an unspecified state after parse error and in practice is after the block that the error occurred in, not at the position of the error. Providing the cursor with the error is impossible since Error: 'static (and that's a good thing; much better than the relatively minor annoyance of ).

  2. e.g. a flat representation that makes cursor comparison purely self.ptr <=> other.ptr.

  3. e.g. an index of tokens yielded since the beginning of the root buffer, and a pointer to the root buffer to check they're comparable if you want to not just give consistent-but-meaningless results for comparing cursors in different root TokenBuffers.

  4. The exceptions also being somewhat mitigated by early-exit, since when $d \not\ll n$, much of the nesting is likely shared. Exceptions are constructed and not likely in practice.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I "fix" equality to only care about ptr as well?

Yes I think we should change this. I would be shocked if anyone had found a use for the current behavior, and I would be shocked if changing it wouldn't fix a significant fraction of places the impl is used, much like it fixes our own buggy use.

@dtolnay
Copy link
Owner

dtolnay commented Jul 27, 2022

As for tests: someday we should organize those but for now you can create a new top-level integration test file in the tests dir. Thanks!

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 27, 2022

Sorry for stumbling in with an unexpectedly difficult problem 🙃 I honestly originally thought that self.ptr <=> other.ptr was sufficient1...

I think I've determined my original use case isn't served by Cursor comparison2, but I'm still happy to see this PR to completion because it could be useful in other cases3 and just generally seems like something the API should provide.

I've got some time this weekend to pursue whichever option you think is most worth trying. (Walk-to-End(nullptr) and compare, that but cached for sort, or refactoring TokenBuffer to a completely flat buffer.)

After writing out this and the inline comment, I'm actually quite partial to trying for a flat tree repr...

Footnotes

  1. Giant4 refactor to make it sufficient: "just5" serialize each nested group into the same contiguous buffer as they're encountered, and instead of Entry::Group(TokenBuffer) and Entry::End(*const Entry), have Entry::Group(usize) and Entry::End(usize), where the usize is the offset to the matching group start/end. The observation that makes this possible for TokenBuffer but not TokenStream is not doing consuming iteration handing out TokenStream to the group insides. This could be positive or negative on perf, though I'd give a slight odds for it being positive to store the whole thing in one buffer.

  2. There's no way to get the Cursor of the failing location from a failed ParseStream::parse, so the forked failed Cursor the ordered choice parser actually has is fairly uselessly pointing to after the block the failure occurred in (but this is unspecified behavior).

  3. Consider e.g. instead of PEG's ordered choice, taking the successful arm from multiple which consumed the greater number of tokens.

  4. Though in theory, actually not that complicated. I've done this refactoring before with other immutable trees, and it's often surprisingly simple if nothing goes wrong. (And if something goes wrong, it's usually just not possible to do the transform.) If it works, it might even be simpler than the scheme for comparing Cursors into the treed tree implementation.

  5. With the scariest scare quotes possible.

@dtolnay
Copy link
Owner

dtolnay commented Jul 27, 2022

I'd be on board with pursuing a flat representation. I think your assessment is correct that it has odds in favor of performance positive, and obviously it makes PartialOrd easy and fast.

I am also open to exploring possible ways to get a more meaningful cursor from a failed parse on a fork.

@dtolnay
Copy link
Owner

dtolnay commented Jul 27, 2022

Oh for the flat representation work: make sure to fetch from master since I was poking around in TokenBuffer construction recently, since it had to be paged in for reviewing this PR anyway.

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 28, 2022

Closing in favor of the much simpler #1223

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants