Parsing while preserving all whitespace (in an ergonomic fashion) #142

amake · 2023-01-17T13:14:42Z

First, thanks very much to @renggli for this wonderful library, but also for helping out the community with the discussions here.

I have a PetitParser-based parser that I use in a viewer application. It parses to a collection of Dart classes comprising an AST.

I frequently get requests to support editing, which would require being able to round-trip a document through my parser, i.e. parse the document and then recreate it verbatim (modulo user edits) from the AST.

In particular, the whitespace can be significant and cannot feasibly be computed, even in an "opinionated" way à la gofmt or another formatter, so it must be retained from the original document.

One way would be to eschew whitespace-discarding mechanisms like trim() and explicitly retain every piece of whitespace. This is straightforward but seems very un-ergonomic.
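To make the first approach concrete, here is a minimal, library-agnostic sketch in Python (not PetitParser code). It assumes a toy grammar of words and single-character symbols; the names `Tok`, `tokenize`, and `render` are hypothetical. Each token keeps the whitespace that preceded it, so joining the tokens reproduces the input verbatim, and editing a token's text leaves the surrounding whitespace untouched.

```python
import re
from dataclasses import dataclass


@dataclass
class Tok:
    leading_ws: str  # whitespace that preceded the token, kept verbatim
    text: str        # the token itself (editable)


# One match = optional whitespace, then either a word or a single symbol.
_TOKEN = re.compile(r"(\s*)(\w+|[^\w\s])")


def tokenize(source: str) -> tuple[list[Tok], str]:
    """Split source into tokens; return them plus any trailing whitespace."""
    tokens, pos = [], 0
    while (m := _TOKEN.match(source, pos)) is not None:
        tokens.append(Tok(m.group(1), m.group(2)))
        pos = m.end()
    # Whatever is left can only be whitespace after the last token.
    return tokens, source[pos:]


def render(tokens: list[Tok], trailing: str) -> str:
    """Recreate the document from tokens, whitespace included."""
    return "".join(t.leading_ws + t.text for t in tokens) + trailing
```

With this shape, `render(*tokenize(src)) == src` holds for any input, and the cost is that every AST node built from these tokens has to carry the whitespace along, which is the ergonomic burden mentioned above.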

Another way I guess would be to retain the original document and have the AST store offsets into the document; then user edits could be applied by replacing relevant ranges, leaving whitespace intact.

My questions:

Does PetitParser lend itself more to one of these approaches? Or to yet another approach? Can anyone with experience with this problem offer advice?

Answered by renggli

renggli · 2023-01-17T14:37:53Z

Maintainer

This is a great question and an interesting topic. I think you captured the two approaches very well:

I agree that retaining every piece of whitespace is probably the easiest. For example, this is what this XML parser selectively does in the parts that matter (between tags; there is a feature request to also do it between attributes). This makes it trivial to support modifications and to serialize the unmodified parts to an identical output.
As you mention, the other way is to store the offsets in the AST. This is what this Smalltalk Parser does. Instead of capturing the position manually using the position() parser, it uses the token() parser, which wraps all primitive values into a Token that automatically includes the start and end offsets of the input. While this example does not support modifications, it can be done but is quite difficult to get right. The original implementation in Smalltalk tried to update the underlying input as long as possible, but fell back to pretty-printing the code if this failed.
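The offset-based approach can be sketched in a library-agnostic way as follows. This is Python, not PetitParser code: `OffsetToken`, `tokenize`, and `apply_edits` are hypothetical names, and the toy grammar (words and single-character symbols) is an assumption made for illustration. Whitespace is never stored; it survives because edits are spliced into the original input by range.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetToken:
    value: str
    start: int  # inclusive offset into the original input
    stop: int   # exclusive offset into the original input


def tokenize(source: str) -> list[OffsetToken]:
    """Skip whitespace entirely; the offsets are enough to recover it."""
    return [OffsetToken(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", source)]


def apply_edits(source: str, edits: list[tuple[OffsetToken, str]]) -> str:
    """Replace each token's range in the original input with new text.

    Edits are applied right to left so that the offsets of tokens
    further left remain valid while splicing.
    """
    out = source
    for token, replacement in sorted(
            edits, key=lambda e: e[0].start, reverse=True):
        out = out[:token.start] + replacement + out[token.stop:]
    return out
```

One caveat that hints at why this is hard to get right: once an edit changes the length of the text, every previously recorded offset to its right is stale, so this sketch assumes the document is re-parsed after each batch of edits.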
