Change Request: Expand processor support to automatically process source maps #17398

Closed
1 task done
matwilko opened this issue Jul 21, 2023 · 8 comments
Labels
archived due to age (This issue has been archived; please open a new issue for any further discussion) · core (Relates to ESLint's core APIs and features) · enhancement (This change enhances an existing feature of ESLint)

Comments

@matwilko
Contributor

ESLint version

8.45.0

What problem do you want to solve?

Processors as currently specified are required to perform all the work of mapping to plain JS, and then mapping ESLint's error messages back to the source file. That keeps ESLint core tight and allows arbitrary files to be processed; it's a great extension point.

But it can be daunting to figure out the post-processing step, and mapping back from the intermediate files to the source file.

What do you think is the correct solution?

I'd like to propose expanding the processor functionality by adding support for returning Source Maps alongside the pre-processed files from preprocess.

If a source map is provided, then ESLint could automatically resolve the source location of its own warning messages, and the file/messages would not need passing back to postprocess. I believe this means it could also automatically apply any fixes, though I'm not 100% on that!

In the best-case, where every file returned from preprocess has an associated source map, postprocess wouldn't even need to be called - there'd be nothing for it to do.

This could look like the following:

{
	preprocess: function(text, filename) {
		// here, run whatever pre-processing you need, generating
		// extra files and a sourcemap mapping that file back to the source
		
		return [ // return an array of code blocks to lint
			{ text: code1, filename: "0.js", sourcemap: sourcemap1 },
			{ text: code2, filename: "1.js", sourcemap: sourcemap2 }
		];
	},

	// no post-process, all files returned will be auto-processed by source map
}

This would mean that if users wanted to integrate other tools that already generate source maps as processors, this would be almost trivial to accomplish 😄

Given that Mozilla provides a well-supported library for processing source maps, I don't think this would be too complex to integrate into ESLint.
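
For illustration only, remapping a single ESLint message with that library could look roughly like this (a sketch assuming the synchronous API of source-map@0.6, and a message/raw source map pair that ESLint would already have in hand):

const { SourceMapConsumer } = require("source-map");

// Sketch: given the raw source map a processor returned from preprocess and a
// message ESLint produced against the generated code, resolve the original
// position. ESLint message columns are 1-based; source-map columns are 0-based.
function remapMessage(message, rawSourceMap) {
    const consumer = new SourceMapConsumer(rawSourceMap);
    const original = consumer.originalPositionFor({
        line: message.line,
        column: message.column - 1
    });
    return { ...message, line: original.line, column: original.column + 1 };
}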

Participation

  • I am willing to submit a pull request for this change.

Additional comments

I understand that processor developers can already accomplish this today by writing the support in themselves and keeping their own map of files -> sourcemaps shared between preprocess and postprocess. But I think giving this first-class support could make those implementations a lot cleaner, and could encourage new processor development.

Some of this might be eliminated by #16999, but equally implementing an entire language plugin might be more daunting if the same result could be accomplished via a processor and source maps 😄.

The motivating example for this was expressed in #17380 and #17397, where source map support could make this sort of pre-processor easier to implement as a processor in ESLint.

@matwilko matwilko added the core and enhancement labels Jul 21, 2023
@nzakas
Member

nzakas commented Jul 25, 2023

I seem to recall discussing this previously at some point but can't find the reference at the moment. I know we have definitely looked at working with source maps several times in the past but always came to the conclusion that it was actually more complex than it was valuable.

In the case of processors, it's debatable that sourcemaps would be a simpler solution for reconstructing source code locations, and we'd then need to support two ways of doing the same thing for at least one major version.

I'm not opposed to exploring further, but I'd love to see a prototype-y bit of example code that shows what the experience would be for processor writers. Maybe look at eslint-plugin-markdown's processor as the example?

@btmills if you're still hanging around I'd be interested in hearing your thoughts on this.

@matwilko
Contributor Author

Sure, happy to prototype when I have a bit of time.

It'll at least give an opportunity to flesh out how feasible auto-fixing might be with sourcemaps, or whether that's the sort of thing that's going to need to be configurable by the processor.

e.g. I can see the markdown processor being fine with autofixing because it just pulls out whole chunks of code; but for the XUL processor case, where there's much more complex logic that could easily be broken by autofix transforms, it would probably need to disable autofixing and only highlight issues.

I'll focus on "can we map issues back without too much complication?", and then we can discuss whether to move forward to an RFC 😄

@nzakas
Member

nzakas commented Jul 28, 2023

You can probably start with an even simpler prototype: rewrite the markdown processor so that it creates and stores a sourcemap in preprocess() and then uses that sourcemap in postprocess() to output the updated location information. That way, you can test with ESLint as it is today to validate that the sourcemap approach can produce similar results.
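
In rough strokes, that prototype could have the following shape (a hypothetical sketch, not working eslint-plugin-markdown code; it assumes mozilla's source-map package with its v0.6 synchronous API and an imaginary extractCodeBlocks() helper):

const { SourceMapGenerator, SourceMapConsumer } = require("source-map");

// Shared between preprocess() and postprocess(), keyed by the Markdown filename.
const sourceMaps = new Map();

const processor = {
    preprocess(text, filename) {
        const maps = [];
        sourceMaps.set(filename, maps);

        // extractCodeBlocks() is a stand-in for the existing Markdown parsing.
        return extractCodeBlocks(text).map((block, index) => {
            const generator = new SourceMapGenerator({ file: `${index}.js` });

            // One mapping per extracted line, pointing back at its position
            // in the Markdown document.
            block.lines.forEach((line, i) => {
                generator.addMapping({
                    source: filename,
                    original: { line: line.sourceLine, column: line.sourceColumn },
                    generated: { line: i + 1, column: 0 }
                });
            });

            maps.push(generator.toJSON());
            return { text: block.text, filename: `${index}.js` };
        });
    },

    postprocess(messages, filename) {
        const maps = sourceMaps.get(filename);
        sourceMaps.delete(filename);

        return messages.flatMap((blockMessages, index) => {
            const consumer = new SourceMapConsumer(maps[index]);
            return blockMessages.map(message => {
                const original = consumer.originalPositionFor({
                    line: message.line,
                    column: message.column - 1 // source-map columns are 0-based
                });
                return { ...message, line: original.line, column: original.column + 1 };
            });
        });
    }
};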

@matwilko
Contributor Author

Yup, that was my intention; no point spending a lot of time poking at ESLint core if the proposal might not make it to RFC 😄

@btmills
Member

btmills commented Jul 29, 2023

I don't recall participating in conversations about source maps myself, but I did find a few previous threads.

I naively approached the Markdown processor's mapping assuming that adjustments would be straightforward "subtract $x$ columns and $y$ lines from each location." In that world, source maps would be enormous unnecessary complexity.
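
In that world, the entire translation would have amounted to something like this hypothetical helper (not the real eslint-plugin-markdown code, which has to handle far more):

function adjustMessage(message, lineShift, columnShift) {
    // Shift a message reported against an extracted code block back to its
    // position in the Markdown document.
    const adjusted = {
        ...message,
        line: message.line + lineShift,
        column: message.column + columnShift
    };
    if (message.endLine != null) {
        adjusted.endLine = message.endLine + lineShift;
    }
    if (message.endColumn != null) {
        adjusted.endColumn = message.endColumn + columnShift;
    }
    return adjusted;
}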

…Then I had to handle under-indented lines that start before column $x$. Then we added configuration comments that are part of the linted code but not part of the code block, affecting the $y$ lines shift. With all of the nuance, the postprocess translation logic is only a dozen and a half statements but more than 100 lines with explanatory comments.

Source maps have many more features than the Markdown processor requires. If source maps can eliminate the existing translation complexity without adding undue complexity of their own, that would reduce a significant barrier for building processors. A prototype is a great way to see if that plays out. There are extensive tests for the different cases that'll indicate whether the prototype is equivalent.

@matwilko
Contributor Author

matwilko commented Aug 4, 2023

OK, so while I was initially optimistic about the capabilities of source maps, after a bit of experimentation and a closer look at exactly what a source map is representing, it looks like this isn't going to be viable to implement in a general way that ESLint could handle by itself.

Just in case source maps come up again in the future, here are the problems I've found with this approach:

Source maps do not create a direct mapping between all valid line/column positions in the generated and source code, nor do they map ranges of positions. They will (in general) map any column on a line to the nearest explicit mapping backwards on that line.

The expectation is that mappings are generated for each parsed token, and the mapped line/column will refer to the start of the same token in the source and generated code. This means it is not possible to map from an arbitrary position in the generated code back to its exact corresponding position in the source - it can only map to the nearest mapped token position.

This means that generating a source map requires more than knowing which portions of the document you need to map into the generated code - a full parse is required so you can map the known tokens of whatever language you are using.

But crucially, the reverse mapping also requires this parse information - since the source map only gives the start position of any given token, the parsed source information is required to actually make sense of what any position mapping (especially one that is a range) is trying to refer to in the source.

As a simple, but hopefully demonstrative example, if we have the following:

// hypothetical source
<code>
var someVariable = someFunctionCall("something something tyqo something");
</code>
// generated
var someVariable = someFunctionCall("something something tyqo something");

There are a few ways we might think about constructing a source map:

  1. Do it "properly", and fire up a JS parser on the code between the <code> tags and get a proper AST, generating the intended mapping of tokens between source/generated docs. This would look vaguely like this (not an actual source map, just to demonstrate in shorthand): L1C1->L2C1, L1C5->L2C5, L1C18->L2C18, L1C20->L2C20, L1C36->L2C36, L1C37->L2C37, L1C73->L2C73, L1C74->L2C74

  2. Naively assume that a source map will map positions on a line properly when the lines are identical, and just generate L1C1-> L2C1.

  3. Generate a "pathological" source map, because we know we have a 1-1 mapping, and create exact line/column mappings: L1C1->L2C1, L1C2->L2C2,...

Now let's look at what happens if we had a hypothetical rule that highlighted the typo in the string, e.g. { start: { line: 1, column: 58 }, end: { line: 1, column: 62 } }.

  1. Both start and end would map to { line: 2, column: 37 }, that is, the start of the string token (see the sketch after this list). This would then need the information in our AST to deduce the range of the string token and properly map start and end in a useful way.

  2. Both start and end map to { line: 2, column: 1 }. Completely useless.

  3. Both start and end do correctly map to their exact positions in the source.
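
To make outcome 1 concrete, here's a small sketch of that lookup using mozilla's source-map library (v0.6 synchronous API assumed; note that the library's columns are 0-based, unlike the 1-based shorthand above):

const { SourceMapGenerator, SourceMapConsumer } = require("source-map");

// Approach 1: a token-level map, one mapping per token start.
// 0-based columns for: var, someVariable, =, someFunctionCall, (, "...", ), ;
const generator = new SourceMapGenerator({ file: "0.js" });
for (const column of [0, 4, 17, 19, 35, 36, 72, 73]) {
    generator.addMapping({
        source: "source.xml",            // hypothetical source filename
        original: { line: 2, column },   // the code sits on line 2 of the source
        generated: { line: 1, column }   // and on line 1 of the generated file
    });
}

const consumer = new SourceMapConsumer(generator.toJSON());

// Ask where the typo (generated line 1, 0-based column 57) came from:
console.log(consumer.originalPositionFor({ line: 1, column: 57 }));
// -> { source: "source.xml", line: 2, column: 36, name: null }
//    i.e. the start of the string token, not the typo itself.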

So, why don't any of these work for ESLint?

  1. While it's possible to make the mapping make sense in this case, ESLint can't be the one to do it - it can't support arbitrary source-side parsers and ASTs, or know exactly what extra manipulations are needed to make the mapping exact. This may change with the upcoming language changes? It might be something that could be designed into that system perhaps?

  2. No further explanation really needed. Not actually mapping properly at all.

  3. While this works if you squint, it is only useful for situations where you have a direct mapping between source and generated code, and you're willing to create a huge, bloated source map. It's also not how any other tool is going to provide a source map - half the point of the feature was to be able to reduce effort when other tools created source maps alongside their generated code, and the processor developer could simply pass it along. No real world tool is going to produce this kind of mapping.

So, TLDR, source maps aren't a magic mapping tool that will allow ESLint to map line/column positions automatically for any arbitrary generated code. They would require ESLint to have some knowledge of the structure of the original source, and the whole point of processors is that they keep anything non-JavaScript at arm's length from ESLint.

@matwilko
Contributor Author

matwilko commented Aug 4, 2023

Having gone through all this though, and given the complexity present in the markdown processor to manage what is mostly a 1-1 mapping of JS from the source to the generated code, I think there's potentially room for an alternative here that could still help eliminate a lot of the finicky mapping code.

As a broad stroke, the idea would be that instead of providing a string as the processed code, the processor could create a new GeneratedCode() object, and then incrementally write generated source to it through various methods that would internally maintain the mapping between source/generated positions.

This could then allow ESLint to perform the exact mapping when possible, and apply auto-fixes if the fix covers a contiguous mapped region. Any problems/fixes that can't be automatically mapped would then be passed to postprocess, so the processor only has to deal with corner cases.

Imagined example for the markdown processor:

function generateMappingForCodeBlock(generatedFileName, text, node, directives) {

    const generatedCode = new GeneratedCode(generatedFileName, text);

    for (const directive of directives) {
        generatedCode.writeUnmapped("/* ");
        generatedCode.writeFromSource(directive.position);
        generatedCode.writeUnmapped(" */\n");
    }

    const baseIndent = getBaseIndent(text, node);
    const startLine = node.position.start.line + 1;
    const endLine = node.position.end.line - 1;

    if (baseIndent === 0) {
        // The entire block is mapped 1-1, so just write the whole thing.
        generatedCode.writeFromSource({ start: startLine, end: endLine });
    } else {
        // The block is indented, so we need to write each line individually.
        for (let lineNo = startLine; lineNo <= endLine; lineNo++) {
            generatedCode.writeFromSource({ start: { line: lineNo, column: baseIndent + 1 }, end: lineNo });
        }
    }

    return generatedCode;
}

Obviously, there are some conventions I've introduced in there about exactly how you can specify positions by eliding column info; that's to make things a bit easier on the developer by punting questions like "what's the last column of the line?" to the code generator. And note that at no point does the developer have to think about the position information in the generated source - it's all tracked by GeneratedCode, and the developer just has to keep writing out code to it.

preprocess would then return the GeneratedCode object in place of the usual { text, filename } object.

Internally, the code generator keeps track of mappings from the generated code to the source, handling intricacies like writing multiple lines at once, how long each line is, etc., and can then both generate the final processed code for ESLint and exactly map (hopefully most) problems back to the source text.
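
To make that bookkeeping concrete, here's a hypothetical sketch of roughly what GeneratedCode might track internally (nothing like this exists in ESLint today, and it's simplified to single-line writes and character offsets rather than the richer position conventions used above):

class GeneratedCode {
    constructor(filename, sourceText) {
        this.filename = filename;
        this.sourceLines = sourceText.split("\n");
        this.generatedText = "";
        this.mappings = []; // { generatedStart, length, sourceLine, sourceColumn }
    }

    // Text that exists only in the generated file (wrappers, inserted comments, ...).
    writeUnmapped(text) {
        this.generatedText += text;
    }

    // Copy one source line verbatim (from startColumn onwards) and record a
    // contiguous mapping for the copied range.
    writeFromSource(line, startColumn = 1) {
        const slice = this.sourceLines[line - 1].slice(startColumn - 1);
        this.mappings.push({
            generatedStart: this.generatedText.length,
            length: slice.length,
            sourceLine: line,
            sourceColumn: startColumn
        });
        this.generatedText += slice + "\n";
    }

    // Map a character offset in the generated text back to an exact source
    // position, or null if it falls in unmapped text (ESLint would hand those
    // problems to postprocess instead).
    originalPosition(generatedOffset) {
        const m = this.mappings.find(m =>
            generatedOffset >= m.generatedStart &&
            generatedOffset < m.generatedStart + m.length
        );
        return m
            ? { line: m.sourceLine, column: m.sourceColumn + (generatedOffset - m.generatedStart) }
            : null;
    }
}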

In this case, the markdown processor would then only need to handle the edge cases in postprocess:

  • a code block has an indent, and a problem/fix spans more than one line - so the non-contiguous indent region needs to be accounted for
  • a problem highlights an entire directive comment and needs mapping to the html comment

Obviously, this would get much more complicated if there's a more complex transformation going on than picking bits of JS code verbatim out of other file formats. But I think what's proposed above potentially covers a lot of what processors were originally designed to handle, and if more complex transformations are needed, they can just drop back down to manual handling.

@nzakas
Member

nzakas commented Aug 11, 2023

Thanks for all of the information on source maps, that's very useful and will undoubtedly help when future discussions come up.

For the GeneratedCode idea -- let's split that off into a separate issue.

@nzakas nzakas closed this as not planned Aug 11, 2023
@eslint-github-bot eslint-github-bot bot locked and limited conversation to collaborators Feb 8, 2024
@eslint-github-bot eslint-github-bot bot added the archived due to age label Feb 8, 2024