docs(rfc): An alternate RFC for common analyzer/driver interface. #3339

Open, wants to merge 17 commits into base: master
Conversation


@jaysachs jaysachs commented Dec 21, 2018

This document serves to satisfy issue #3306, and is an evolution of RFC 2966. It defines the basic terminology, channel framing, message formats, and methods by which a driver and an analyzer communicate uniformly via pipes or other reliable byte streams for the "control plane". File contents and analyzer output are handled via the analyzer's filesystem, presumably virtualized to some degree by the driver. It furthermore attempts to generalize to less Kythe-specific applications.
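To make the control-plane idea concrete, here is a minimal sketch assuming newline-delimited JSON-RPC 2.0 messages over the byte stream. The `done` notification is quoted later in this thread; the helper names and the newline framing are illustrative assumptions, not the RFC's definitions.

```python
import json

def encode_message(method, params, msg_id=None):
    """Frame one JSON-RPC 2.0 message as a single newline-terminated line.

    Newline-delimited framing is an assumption for illustration; the RFC
    itself defines the actual channel framing.
    """
    msg = {"jsonrpc": "2.0", "method": method, "params": params}
    if msg_id is not None:
        msg["id"] = msg_id
    return json.dumps(msg, separators=(",", ":")) + "\n"

def decode_message(line):
    """Parse one framed line back into a message dict."""
    msg = json.loads(line)
    assert msg.get("jsonrpc") == "2.0"
    return msg

# The "done" notification quoted later in this thread:
line = encode_message("done", {"message": "success"})
assert decode_message(line)["method"] == "done"
```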

@jaysachs jaysachs changed the title An alternate RFC for common analyzer/driver interface. docs(rfc): An alternate RFC for common analyzer/driver interface. Dec 21, 2018
@schroederc schroederc requested a review from zrlk December 21, 2018 22:48
as well as the output file to which to emit analysis output. It
reflects some of the kythe.proto.CompilationUnit structure. The
driver (and driver environment) MUST make contents available for
each file referenced in the compilation record at the file system
Contributor:

This is actually a little tricky. The compilation unit records physical paths (in the info field), as they were spelled by the build system at the moment when the data were captured. Those paths often assume the existence of a file tree with symlinks pointing outside the repository root.

To actually set that up you'd need to solve a set of constraints: Your virtual (or actual) filesystem root has to be high enough that all the up-pointers have something to traverse, and that down-steps are correctly populated within that hierarchy.

If the analyzer already virtualizes file access, then this is easy: You just map the path it requests directly to the data in the compilation record. But allowing the analyzer to access the filesystem through the system API means you have to actually create those structures (links, directories, etc.) and get their access permissions semi-right.

This isn't a dealbreaker, but it is an aspect of this design that requires much more detail than the old one needed.

Contributor Author (jaysachs):

I want to make sure I understand your point here. If you're "simply" saying that the driver may need to virtualize an arbitrary portion of the filesystem, I agree. That's effectively what the in-analyzer file system virtualization is doing.

What I (naively) imagined is that we'd set up the virtualization mapping precisely for every path mentioned in the compilation record. I believe(d) that this is independent of symlinks, or permissions -- that the driver virtualization gets first crack at any path.

If that's not the case, i.e. the driver is further downstream, then I agree there are some more subtleties here.
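The "virtualization mapping set up precisely for every path mentioned in the compilation record" can be sketched as a plain dict lookup, assuming the analyzer mediates its own file access (the easy case discussed above). The class and method names here are hypothetical, not part of the RFC.

```python
class CompilationVFS:
    """Maps the exact paths recorded in a compilation record to file bytes.

    This treats virtualization as a simple map lookup, with no symlink,
    permission, or stat semantics involved: the driver populates the map
    from the compilation record, and the analyzer asks for recorded paths.
    """

    def __init__(self, required_inputs):
        # required_inputs: iterable of (recorded_path, content_bytes) pairs.
        self._files = dict(required_inputs)

    def open_bytes(self, path):
        """Return the recorded content for path, or fail as the OS would."""
        try:
            return self._files[path]
        except KeyError:
            raise FileNotFoundError(path)

vfs = CompilationVFS([("src/main.cc", b"int main() { return 0; }")])
assert vfs.open_bytes("src/main.cc").startswith(b"int main")
```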

Contributor:

[T]he driver may need to virtualize an arbitrary portion of the filesystem

It's a bit worse than that, though that is the basic problem.

Compilation records in general include paths that are aggregated from multiple locations in the workspace, and are accessed via symlinks and relative paths. To virtualize this correctly, the driver must compose an overlay view of multiple filesystems as they were organized at the time of extraction, faithfully implement symlinks and permissions to the point where the VFS will return correct results to the kernel, and ensure that the process working directory for the analyzer is set up correctly.

In other words, you basically have to recreate a container image of the original compiler execution. You can of course do so, but at that point it seems like you're better off just using an image (e.g., Docker) with layered dependencies. Trying to shim the filesystem calls with (say) gVisor is a lot of work and will, I predict, be a maintenance nightmare.[1] Using Docker images would obviously be a better solution for the way the open-source world builds, but it will make life harder for Google-internal builds, which don't expose any of that.

Having the analyzer mediate file lookups is actually much simpler and easier to get right, though as you pointed out it means each analyzer has to do it. Sometimes (usually?) that can be simplified with libraries, since often the analyzer doesn't need filesystem semantics at all, just a mapping from path to data. In Clang, for example, this is just an in-memory string-to-string map (albeit with a bad API), and once it's set up the analysis is self-contained.

Just in terms of maintenance, I'd argue it would be better to give up on providing files "on demand" than to take on ownership of a filesystem shim sophisticated enough to recreate the execution environment at extraction time. Image construction and management is a big problem and isn't really the Kythe project's concern. At this level of detail we should use off-the-shelf tools rather than rolling our own.

What I don't fully understand yet is how much additional context that would require us to carry around between the build system and the analysis tools: The existing implementation was intended to minimize the amount of context that has to be propagated, but that may not be an important enough goal to keep.

[1] I have a lot of experience with this kind of code, and would be glad to discuss it in more detail offline if you like. :)

Contributor Author (jaysachs):

I've spent a lot of time thinking about this, and I don't see the issue. But I'm new here, so clearly there's plenty of room for ignorance and misunderstanding ...

The compilation record stores, along with the byte content, the actual path (possibly relativized, and maybe bastardized for e.g. jar files) requested from the host OS, i.e. before any symlink/hardlink/overlay/blah is traversed/resolved. It's precisely that path that the virtualization layer will intercept requests for. Why is there anything to worry about here?

Contributor:

Why is there anything to worry about here?

When you virtualize the filesystem interface, you have to participate in path traversal and fulfill stat requests. Permissions, link counts, file types, and so on will need to be made "right" (or at least "right enough") to avoid spurious and hard-to-debug failures.

As far as the content goes, there isn't any problem. But whereas explicitly virtualizing the lookup interface inside the analyzer lets you treat this as a simple map lookup, emulating the real filesystem handlers requires much more mechanism. You're right, of course, that this can be done; my point is that the spec will need to be much more explicit about the rules. When it goes wrong, the failure modes arising from bad stat results can be subtle.

As a case in point: GNU find trusts that the link counts it reads from the stat record are correct, and will only traverse a prefix of a directory whose link counts are wrong.
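For context on why those link counts matter: on a classic Unix filesystem a directory's `st_nlink` is two plus its subdirectory count, and GNU find's leaf optimization trusts that arithmetic to stop recursing early. A sketch of the invariant a faithful VFS would have to report:

```python
def expected_dir_nlink(subdir_names):
    """Link count a directory carries on a traditional Unix filesystem:
    one link for the directory's own name in its parent, one for its '.'
    entry, plus one per subdirectory's '..' entry. GNU find subtracts 2
    and stops descending once it has seen that many subdirectories, so a
    VFS reporting a smaller value makes find silently skip entries."""
    return 2 + len(subdir_names)

assert expected_dir_nlink([]) == 2
assert expected_dir_nlink(["a", "b", "c"]) == 5
```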

Contributor:

I wonder how much this actually occurs in practice, though.

I don't know. But I have seen some compilers do some odd things with the filesystem in order to obtain locking, memory-map files, and so on. Hopefully most of that stuff is ancillary to reading input files and can be ignored—but I do think this spec will need to spend some more time laying out the rules for what the virtualization promises (and thus what the analyzer can rely on). When we were going to make the analyzer figure out how to inject itself into the I/O mechanism of the underlying compiler explicitly, these things didn't need much discussion.

Contributor:

Implementing any and all of the weird POSIX or non-POSIX-but-damn-the-torpedoes corner cases expected by compilers-used-to-implement-analyzers is unlikely to be feasible. In the real world, I expect most compilers will fall into one of three categories:

  1. Compilers which do weird corner-casey-crap but have a Filesystem abstraction which can be implemented by the analyzer to not do so (or use a custom protocol directly as would be required by RFC2966).
  2. Compilers which do straightforward create/write open/read operations (but have no filesystem abstraction)
  3. Lost causes

Both this RFC and RFC2966 can support 1 with roughly the same amount of work, but this (or something similar) is required for 2. Given that, I'd be very tempted to piggyback on the operations supported by something like 9P2000, if only because there is a broad set of implementations across platforms and it has better-specified (and more limited) semantics. The "file operations" specified by RFC2966 are unlikely to be sufficient for pretending to be a filesystem. That said, I suspect including an RPC for emitting analysis output, rather than requiring the filesystem to be used for output as well as input, will be simpler and sufficient.

The driver will also have some limited ability to rewrite the paths it sends to the analyzer. It should also have a pretty good idea of what files the analyzer may try to open as it shouldn't be opening anything not in the CU.

Contributor Author (jaysachs):

I'll admit that the main reason I went with the filesystem for output rather than RPC is pushback on multiple encodings. I don't have a strong reason for pushing the filesystem as long as the output encoding can be negotiated (and efficiently supports length-delimited protos in addition to JSON), except that it keeps the RPC protocol simple and pure text.

Contributor:

It is certainly true that using a filesystem for output is a separate concern from using a virtual filesystem for input. I think the latter is the better-motivated case, and doesn't strictly need to be tied to the former.

(edit: fix typo)

Contributor:

Either way though, my main point here was that the spec should probably say more about what the VFS will provide.


- `vname(VNameRequest)`

Requests the complete VName to emit for the specified path and signature.
Contributor:

Could detail a bit in which situation is this useful? How does the analyzer get hold of a path+signature, but can't construct the VName itself?

Later example indicates the analyzer doesn't need to know the corpus/root upfront anymore. Is that the motivation?

Contributor Author (@jaysachs, Jan 16, 2019):

Yes, it relieves the analyzer from having to thread through those two fields. More significantly, though, it opens up the possibility to handle the case when semantic entities "belong" to some other corpus and/or root: one could imagine wanting to associate files of dependencies with some "well known" corpus name, or that build-time generated files should have some specific root. In these cases, some component needs to "understand" the mapping of path to corpus/root, and it makes sense to insulate the myriad indexers from how to determine that mapping and encapsulate the implementation in the driver.
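The driver-side encapsulation described here could answer `vname()` requests from a longest-prefix rule table mapping paths to corpus/root. The sketch below is purely illustrative; the rule contents, defaults, and function names are hypothetical, not part of the RFC.

```python
def resolve_vname(path, rules, default=("main-corpus", "")):
    """Return (corpus, root) for a path via longest-prefix match.

    The point is that the driver owns this mapping, so individual
    indexers never hard-code corpus or root values themselves.
    """
    best_len, result = -1, default
    for prefix, corpus, root in rules:
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len, result = len(prefix), (corpus, root)
    return result

# Hypothetical rules: dependencies get a well-known corpus, generated
# files get a distinct root.
rules = [
    ("third_party/", "well-known-deps", ""),
    ("bazel-out/", "main-corpus", "genfiles"),
]
assert resolve_vname("third_party/zlib/zlib.h", rules) == ("well-known-deps", "")
assert resolve_vname("bazel-out/foo.pb.h", rules) == ("main-corpus", "genfiles")
assert resolve_vname("src/app.cc", rules) == ("main-corpus", "")
```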

Contributor:

The analyzer was already freed of the need to reason about vname mapping by the compilation record. As I understand it, the real change here is providing that information dynamically during the analysis instead of ahead-of-time. That's actually an interesting change, and I'm not entirely clear on the motivation for it.

We talked about various ways to solve this issue in the original implementation, and settled on making the build system capture that knowledge in the CU (that's what the vname fields are for, in the CU and its required inputs). I'm not sure what the argument for requesting it on the fly is, but it's a viable approach.

That said: We should probably discuss that question.

Contributor Author (jaysachs):

Right, that was too early in the AM for a reply. The per-file VName still would be present in the CU proto. The real reason for this API is that with the files being read from the file system, the only remaining motivation to expose the raw compilation unit mapping (FileInput) to the analyzer would be to supply the VName overrides. Thus the introduction of this API to encapsulate things. While it does permit dynamic non-CU-supplied VName construction, that wasn't its primary intention.

Contributor:

Ah! Yes, that makes sense.

Contributor:

It also turns out that it's incredibly common to get this mapping wrong, particularly for generated files and (presumably) files from a different corpus. Indexer authors routinely emit references using the raw "path" disregarding the CU/required_input VName, often up to hard-coding various values for "corpus" as well. Abstracting this behind an interface should substantially help with this problem.

That said, I think the generic vname() interface may still be a bit too low-level. There are a variety of other node kinds which have structural, factual or other VName requirements (tapp comes to mind), so it might be nice to express those requirements via an RPC/proto interface in some fashion.

But file/anchor VNames are the low-hanging fruit here.

The output file should contain the content appropriate to the
analysis-type, encoded as specified in the `InitRequest`. For Kythe index
analyses, this would be `kythe.common.Entry` objects. Other analyzers might
emit a "findings" structure.

Comment:

This freedom is going to add complexity. Maybe the canonical output should be a gRPC Any proto? Or a finding proto with an Any field that can be used to encode Entries, etc.?

Contributor Author (jaysachs):

Where do you see the complexity? I'd expect the driver would know a priori what type of data to expect.

Contributor:

This freedom does add complexity, but of a kind that is probably desirable: One of the goals of the common analyzer interface is to support analyzer tools beyond just Kythe indexers. (There is, or was, some language about that in the introduction, but it might want some more colour).

That said: The default handling of google.protobuf.Any is problematic in some important ways, and I would advise against extending our use of it.

Contributor:

Maybe if we can get more people piling on protocolbuffers/protobuf#5226 we can get a better resolution there (I'm not optimistic, but still...)


» {"jsonrpc":"2.0","method":"done","params":{"message":"success"}}

### Errors

Comment:

How does the analyzer communicate that it had an unrecoverable internal error?

Contributor Author (jaysachs):

The current proposal is that it terminates with a non-zero exit code. Do you see cases where an analyzer would want to abort the current analysis (e.g. due to unrecoverable malformed input) but continue to be able to request analyses, without restarting? Is that common enough that it warrants additional complexity?

Comment:

I had in mind more that the analyzer could return a stack trace or other rich description of the error condition, which could be cleaner than scraping stderr.

Contributor:

While some analyzers may be able to return structured data for internal errors quite a few of them will segfault and (if lucky) print a stack trace on stderr, so if we want to capture that we'll have to scrape stderr regardless.

Contributor:

There are different ways an analyzer can fail. If it fails under control, it can report an error in-channel along with the "done" message. If it fails out of control, however, the program will simply exit. The protocol needs to account for both.

Even when an analyzer tries to control failures, it can't always. Even in a managed runtime the OS may kill a process for exceeding resource limits or poor use of FFI. Moreover some compiler toolchains do not manage memory well (GHC, Python, Dart, and SBCL for some examples we've encountered in the past) and will pile up state and garbage if you re-use the analyzer without an exit in between.

So at some point you have to deal with the fact that the program either exits uncontrollably, or exits because it is simpler to exit and restart than to rewrite the compiler. I think the protocol should deal with that as simply as possible—at which point reading the exit code is probably the best we can hope for. Scraping stderr is probably not useful for an in-band message, though it makes sense to capture stderr for logging purposes.
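A driver taking this advice treats the analyzer's exit status as the one reliable failure signal. A minimal sketch, assuming a subprocess-per-analysis model (the function name and return strings are hypothetical):

```python
import subprocess
import sys

def run_analysis(cmd):
    """Run one analyzer process and classify how it ended.

    Out-of-control failures (segfault, OOM kill) surface only as the exit
    status, so anything nonzero means "restart the analyzer and move on".
    stderr is captured for logging only, not as an in-band protocol channel.
    """
    proc = subprocess.run(cmd, capture_output=True)
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # POSIX: negative returncode means death by that signal number.
        return f"killed by signal {-proc.returncode}"
    return f"exited with code {proc.returncode}"

assert run_analysis([sys.executable, "-c", "pass"]) == "ok"
assert run_analysis([sys.executable, "-c", "import sys; sys.exit(3)"]) == "exited with code 3"
```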

driver MAY support a subset of read-only directory operations, e.g. in order
to provide directory listings via `readdir(3)` or equivalent.

- For all other paths, the driver MAY fall back to the underlying host

Comment:

I may be misunderstanding: does this mean to imply that paths in the compilation record MUST NOT fall back to the underlying host? Or is it just saying that access to the filesystem is not necessarily limited to compilation record paths?

Contributor Author (jaysachs):

The latter. Does the "MAY" not convey that sense sufficiently? I'm open to other wording.

Comment:

I am perhaps being overly literal and getting myself confused. Since falling back to the underlying host is not mentioned above, one might infer it is not allowed. It may help to replace "For all other paths..." with "For paths that are available via the host's local filesystem..."

analyzers for which the best, or only, mechanism for clearing state is to
restart.

### Messages
Contributor:

(aside: it would be nice to specify the protocol using proto, even if the implementation is JSON-RPC)

Contributor:

Yeah, that's what kythe/proto/driver.proto was meant to represent, but I think it may be out-of-date with the current proposal.


@shahms commented Oct 28, 2020:

Jay, is this still relevant?

@jaysachs:

Yes it is, though very low priority.

@creachadair:

Yes it is, though very low priority.

Priority notwithstanding, I'd be interested in helping some version of this happen. I'd be happy to serve as a sounding-board for ideas if that would help.

7 participants