docs(rfc): An alternate RFC for common analyzer/driver interface. #3339

Open, wants to merge 17 commits into base: master
Conversation


@jaysachs jaysachs commented Dec 21, 2018

This document serves to satisfy issue #3306, and is an evolution of RFC 2966. It defines the basic terminology, channel framing, message formats, and methods by which a driver and an analyzer communicate uniformly via pipes or other reliable byte streams for the "control plane". File contents and analyzer output are handled via the analyzer's filesystem, presumably virtualized to some degree by the driver. It furthermore attempts to generalize to less Kythe-specific applications.
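To make the control-plane idea concrete, here is a minimal sketch assuming newline-delimited JSON-RPC 2.0 messages over the byte stream. The `done` notification is quoted later in this thread; the helper names and the newline framing are illustrative assumptions, not the RFC's definitions.

```python
import json

def encode_message(method, params, msg_id=None):
    """Frame one JSON-RPC 2.0 message as a single newline-terminated line.

    Newline-delimited framing is an assumption for illustration; the RFC
    itself defines the actual channel framing.
    """
    msg = {"jsonrpc": "2.0", "method": method, "params": params}
    if msg_id is not None:
        msg["id"] = msg_id
    return json.dumps(msg, separators=(",", ":")) + "\n"

def decode_message(line):
    """Parse one framed line back into a message dict."""
    msg = json.loads(line)
    assert msg.get("jsonrpc") == "2.0"
    return msg

# The "done" notification quoted later in this thread:
line = encode_message("done", {"message": "success"})
assert decode_message(line)["method"] == "done"
```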

@jaysachs jaysachs changed the title An alternate RFC for common analyzer/driver interface. docs(rfc): An alternate RFC for common analyzer/driver interface. Dec 21, 2018
@schroederc schroederc requested a review from zrlk December 21, 2018 22:48
as well as the output file to which to emit analysis output. It
reflects some of the kythe.proto.CompilationUnit structure. The
driver (and driver environment) MUST make contents available for
each file referenced in the compilation record at the file system
Contributor:

This is actually a little tricky. The compilation unit records physical paths (in the info field), as they were spelled by the build system at the moment when the data were captured. Those paths often assume the existence of a file tree with symlinks pointing outside the repository root.

To actually set that up you'd need to solve a set of constraints: Your virtual (or actual) filesystem root has to be high enough that all the up-pointers have something to traverse, and that down-steps are correctly populated within that hierarchy.

If the analyzer already virtualizes file access, then this is easy: You just map the path it requests directly to the data in the compilation record. But allowing the analyzer to access the filesystem through the system API means you have to actually create those structures (links, directories, etc.) and get their access permissions semi-right.

This isn't a dealbreaker, but it is an aspect of this design that requires much more detail than the old one needed.

Contributor Author (jaysachs):

I want to make sure I understand your point here. If you're "simply" saying that the driver may need to virtualize an arbitrary portion of the filesystem, I agree. That's effectively what the in-analyzer file system virtualization is doing.

What I (naively) imagined is that we'd set up the virtualization mapping precisely for every path mentioned in the compilation record. I believe(d) that this is independent of symlinks, or permissions -- that the driver virtualization gets first crack at any path.

If that's not the case, i.e. the driver is further downstream, then I agree there are some more subtleties here.
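The "virtualization mapping set up precisely for every path mentioned in the compilation record" can be sketched as a plain dict lookup, assuming the analyzer mediates its own file access (the easy case discussed above). The class and method names here are hypothetical, not part of the RFC.

```python
class CompilationVFS:
    """Maps the exact paths recorded in a compilation record to file bytes.

    This treats virtualization as a simple map lookup, with no symlink,
    permission, or stat semantics involved: the driver populates the map
    from the compilation record, and the analyzer asks for recorded paths.
    """

    def __init__(self, required_inputs):
        # required_inputs: iterable of (recorded_path, content_bytes) pairs.
        self._files = dict(required_inputs)

    def open_bytes(self, path):
        """Return the recorded content for path, or fail as the OS would."""
        try:
            return self._files[path]
        except KeyError:
            raise FileNotFoundError(path)

vfs = CompilationVFS([("src/main.cc", b"int main() { return 0; }")])
assert vfs.open_bytes("src/main.cc").startswith(b"int main")
```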

Contributor:

[T]he driver may need to virtualize an arbitrary portion of the filesystem

It's a bit worse than that, though that is the basic problem.

Compilation records in general include paths that are aggregated from multiple locations in the workspace, and are accessed via symlinks and relative paths. To virtualize this correctly, the driver must compose an overlay view of multiple filesystems as they were organized at the time of extraction, faithfully implement symlinks and permissions to the point where the VFS will return correct results to the kernel, and ensure that the process working directory for the analyzer is set up correctly.

In other words, you basically have to recreate a container image of the original compiler execution. You can of course do so, but at that point it seems like you're better off just using an image (e.g., Docker) with layered dependencies. Trying to shim the filesystem calls with (say) gVisor is a lot of work and will, I predict, be a maintenance nightmare.[1] Using Docker images would obviously be a better solution for the way the open-source world builds, but it will make life harder for Google-internal builds, which don't expose any of that.

Having the analyzer mediate file lookups is actually much simpler and easier to get right, though as you pointed out it means each analyzer has to do it. Sometimes (usually?) that can be simplified with libraries, since often the analyzer doesn't need filesystem semantics at all, just a mapping from path to data. In Clang, for example, this is just an in-memory string-to-string map (albeit with a bad API), and once it's set up the analysis is self-contained.

Just in terms of maintenance, I'd argue it would be better to give up on providing files "on demand" than to take on ownership of a filesystem shim sophisticated enough to recreate the execution environment at extraction time. Image construction and management is a big problem and isn't really the Kythe project's concern. At this level of detail we should use off-the-shelf tools rather than rolling our own.

What I don't fully understand yet is how much additional context that would require us to carry around between the build system and the analysis tools: The existing implementation was intended to minimize the amount of context that has to be propagated, but that may not be an important enough goal to keep.

[1] I have a lot of experience with this kind of code, and would be glad to discuss it in more detail offline if you like. :)

Contributor Author (jaysachs):

I've spent a lot of time thinking about this, and I don't see the issue. But I'm new here, so clearly there's plenty of room for ignorance and misunderstanding ...

The compilation record stores, along with the byte content, the actual path (possibly relativized, and maybe bastardized for e.g. jar files) requested from the host OS, i.e. before any symlink/hardlink/overlay/blah is traversed/resolved. It's precisely that path that the virtualization layer will intercept requests for. Why is there anything to worry about here?

Contributor:

Why is there anything to worry about here?

When you virtualize the filesystem interface, you have to participate in path traversal and fulfill stat requests. Permissions, link counts, file types, and so on will need to be made "right" (or at least "right enough") to avoid spurious and hard-to-debug failures.

As far as the content goes, there isn't any problem. But whereas explicitly virtualizing the lookup interface inside the analyzer lets you treat this as a simple map lookup, emulating the real filesystem handlers requires much more mechanism. You're right, of course, that this can be done; my point is that the spec will need to be much more explicit about the rules. When it goes wrong, the failure modes arising from bad stat results can be subtle.

As a case in point: GNU find trusts that the link counts it reads from the stat record are correct, and will only traverse a prefix of a directory whose link counts are wrong.
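For context on why those link counts matter: on a classic Unix filesystem a directory's `st_nlink` is two plus its subdirectory count, and GNU find's leaf optimization trusts that arithmetic to stop recursing early. A sketch of the invariant a faithful VFS would have to report:

```python
def expected_dir_nlink(subdir_names):
    """Link count a directory carries on a traditional Unix filesystem:
    one link for the directory's own name in its parent, one for its '.'
    entry, plus one per subdirectory's '..' entry. GNU find subtracts 2
    and stops descending once it has seen that many subdirectories, so a
    VFS reporting a smaller value makes find silently skip entries."""
    return 2 + len(subdir_names)

assert expected_dir_nlink([]) == 2
assert expected_dir_nlink(["a", "b", "c"]) == 5
```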

Contributor:

I wonder how much this actually occurs in practice, though.

I don't know. But I have seen some compilers do some odd things with the filesystem in order to obtain locking, memory-map files, and so on. Hopefully most of that stuff is ancillary to reading input files and can be ignored—but I do think this spec will need to spend some more time laying out the rules for what the virtualization promises (and thus what the analyzer can rely on). When we were going to make the analyzer figure out how to inject itself into the I/O mechanism of the underlying compiler explicitly, these things didn't need much discussion.

Contributor:

Implementing any and all of the weird POSIX or non-POSIX-but-damn-the-torpedoes corner cases expected by compilers-used-to-implement-analyzers is unlikely to be feasible. In the real world, I expect most compilers will fall into one of three categories:

  1. Compilers which do weird corner-casey-crap but have a Filesystem abstraction which can be implemented by the analyzer to not do so (or use a custom protocol directly as would be required by RFC2966).
  2. Compilers which do straightforward create/write open/read operations (but have no filesystem abstraction)
  3. Lost causes

Both this RFC and RFC2966 can support 1 with roughly the same amount of work, but this (or something similar) is required for 2. Given that, I'd be very tempted to piggyback on the operations supported by something like 9P2000, if only because there is a broad set of implementations across platforms and it has better-specified (and more limited) semantics. The "file operations" specified by RFC2966 are unlikely to be sufficient for pretending to be a filesystem. That said, I suspect including an RPC for emitting analysis output, rather than requiring the filesystem to be used for output as well as input, will be simpler and sufficient.

The driver will also have some limited ability to rewrite the paths it sends to the analyzer. It should also have a pretty good idea of what files the analyzer may try to open as it shouldn't be opening anything not in the CU.

Contributor Author (jaysachs):

I'll admit that the main reason I went with the filesystem for output rather than RPC is pushback on multiple encodings. I don't have a strong reason for pushing the filesystem as long as the output encoding can be negotiated (and efficiently supports length-delimited protos in addition to JSON), except that it keeps the RPC protocol simple and pure text.

Contributor:

It is certainly true that using a filesystem for output is a separate concern from using a virtual filesystem for input. I think the latter is the better-motivated case, and doesn't strictly need to be tied to the former.

(edit: fix typo)

Contributor:

Either way though, my main point here was that the spec should probably say more about what the VFS will provide.


- `vname(VNameRequest)`

Requests the complete VName to emit for the specified path and signature.
Contributor:

Could detail a bit in which situation is this useful? How does the analyzer get hold of a path+signature, but can't construct the VName itself?

Later example indicates the analyzer doesn't need to know the corpus/root upfront anymore. Is that the motivation?

Contributor Author (@jaysachs, Jan 16, 2019):

Yes, it relieves the analyzer from having to thread through those two fields. More significantly, though, it opens up the possibility to handle the case when semantic entities "belong" to some other corpus and/or root: one could imagine wanting to associate files of dependencies with some "well known" corpus name, or that build-time generated files should have some specific root. In these cases, some component needs to "understand" the mapping of path to corpus/root, and it makes sense to insulate the myriad indexers from how to determine that mapping and encapsulate the implementation in the driver.
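The driver-side encapsulation described here could answer `vname()` requests from a longest-prefix rule table mapping paths to corpus/root. The sketch below is purely illustrative; the rule contents, defaults, and function names are hypothetical, not part of the RFC.

```python
def resolve_vname(path, rules, default=("main-corpus", "")):
    """Return (corpus, root) for a path via longest-prefix match.

    The point is that the driver owns this mapping, so individual
    indexers never hard-code corpus or root values themselves.
    """
    best_len, result = -1, default
    for prefix, corpus, root in rules:
        if path.startswith(prefix) and len(prefix) > best_len:
            best_len, result = len(prefix), (corpus, root)
    return result

# Hypothetical rules: dependencies get a well-known corpus, generated
# files get a distinct root.
rules = [
    ("third_party/", "well-known-deps", ""),
    ("bazel-out/", "main-corpus", "genfiles"),
]
assert resolve_vname("third_party/zlib/zlib.h", rules) == ("well-known-deps", "")
assert resolve_vname("bazel-out/foo.pb.h", rules) == ("main-corpus", "genfiles")
assert resolve_vname("src/app.cc", rules) == ("main-corpus", "")
```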

Contributor:

The analyzer was already freed of the need to reason about vname mapping by the compilation record. As I understand it, the real change here is providing that information dynamically during the analysis instead of ahead-of-time. That's actually an interesting change, and I'm not entirely clear on the motivation for it.

We talked about various ways to solve this issue in the original implementation, and settled on making the build system capture that knowledge in the CU (that's what the vname fields are for, in the CU and its required inputs). I'm not sure what the argument for requesting it on the fly is, but it's a viable approach.

That said: We should probably discuss that question.

Contributor Author (jaysachs):

Right, that was too early in the AM for a reply. The per-file VName still would be present in the CU proto. The real reason for this API is that with the files being read from the file system, the only remaining motivation to expose the raw compilation unit mapping (FileInput) to the analyzer would be to supply the VName overrides. Thus the introduction of this API to encapsulate things. While it does permit dynamic non-CU-supplied VName construction, that wasn't its primary intention.

Contributor:

Ah! Yes, that makes sense.

Contributor:

It also turns out that it's incredibly common to get this mapping wrong, particularly for generated files and (presumably) files from a different corpus. Indexer authors routinely emit references using the raw "path" disregarding the CU/required_input VName, often up to hard-coding various values for "corpus" as well. Abstracting this behind an interface should substantially help with this problem.

That said, I think the generic vname() interface may still be a bit too low-level. There are a variety of other node kinds which have structural, factual or other VName requirements (tapp comes to mind), so it might be nice to express those requirements via an RPC/proto interface in some fashion.

But file/anchor VNames are the low-hanging fruit here.

The output file should contain the content appropriate to the
analysis-type, encoded as specified in the `InitRequest`. For Kythe index
analyses, this would be `kythe.common.Entry` objects. Other analyzers might
emit a "findings" structure.

Comment:

This freedom is going to add complexity. Maybe the canonical output should be a gRPC Any proto? Or a finding proto with an Any field that can be used to encode Entries, etc.?

Contributor Author (jaysachs):

Where do you see the complexity? I'd expect the driver would know a priori what type of data to expect.

Contributor:

This freedom does add complexity, but of a kind that is probably desirable: One of the goals of the common analyzer interface is to support analyzer tools beyond just Kythe indexers. (There is, or was, some language about that in the introduction, but it might want some more colour).

That said: The default handling of google.protobuf.Any is problematic in some important ways, and I would advise against extending our use of it.

Contributor:

Maybe if we can get more people piling on protocolbuffers/protobuf#5226 we can get a better resolution there (I'm not optimistic, but still...)


» {"jsonrpc":"2.0","method":"done","params":{"message":"success"}}

### Errors

Comment:

How does the analyzer communicate that it had an unrecoverable internal error?

Contributor Author (jaysachs):

The current proposal is that it terminates with a non-zero exit code. Do you see cases where an analyzer would want to abort the current analysis (e.g. due to unrecoverable malformed input) but continue to be able to request analyses, without restarting? Is that common enough that it warrants additional complexity?

Comment:

I had in mind more that the analyzer could return a stack trace or other rich description of the error condition, which could be cleaner than scraping stderr.

Contributor:

While some analyzers may be able to return structured data for internal errors quite a few of them will segfault and (if lucky) print a stack trace on stderr, so if we want to capture that we'll have to scrape stderr regardless.

Contributor:

There are different ways an analyzer can fail. If it fails under control, it can report an error in-channel along with the "done" message. If it fails out of control, however, the program will simply exit. The protocol needs to account for both.

Even when an analyzer tries to control failures, it can't always. Even in a managed runtime the OS may kill a process for exceeding resource limits or poor use of FFI. Moreover some compiler toolchains do not manage memory well (GHC, Python, Dart, and SBCL for some examples we've encountered in the past) and will pile up state and garbage if you re-use the analyzer without an exit in between.

So at some point you have to deal with the fact that the program either exits uncontrollably, or exits because it is simpler to exit and restart than to rewrite the compiler. I think the protocol should deal with that as simply as possible—at which point reading the exit code is probably the best we can hope for. Scraping stderr is probably not useful for an in-band message, though it makes sense to capture stderr for logging purposes.
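A driver taking this advice treats the analyzer's exit status as the one reliable failure signal. A minimal sketch, assuming a subprocess-per-analysis model (the function name and return strings are hypothetical):

```python
import subprocess
import sys

def run_analysis(cmd):
    """Run one analyzer process and classify how it ended.

    Out-of-control failures (segfault, OOM kill) surface only as the exit
    status, so anything nonzero means "restart the analyzer and move on".
    stderr is captured for logging only, not as an in-band protocol channel.
    """
    proc = subprocess.run(cmd, capture_output=True)
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # POSIX: negative returncode means death by that signal number.
        return f"killed by signal {-proc.returncode}"
    return f"exited with code {proc.returncode}"

assert run_analysis([sys.executable, "-c", "pass"]) == "ok"
assert run_analysis([sys.executable, "-c", "import sys; sys.exit(3)"]) == "exited with code 3"
```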

driver MAY support a subset of read-only directory operations, e.g. in order
to provide directory listings via `readdir(3)` or equivalent.

- For all other paths, the driver MAY fall back to the underlying host

Comment:

I may be misunderstanding: does this mean to imply that paths in the compilation record MUST NOT fall back to the underlying host? Or is it just saying that access to the filesystem is not necessarily limited to compilation record paths?

Contributor Author (jaysachs):

The latter. Does the "MAY" not convey that sense sufficiently? I'm open to other wording.

Comment:

I am perhaps being overly literal and getting myself confused. Since falling back to the underlying host is not mentioned above, one might infer it is not allowed. It may help to replace "For all other paths..." with "For paths that are available via the host's local filesystem..."

analyzers for which the best, or only, mechanism for clearing state is to
restart.

### Messages
Contributor:

(aside: it would be nice to specify the protocol using proto, even if the implementation is JSON-RPC)

Contributor:

Yeah, that's what kythe/proto/driver.proto was meant to represent, but I think it may be out-of-date with the current proposal.


@shahms commented Oct 28, 2020:

Jay, is this still relevant?

@jaysachs:

Yes it is, though very low priority.

@creachadair:

Yes it is, though very low priority.

Priority notwithstanding, I'd be interested in helping some version of this happen. I'd be happy to serve as a sounding-board for ideas if that would help.

7 participants