Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop relying on location to track usage #8934

Merged
merged 11 commits into from
Mar 6, 2020
Merged

Stop relying on location to track usage #8934

merged 11 commits into from
Mar 6, 2020

Conversation

trefis
Copy link
Contributor

@trefis trefis commented Sep 11, 2019

Back in march, I announced:

I wrote a prototype [...] based on 4.07 last week [...] which I plan to rebase
onto trunk (and submit a PR) next week.

It was a very long week, but the non-prototype version is finally appearing as PR!

Motivation

Whilst not relying on location to track usage would allow us to freely update locations to get better error messages (cf. #7859, #1737) without triggering spurious warnings (cf. #7852), it is not the main motivation for this work!

The real reason is that locations are a terrible way to identify declarations: you can declare many different things, with the same name, at the same location! Ppxes do it all the time, and back in the days camlp4 used to do it too (the compiler still bears the scars).

Examples of missing warnings in the ppx era

  • First an innocuous one: ppx_bin_prot generates things like:

    let read_tag buf ~pos_ref:_ vint = vint in
    let read_whatever buf ~pos_ref = Bytes.get buf ~pos_ref in
    ...

    where buf is unused in first function, but shares the location of the buf in the second one, which is used.
    That's not too bad, since the actual fix was to rename the first one to _buf.

  • Slightly worse, here is an example from ppx_variants_conv which is due to a using String.lowercase instead of String.uncapitalize:

    type t = C_a | C_A [@@deriving variants]

    generates:

    val c_a : t = C_A
    
    module Variants :
      sig
        val c_a : t Variant.t Variant.t
        (* ... *)
        val map : t -> c_a:'a -> c_a:(t Variant.t Variant.t -> 'b) -> 'b
        (* ... *)
      end

    One can guess which constructor the value c_a refers to, but one shouldn't have to guess. Also, notice that Variants.map has two parameters with the same name and ignores one of them.

  • Finally, and probably the worse one, it is also possible to make the compiler believe that a value (or constructor is used).
    For instance, at Jane Street we have a ppx which given:

    type t = Foo of int [@@deriving the_ppx]

    generates something that looks like this:

    module The_ppx_result = struct
      type t = Foo of The_ppx_runtime_lib.some_type
    
      let something_something =
        Foo (The_ppx_runtime_lib.some_value)
    end

    Here both Foos are declared at the same location, and since the second one (The_ppx_result.Foo) is used, then so is the first one!

In this PR

The basic idea is to assign a unique identifier to each declaration, and use this uid to track usage.

However, it is not completely straightforward: there are situations where the typechecker generates several declarations based on the same item in the source (for instance declaring a class will declare a class, a class type, a type). And situations where the typechecker generates declaration for internal use only, never exposing them to the user (when approximating a module type for example).

In the first case, it makes sense to give the same id to all the declarations: they are all generated by the same source declaration, using any of them means that the source declaration is used.
In the second case, we never want to report any warning on these, we don't intend for them to be used by the user, so we mark them as internal.

There are things one can turn to in the existing code when deciding whether to mint a new id or not: what location is attached to the declaration? Do we attach some checks to the declaration?

These aren't necessarily always right (in the sense that even though we might not want to emit a warning, we might still want to properly identify a declaration), but some preliminary PRs (#8885, #8891) should have made the system more consistent, and this PR easier to review.

Third party uses of uids

Using these new ids, it becomes fairly straightforward to build a tool that will do an usage analysis at the level of a library / binary, or even of a workspace.
(Note: I have heard rumors of teams at big tech/advertising companies doing such analyses by concatenating all their ml files into one, and relying on the usual compiler warnings.)

We built a prototype of such a tool at Jane Street (where the "concatenate all the files" approach clearly wouldn't have worked), using the initial version of the patch presented in this PR.

Since then, the Trustworthy Refactoring team published Characterising Renaming within Ocaml's Module System, which is a very nice presentation (and formalisation) of (essentially) the same problem. Their presentation identifies declarations with a location, but it is my impression that these are actually closer to the ids introduced by this PR, than to our Location.t (which, as discussed above, can't be trusted to identify a declaration).
It also seems fair to assume that their tool (rotor) could be simplified in places by using these ids; this essentially amounts to letting the compiler do the "binding resolution", which they currently have to recompute.

Keeping these use cases in mind can also help decide whether minting a fresh id would make sense even though the location is Location.none


Note: this is based on #8891 (itself based on #8908).

typing/types.ml Outdated
| Predef of string

let mk =
let id = ref (-1) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think the generator should be reset (or otherwise made local to the compilation_unit), so that compiling multiple units in a single invocation of the compiler produces the same .cmi files.

  • It would be better to provide custom equality and hash functions rather than relying on the generic operations(through Hashtbl in Env).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack for reset, I'll push something for that.

AS for custom operations: I'm fine with providing a custom equality function, but for hashing, I can't see myself doing anything better the generic hash function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, there should be local compare and equal functions and an appropriate application of Identifiable.Make, and use the result of that in Env.

I personally would prefer if ids were local to a compilation unit.

Do you have an opinion on whether to make a shortcut for Uid.mk ~current_unit:(Env.get_unit_name ()) ?

@Drup
Copy link
Contributor

Drup commented Oct 9, 2019

if we ignore wanting the freedom to give custom semantics and a different operation set, why are Ident.t insufficient for your purposes ? They are supposed to be unique per compilation unit (and they already have some handling of predef/persistent ids).

@trefis
Copy link
Contributor Author

trefis commented Oct 9, 2019

@Drup : there is more than one way in which to understand your question. I'll try to answer some of them.
One is:

Why aren't we just using Ident.t as keys for the "usage tables" instead of Location.t * string?

There are actually several reasons why that doesn't work. A very boring one is because when you use Foo.Bar.baz, which you want to count as an use of baz, you don't have baz' ident at hand, you have Foo's.
More fundamentally: there isn't such a thing as "baz' ident".

Another way to understand your question could be:

Why do you need a different uid, why can't you just take the ident that was first used to enter the declaration in the environment, and put it inside the declaration (like you're currently putting Uid.ts)?

Someone has actually asked me something along these lines offline. And the answer here is: I could do that (use the Ident.t that is), it would indeed work but... why would I do that?
I don't think it'd come out as cleanly. Also, one would have to be careful that Subst doesn't touch these particular idents.

@Drup
Copy link
Contributor

Drup commented Oct 9, 2019

I was thinking about the former, yes. The remark was that Uid.t and Ident.t are basically the same datastructure, and we already have some machinery for the idents. Your point about subst (and other related ident-modifying passes) is valid though. Having to deal with two "use" of idents, and how to make sure we only touch the rights one sound like a nightmare.

The problem then is that we basically have to carry two ids everywhere. I'm not sure how I feel about this.

@trefis
Copy link
Contributor Author

trefis commented Oct 9, 2019

I had a brief (offline) discussion with @alainfrisch about this PR, where he suggested using an "abstract location" (IIUC that would reflect a position in the AST rather than in the source) instead of an integer for these uids.
The idea being that uids might be more stable (for instance in the context of merlin) with such a representation.

What I told him then was roughly:

Perhaps. I would have to think about it. However, given that the type is abstract, it's a change that we can make after the fact, and I don't think we should block the PR for this question.

And I still stand by that last point.
(I have however given it a bit more thought, and I don't think that these "abstract locations" would be more stable than an integer. But they will be more costly to build)

@trefis trefis changed the base branch from pr8891 to trunk October 14, 2019 10:13
@trefis trefis marked this pull request as ready for review October 14, 2019 10:28
@trefis
Copy link
Contributor Author

trefis commented Feb 26, 2020

Quoting @damiendoligez from #8987 (comment):

I would say your choice was the right one and indeed using locations for computing unused variables was a mistake.

seems like the right way to revive this PR!

Would someone be willing to review this? @alainfrisch, @Drup: would one of you care to volunteer? If you still have doubts / questions, perhaps we could meet in person to discuss them.

Copy link
Contributor

@Drup Drup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, another PR that I though we had merged already. :)

I like the general approach and I think it works well. The code is clearly cleaner in many places, and when it isn't, it's orthogonal and will be addressed in later PRs. I'm pretty sure we could also reuse those Uids for other purposes as well. I think we should merge this.

@@ -648,7 +647,8 @@ let strengthen =
aliasable:bool -> t -> module_type -> Path.t -> module_type)

let md md_type =
{md_type; md_attributes=[]; md_loc=Location.none}
{md_type; md_attributes=[]; md_loc=Location.none
;md_uid = Uid.internal_not_actually_unique}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could remove this function. It doesn't make much sense anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was there before, and is still used in a few places.
I agree it'd be good to get rid of it, but let's perhaps do that in a separate PR?

typing/types.ml Outdated
| Predef of string

let mk =
let id = ref (-1) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, there should be local compare and equal functions and an appropriate application of Identifiable.Make, and use the result of that in Env.

I personally would prefer if ids were local to a compilation unit.

Do you have an opinion on whether to make a shortcut for Uid.mk ~current_unit:(Env.get_unit_name ()) ?

exp_extra = [];
exp_type = ty;
exp_env = env }
) body tunpacks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function is 1) not pretty 2) a Frankenstein of the old wrap_unpacks and the relevant part of type_expect_. Fortunately, you already fixed that in #8935 ! At least the correctness is rather easy, since it's mostly copy-pasting, so we can push those consideration for later PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, it would be nice to at least avoid the code duplication if possible.

@trefis
Copy link
Contributor Author

trefis commented Feb 27, 2020

I just rebased this on top of trunk, there were some minor conflicts on typecore, and some bigger conflicts on predef.
The testsuite passes so I think I resolved everything correctly, but perhaps it would be sensible to double check the diff on predef.

I agree, there should be local compare and equal functions and an appropriate application of Identifiable.Make, and use the result of that in Env.

Done.

I personally would prefer if ids were local to a compilation unit.

Ident.ts are not, and there should be a lot more Ident.ts than Uid.ts, so is this really necessary?

Do you have an opinion on whether to make a shortcut for Uid.mk ~current_unit:(Env.get_unit_name ()) ?

Not really, the current code seems fine to me.

@trefis
Copy link
Contributor Author

trefis commented Mar 5, 2020

Seems I had

  • misunderstood the comment about making uids local to a compilation unit
  • misunderstood Ident.reinit when I read it last week.

Anyway, I've now added a reinit function in Types.Uid, and I think this is ready for merging.

Copy link
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a cursory look and have a small suggestion, but really I'm approving on behalf of @Drup.

typing/env.ml Outdated Show resolved Hide resolved
@damiendoligez damiendoligez self-assigned this Mar 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants