Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add primitive string type. #18973

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

rlepigre
Copy link
Contributor

@rlepigre rlepigre commented Apr 24, 2024

Since refactoring Constr.t to combine primitive values into an intermediate type is not a clear win (see #17951), here is an attempt to add primitive char and string types by simply extending Constr.t with corresponding primitive value constructors a primitive string type by simply extending Constr.t with a corresponding value constructor.

Check list:

  • Added / updated test-suite.
  • documentation
  • Added changelog.
  • Opened overlay pull requests.

@rlepigre rlepigre requested review from a team as code owners April 24, 2024 08:17
@coqbot-app coqbot-app bot added the needs: full CI The latest GitLab pipeline that ran was a light CI. Say "@coqbot run full ci" to get a full CI. label Apr 24, 2024
@silene
Copy link
Contributor

silene commented Apr 24, 2024

Two preliminary remarks:

  • Do not add new VM opcodes (especially some that will cause an instant segfault). Use existing opcodes instead. For example, PString.cat can directly be mapped to CHECKCAMLCALL2.
  • It seems we do not gain much from having a primitive type char. It seems like using int would work just as well, if not better.

@rlepigre
Copy link
Contributor Author

@silene I don't understand the code related to VM opcodes, so I'd really appreciate help here. By the way, why are many of these opcode named CHECK*? Are they not meant to correspond to specific instructions? (Your suggestion makes me think they are not.)

About your second point (using int as char), I'm not sure that this is the right thing, since char builds in the fact that the integer value is less than 256. Also, having a specific char type would probably be better when it comes to extraction? In any case, let's see what others think.

@silene
Copy link
Contributor

silene commented Apr 24, 2024

The CHECK part means that they support open terms as input. (It happens that this is the most common behavior, and it would have been better to use a NOCHECK prefix for their uncommon counterparts, but what is done is done.) Some of them correspond to specific instructions, e.g., CHECKADDINT63; some of them work with arbitrary code, e.g., CHECKCAMLCALL.

In your case, the starting point would be to modify Vmbytegen.get_caml_prim, and from there, the other needed changes should flow naturally.

@silene
Copy link
Contributor

silene commented Apr 24, 2024

since char builds in the fact that the integer value is less than 256

Note that this is not even an intrinsic property of your type. It is just a consequence of axiom chr_code_id. If you were to replace char by int, then this axiomatic property would just move from axiom chr_code_id to axiom make_get_spec.

@rlepigre rlepigre force-pushed the br/prim-string branch 2 times, most recently from eae1f69 to 044fcbf Compare April 24, 2024 16:09
@JasonGross
Copy link
Member

What is the benefit of having primitive strings over using array int? The downside I'm concerned about is ad-hoc complication of the kernel and tcb. (If the primary upside is extraction of literals, maybe something can be added to extraction in general?)
On the performance side:

  • Are any of the operations asymptotically more efficient?
  • What are the constant factor speedups that we see for each operation?

@SkySkimmer
Copy link
Contributor

Strings are more space efficient (1 byte / character instead of 1 word / character).
You could recover some efficiency by using all the bits in an int instead of just the lower 8 but since primitive ints are 63 bits it doesn't fit well.

@rlepigre
Copy link
Contributor Author

@JasonGross using arrays of integers, you are either space-inefficient (by storing one character per word) or you need to work with an encoding which requires computation. We have several use-cases where what we want is a compact representation of names (for variables, functions, ...) from a source program. These names would typically be string literals in a (generated) Coq source file, and they would only be carried around during program verification (and also used as keys for lookups in a maps).

@rlepigre rlepigre force-pushed the br/prim-string branch 3 times, most recently from b9e9fd0 to d268cf9 Compare April 25, 2024 06:56
@silene
Copy link
Contributor

silene commented Apr 25, 2024

What are the constant factor speedups that we see for each operation?

I performed a small test by repetitively concatenating strings of various size until the total size reaches 100,000 characters. For native_compute, the speedup of having native strings would be between 300x and 400x. For vm_compute, it would be between 1000x and 3000x.

To improve the situation when working with strings, we could add a primitive concatenation operation on arrays. This time, the speedup for native_compute would only be between 40x and 50x, while for vm_compute, it would be between 25x and 50x.

So, in addition to the fact that arrays occupy 8x more memory than strings (but this is hardly relevant performance-wise, unless you are actually filling your memory with strings), the things to remember are the following ones:

  1. Arrays in Coq are persistent, so mutating them leaves a trail, which has a non-negligible cost.
  2. Mutating/copying arrays in OCaml triggers the GC barrier, contrarily to strings (and floating-point arrays), which again has a non-negligible cost.
  3. Iterating in the VM is damn slow.

Obviously, these numbers are only meaningful under the assumption that no mutating operations are ever added to a native Coq string type. If strings start offer mutations like arrays do, then the performance gains would be a lot less dramatic.

kernel/vmemitcodes.ml Outdated Show resolved Hide resolved
kernel/vmemitcodes.ml Outdated Show resolved Hide resolved
kernel/vmvalues.ml Outdated Show resolved Hide resolved
theories/Strings/PString.v Outdated Show resolved Hide resolved
@rlepigre rlepigre force-pushed the br/prim-string branch 2 times, most recently from 713376b to b057cb6 Compare May 3, 2024 16:03
@rlepigre rlepigre force-pushed the br/prim-string branch 3 times, most recently from 12b405b to 6ba97e8 Compare May 4, 2024 09:01
theories/Strings/PString.v Outdated Show resolved Hide resolved
theories/Strings/PString.v Outdated Show resolved Hide resolved
theories/Strings/PString.v Outdated Show resolved Hide resolved
theories/Strings/PString.v Outdated Show resolved Hide resolved
\/
(i + 1 <? length s1 = true /\
i + 1 <? length s2 = true /\
get s1 (i+1) <=? get s2 (i+1) = true)%uint63).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure one can derive False from this axiom.

Needless to say, this axiom is way too complicated for its own good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I'm too happy about this axiom, but do you have an alternative in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the axiom should be on the primitive compare not the derived lt.
Then maybe define to_list : string -> list char and the axiom says forall s s', compare_string s s' = compare_list compare_int (to_list s) (to_list s') (not sure if we have compare_list already defined)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether we have compare_list, but we do have String.compare. More generally, I think we should strive to express all our axioms in terms of fully-defined inductive functions. (This is something we had in the old days of retroknowledge and that we unfortunately lost.) And then, properties like transitivity would be trivial corollaries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The axiom was indeed wrong, but it should be fixed now. I'll look into the other possibilities when I have a minute.

@github-actions github-actions bot added the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label May 7, 2024
@coqbot-app coqbot-app bot removed the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label May 8, 2024
Copy link
Contributor

@proux01 proux01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. About the specification, I concur with Guillaume that it would be better to have it in terms of equivalence with an inductive implementation, like it's done for int and float.

interp/notation.ml Outdated Show resolved Hide resolved
kernel/cClosure.ml Outdated Show resolved Hide resolved
kernel/environ.ml Outdated Show resolved Hide resolved
kernel/nativevalues.ml Show resolved Hide resolved
kernel/retroknowledge.ml Outdated Show resolved Hide resolved
Comment on lines +51 to +58
Goal make 5 "a" = cat (make 2 "a") (make 3 "a").
Proof. lazy. syntactic_refl. Qed.

Goal make 5 "a" = cat (make 2 "a") (make 3 "a").
Proof. cbn. syntactic_refl. Qed.

Goal make 5 "a" = cat (make 2 "a") (make 3 "a").
Proof. cbv. syntactic_refl. Qed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could also have tests for simpl and hnf.

Goal make 5 "a" = cat (make 2 "a") (make 3 "a").
Proof. cbv. syntactic_refl. Qed.

(* [vm_compute] *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C.f. above comment

Goal compare "a" "ab" = Lt.
Proof. vm_compute. syntactic_refl. Qed.

(* [native_compute] *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C.f. above comment

@@ -0,0 +1,236 @@
Require Import Uint63.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split this file into multiple ones (primitives, specs, axioms, lemmas) with minimal requirements otherwise it will be a nightmare the day we'll have to run primitive strings through the debuger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split the primitives to their own files, and same for the axioms. The rest is still in a single for for now, but still a work in progress.

theories/Strings/PString.v Outdated Show resolved Hide resolved
@rlepigre rlepigre force-pushed the br/prim-string branch 3 times, most recently from cc54007 to fcd48e8 Compare May 13, 2024 11:47
@github-actions github-actions bot added the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label May 22, 2024
Check (eq_refl : length (make (max_length+1) "a") = max_length).

Check (eq_refl : make 0 "a" = "").
Check (eq_refl : make 5 "a" = "aaaaa").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this is fine when it comes to testing the parsing but a printing test is probably also in order in test-suite/output (c.f. test-suite/output/FloatNumberSyntax.v for instance).

@proux01
Copy link
Contributor

proux01 commented May 23, 2024

So if I understand correctly, the only missing things here are the executable spec and the doc. FWIW if we want it into upcoming 8.20, this must be merged before June 17th.

@coqbot-app coqbot-app bot removed the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label May 25, 2024
Definition print (i : int_wrapper) : option string :=
if (i.(int_wrap) <? 256)%uint63 then Some (make 1 i.(int_wrap)) else None.
String Notation int_wrapper parse print : char63_scope.
Coercion int_wrap : int_wrapper >-> int.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the notation work, I had to use this coercion. Is that OK, or is there a better approach here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example of something that doesn't work? I can't reproduce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the test file I added has failures. Basically, using the notation gives a type error for me.

You can also try Check make 2 "c". at the very end of PrimString.v.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's not a printing issue, rather a parsing one. The coercion is probably fine, otherwise you can copy what's done in number notation to string notation so that you can register the string notation for type int rather than int_wrapper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs: full CI The latest GitLab pipeline that ran was a light CI. Say "@coqbot run full ci" to get a full CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants