Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String suffix and prefix functions. #9533

Merged
merged 2 commits into from Jun 24, 2020
Merged

String suffix and prefix functions. #9533

merged 2 commits into from Jun 24, 2020

Conversation

bschommer
Copy link
Contributor

As I mentioned in PR #9446 I think it would be a good idea to have a function to check for prefix and suffix in the String module. Both ocaml-batteries as well as Base contain such a function and the module Filename implements a simplified version itself for the check_suffix method.

I implemented a first simple version of these functions but there are still some open points:

  • The name, ocaml-betteries uses starts_with and ends_with and Base uses is_prefix and is_suffix, which I personally find better.
  • If the functions should also be included in Bytes.

Any opinions on this?

@nojb
Copy link
Contributor

nojb commented May 4, 2020

I support it!

A couple of remarks:

  • It may be difficult to know which argument is which without looking at the docs. Am not sure what is the official policy, but a labelled argument for the suffix/prefix may help here.

  • Do we want to add a "ascii-case insensitive" optional argument to handle the common case when one wants to do the comparison without regards to (ASCII-)case ?

@bschommer
Copy link
Contributor Author

  • It may be difficult to know which argument is which without looking at the docs. Am not sure what is the official policy, but a labelled argument for the suffix/prefix may help here.

I think for the labeled version it is okay.

@dbuenzli
Copy link
Contributor

dbuenzli commented May 4, 2020

I think a label should be added in String aswell in this case.

I usually use affix which can then be used for String.is_{prefix,suffix,infix} and makes the signature of the three functions regular. See for example here.

  • Do we want to add a "ascii-case insensitive" optional argument to handle the common case when one wants to do the comparison without regards to (ASCII-)case ?

I would rather not complexify the interface of these simple functions. Lowercasing your datum before testing seems fine.

@alainfrisch
Copy link
Contributor

A related operation is to check for a given suffix/prefix and return the string with that suffix/prefix dropped (returning a string option). It is useful to have the versions that only checks for a suffix/prefix (to avoid the copy) as in this PR as well, but it would be good to have all the functions follow the same conventions, so it's worth discussing these as well. For them, being able to check modulo ASCII casing is even more important, since normalizing the input is not an option.

@dbuenzli
Copy link
Contributor

dbuenzli commented May 4, 2020

A related operation is to check for a given suffix/prefix and return the string with that suffix/prefix dropped (returning a string option).

Is it that important ?

Once you have checked the affix with the predicate function on the (possibly lowercased) datum it's just a matter of trimming the original datum by the length of your affix.

I rather have good cutting functions which are really lacking at the moment.

@alainfrisch
Copy link
Contributor

A related operation is to check for a given suffix/prefix and return the string with that suffix/prefix dropped (returning a string option).

Is it that important ?

Once you have checked the affix with the predicate function on the (possibly lowercased) datum it's just a matter of trimming the original datum by the length of your affix.

I rather have good cutting functions which are really lacking at the moment.

By the same argument, checking for prefix/suffix is "as simple" as cutting the string and comparing for equality. Avoiding length- and index-related calculation (and also copying) on the call side is the whole point, I think.

@dbuenzli
Copy link
Contributor

dbuenzli commented May 4, 2020

By the same argument, checking for prefix/suffix is "as simple" as cutting the string and comparing for equality.

Not really. It doesn't feel very natural to have to cut a string for checking for prefix/suffix and there are corner cases you'll have to deal with (like the datum is shorter than the prefix) if you tell people to go that way -- these corner cases do not exist once you have checked for affixness and say, simply drop_left the length of your prefix.

@bschommer
Copy link
Contributor Author

I added labels at least to the labeled version, I'm still unsure what the policy about labels for the non labeled modules so I left it out. Concerning the cutting functionality I have no strong opinion.

@@ -316,6 +316,14 @@ val equal: t -> t -> bool
(** The equal function for strings.
@since 4.03.0 *)

val is_prefix : t -> t -> bool
(** [String.is_prefix s pre] tests if [pre] is a prefix of [s]
Copy link
Contributor

@dbuenzli dbuenzli May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm concerned this is the wrong argument order. If I read String.is_prefix s0 s1 I expect the function to check that s0 is a prefix of s1. It's also an inconvenient order for partial application.

Containers' is_prefix has the order I suggest. Batteries and base do not have this order but I don't think it's a good idea to follow them on that one.

The same holds for is_suffix below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the current variant is consistent with the existing check_suffix function from Filename and consistent with the other search and check functions in the String module which all take the string on which to perform the actions are as first argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need at least one label here if we want the function, so that the order cannot be confused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the current variant is consistent with the existing check_suffix function from Filename

I don't think there's any need for consistency with that one. The name is sufficently different.

String module which all take the string on which to perform the actions are as first argument.

There's already split_on_char that does not follow this. Following what you mention results in something that reads wrong if applied unlabelled, please don't do this. Code reading is more important than the consistency (which never really exists in OCaml) you are looking for.

@alainfrisch
Copy link
Contributor

It doesn't feel very natural to have to cut a string for checking for prefix/suffix and there are corner cases you'll have to deal with (like the datum is shorter than the prefix) if you tell people to go that way

It doesn't feel very natural to me to lowercase a whole string to check for a given (small) prefix in a case-insensitive way...

As for the cutting "corner cases", one could provide cutting functions that don't fail (such as let prefix n s = String.sub s 0 (min n (String.length s))).

@alainfrisch
Copy link
Contributor

I'm in favor of using labels also for the "normal" versions (in String).

@dbuenzli
Copy link
Contributor

dbuenzli commented May 5, 2020

It doesn't feel very natural to me to lowercase a whole string to check for a given (small) prefix in a case-insensitive way...

At least it's a quick way to do it in a compositional way. Adding optional arguments for special casing the lowercase ASCII letters normalisation in a prefix checking function on bytes feels very ad-hoc to me.

It's again one of these very limited things you can't use half of the time because you have other normalisations to perform like transforming dashes to underscores or actually need to support the Unicode repertoire in a meaningful way. As a result, what you get is API cruft.

@gasche
Copy link
Member

gasche commented May 5, 2020

I believe we should drop the case-insensitivity idea, and agree to use a label in the standard version (let's be frank, no one uses the FooLabels modules), because otherwise the interface is too error-prone.

Personally I'm curious about the idea of having suffix-removing functions. It is natural to think of those functions as returning an augmented boolean, either a failure or the reduced string. So the general interface would be (string, unit) result or string option, or maybe string lazy option if we wanted to avoid computing the reduced string strictly. Having two functions, one returning bool for the no-need case and the other string option for the yes-need case is a natural reduction of this design, it gives a simpler API.

@bschommer
Copy link
Contributor Author

I added the label to string.mli and changed the order of arguments as suggested by @dbuenzli .

@bschommer
Copy link
Contributor Author

I rebased my branch, added @dbuenzli and @gasche as reviewer (since both are mentioned by Github in the reviewer field) and reverted the change to the Windows Filename suffix, since the Version using the new String.is_suffix function would require an explicit conversion of both parts of the filename which can be slower in the default use case of a suffix is short and long filename.

@bschommer
Copy link
Contributor Author

Is there anything missing or can this be merged?

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 4, 2020

I usually use affix which can then be used for String.is_{prefix,suffix,infix} and makes the signature of the three functions regular. See for example here.

I'm not sure I deeply stand behind this design but it still do like the regularity of the function signatures for certain higher order usages. This has not been discussed and since we will have to live with the functions forever it might be worth pondering.

(Don't take this as me being against the merge @bschommer, I'll be as glad as you to eventually have these functions in the stdlib).

@bschommer
Copy link
Contributor Author

I usually use affix which can then be used for String.is_{prefix,suffix,infix} and makes the signature of the three functions regular. See for example here.

I'm not sure I deeply stand behind this design but it still do like the regularity of the function signatures for certain higher order usages. This has not been discussed and since we will have to live with the functions forever it might be worth pondering.

(Don't take this as me being against the merge @bschommer, I'll be as glad as you to eventually have these functions in the stdlib).

I have no strong opinion on the signature of the two functions but the current version seems to me verbose enough to make clear how to use the functions.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 4, 2020

I have no strong opinion on the signature of the two functions but the current version seems to me verbose enough to make clear how to use the functions.

Verbosity was not my point. The point is uniformity of signatures for these kind of predicates while retaining reading clarity.

@bschommer
Copy link
Contributor Author

I have no strong opinion on the signature of the two functions but the current version seems to me verbose enough to make clear how to use the functions.

Verbosity was not my point. The point is uniformity of signatures for these kind of predicates while retaining reading clarity.

Okay I understand the problem. A quick look at other languages shows that most prefer the name ends_with orstarts_with in some form as ocaml-batteries does. Also base as well as ocaml-batteries have the arguments switched. The label seems to be either suffix|prefix, suf|pref or affix.

I would argue that having the suffix or prefix first makes more sense since it seems to me more common to pass such a function with the same prefix or suffix as predicate to for example List.exits.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 4, 2020

Sorry, I have myself argued above that the affix should be first so I thought that was evident.

I should have been more precise: I'm just questioning between having a uniform label (affix) or a specific label for each of the functions because you then get at least three functions (if you have is_infix) that share an interface for higher order usages (not sure I can convince myself that it's that useful though).

Regarding ends_with and starts_with I like the names but do they have one for is_infix ?

@bschommer
Copy link
Contributor Author

Batteries has count_string that counts the number of occurences, Base has is_substring and Containers has find as well as is_sub. For other programming languages, Haskell has isInfix and most other languages either have contains or find.

Concerning affix, I think that prefix and suffix are more descriptive and would tend to sacrifice that the two functions have the same signature for this.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 4, 2020

Thanks @bschommer. So to sum up here are mostly equaly sound alternatives to me:

(* 1 *)
val contains : string -> string -> bool
val starts_with : string -> string -> bool
val ends_with : string -> string -> bool 

(* 2 *)
val is_infix : affix:string -> string -> bool
val is_prefix : affix:string -> string -> bool 
val is_suffix : affix:string -> string -> bool 

(* 3 *)
val is_infix : infix:string -> string -> bool 
val is_prefix : prefix:string -> string -> bool 
val is_suffix : suffix:string -> string -> bool 

I'm a bit unsure and don't have a strong opinion but I may be leaning a bit towards 1. which despite the seemingly less regular naming scheme reads quite well in my opinion.

@bschommer
Copy link
Contributor Author

Thanks @bschommer. So to sum up here are mostly equaly sound alternatives to me:

(* 1 *)
val contains : string -> string -> bool
val starts_with : string -> string -> bool
val ends_with : string -> string -> bool 

(* 2 *)
val is_infix : affix:string -> string -> bool
val is_prefix : affix:string -> string -> bool 
val is_suffix : affix:string -> string -> bool 

(* 3 *)
val is_infix : infix:string -> string -> bool 
val is_prefix : prefix:string -> string -> bool 
val is_suffix : suffix:string -> string -> bool 

I'm a bit unsure and don't have a strong opinion but I may be leaning a bit towards 1. which despite the seemingly less regular naming scheme reads quite well in my opinion.

I would prefer either 1 or 3, but have no strong opinion. Concerning the contains function, I would welcome the addition of such a function, but I afair there already was a PR for this and it was rejected since the function is already contained in some form in the Str module.

@alainfrisch
Copy link
Contributor

I like the function names in 1, but I dislike the lack of labels to distinguish the argument. As a predicate, it is more natural to have the "affix" string as the first argument (List.filter (contains "word") l), but as a "binary operator with a prefix notation", one would expect the opposite ordering (let (+?) = contains in if s +? word then ...). What about keeping the names as 1, but with a labeled argument?

E.g.:

val contains: pattern:string -> string -> bool
val starts_with: pattern:string -> string -> bool
val ends_with: pattern:string -> string -> bool 

@gasche
Copy link
Member

gasche commented Jun 5, 2020

contains foo bar reads as "foo contains bar" (same with all names in "option 1"), so it doesn't really make sense to have the pattern argument first. If you want the pattern first for partial application purposes, you should rather vote for 2 or 3, and/or consider a fully-labelled API -- but then the "naturalness" of the function names matters less.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 5, 2020

Concerning the contains function, I would welcome the addition of such a function, but I afair there already was a PR for this and it was rejected

I think it should be added is @alainfrisch against ?

contains foo bar reads as "foo contains bar" (same with all names in "option 1"), so it doesn't really make sense to have the pattern argument first.

Funny. We don't read the same way. When I read String.starts_with s0 s1, I parse String.starts_with s0 and then apply it to s1.

I don't mind 1) with labels but the more I think about it the more I'm in favour of uniform labelling.

I was never really happy with my affix but pattern is slightly misleading in my eyes in that it suggests a regexp argument to me. It's not a pattern it's really a substring.

strstr(3)-like functions often refer to the thing to be searched as the "needle". I think it would be a good fit on contains but may feel a bit odd on starts_with and end_with:

(* 5 *)
val contains: needle:string -> string -> bool
val starts_with: needle:string -> string -> bool
val ends_with: needle:string -> string -> bool 

or if we drop the uniformity:

(* 6 *)
val contains: needle:string -> string -> bool
val starts_with: prefix:string -> string -> bool
val ends_with: suffix:string -> string -> bool 

@bschommer
Copy link
Contributor Author

bschommer commented Jun 5, 2020

I'm not sure if one does not want to have a contains of the form:

val contains: needle:string -> string -> int option

Returning Some index were index is the index of the first occurence of needle and None otherwise. Then we also would need to add rcontains for searching the last occurence.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jun 5, 2020

Well in fact val contains : string -> char -> bool already exists...

@bluddy
Copy link
Contributor

bluddy commented Jun 5, 2020

I definitely agree with the naturalness of contains/starts_with/ends_with. For one thing, it removes the need for the user to know the definitions of prefix/suffix/infix. However, needle is a really unintuitive label IMO. pattern is decent, if longer than I'd like. sub or substr would be clear enough. substring is longer than I'd like, but the concept of a substring matches these functions better than a pattern (again, IMO).

@bschommer
Copy link
Contributor Author

I renamed the functions to starts_with and ends_with but kept suffix and prefix as the labels for the arguments. I think a unified name would be welcome however neither needle nor pattern does convince me.

@bschommer
Copy link
Contributor Author

I rebased the PR to avoid the conflict with the Changes file and I think there has been some kind of agreement on the API, so from my side this seems ready to be merged.

Copy link
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good to me, and indeed the discussion seems to have converged. Unless someone objects, I propose to merge by end of day today.

Changes Show resolved Hide resolved
The functions test if the second argument is a prefix or suffix of the
first argument.
@nojb nojb merged commit 34dbc54 into ocaml:trunk Jun 24, 2020
@nojb
Copy link
Contributor

nojb commented Jun 24, 2020

Merged. Thank you very much for your efforts @bschommer!

@bschommer bschommer deleted the string-suffix-prefix branch June 24, 2020 17:21
@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 2, 2020

Well in fact val contains : string -> char -> bool already exists...

If someone once wants to add the infix version I'd suggest includes which is what JavaScript uses (other than that it also uses startsWith and endsWith).

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 2, 2020

And rereading the whole thing I think maybe we should have gone with the sub unifying label:

val includes: sub:string -> string -> bool
val starts_with: sub:string -> string -> bool
val ends_with: sub:string -> string -> bool 

Which I personally find better than:

val includes: sub:string -> string -> bool
val starts_with: prefix:string -> string -> bool
val ends_with: suffix:string -> string -> bool 

@bschommer
Copy link
Contributor Author

And rereading the whole thing I think maybe we should have gone with the sub unifying label:

val includes: sub:string -> string -> bool
val starts_with: sub:string -> string -> bool
val ends_with: sub:string -> string -> bool 

Which I personally find better than:

val includes: sub:string -> string -> bool
val starts_with: prefix:string -> string -> bool
val ends_with: suffix:string -> string -> bool 

I'm not sure what the policy is here, but since these function are not officially release one still could change the signature. I agree that sub might be a better label. Concerning includes, I think there was already a discussion in the past to add this, I think the conclusion was to just use the Str module.

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 3, 2020

I'm not sure what the policy is here, but since these function are not officially release one still could change the signature.

I don't think there's anything wrong with changing it. Precisely because it's not yet out in the wild. Once the latter occurs we'll be stuck with the names forever.

I agree that sub might be a better label.

It may be a little bit less precise but I think that from a code reading perspective it works equally well to disambiguate which is which and is a bit more ergonomic for the code writer (a single consistent name to remember, enables higher-order usages).

in the past to add this, I think the conclusion was to just use the Str module.

Str is not part of the stdlib and this was also argued at some point for functions which in the end were still integrated in the stdlib (e.g. String.split_on_char).

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 3, 2020

@bschommer If you don't feel like it I'm not suggesting you do the work for String.includes. However I do think we should at least change the labels before 4.12 though. If you can't do it, I'll manage to find sometime.

@bschommer
Copy link
Contributor Author

Concerning String.includes, I currently don't have enough time to implement it.

let starts_with ~prefix s =
let len_s = length s
and len_pre = length prefix in
let rec aux i =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will allocate a closure
extlib uses the following implementation : https://github.com/ygrek/ocaml-extlib/blob/master/src/extString.ml#L43

@ygrek
Copy link
Contributor

ygrek commented Dec 28, 2020

ftr extlib uses starts_with string prefix

@ygrek
Copy link
Contributor

ygrek commented Dec 28, 2020

While updating extlib for 4.12 I notice that for the people used to extlib order of arguments (and/or in the code that uses extlib) this change will definitely cause bugs (esp. given that "label omitted" warning is disabled by default).

@ygrek
Copy link
Contributor

ygrek commented Dec 28, 2020

Is it possible to change to String.starts_with s ~prefix?

@xavierleroy
Copy link
Contributor

See my comment #10105 (comment) . Too late to change in 4.12, but perhaps it is still time to not include these functions in 4.12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants