Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

kit-ty-kate
Copy link
Member

I recently had the need for a List.take function and no such function was available in the standard library.
This isn't the first time I needed such functions and I feel like many other people had the need for some of those at some point.
I hope this PR is helpful for those people.

Credit for the function names, and more importantly the test cases, goes to Haskell's Data.List documentation: https://hackage.haskell.org/package/base-4.14.0.0/docs/Data-List.html#g:11

@craigfe
Copy link
Contributor

craigfe commented Oct 10, 2020

Spotted this PR just as I was about to create an identical one; thanks! 🙂

I think there's a strong case for adding these functions to Seq too (although you may wish to wait to see how the List ones are received first). I'm also happy to make that PR myself.

Copy link
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I support this PR, but as usual with stdlib changes, let's wait for other developers to chime in.

is not satisfied anymore.
*)

val drop_while : ('a -> bool) -> 'a list -> 'a list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val drop_while : ('a -> bool) -> 'a list -> 'a list
val drop_while : f:('a -> bool) -> 'a list -> 'a list

This function is morally equivalent to [(take i l, drop i l)].
*)

val take_while : ('a -> bool) -> 'a list -> 'a list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val take_while : ('a -> bool) -> 'a list -> 'a list
val take_while : f:('a -> bool) -> 'a list -> 'a list

[l1] is the prefix of [l] before the index [i], and
[l2] is the suffix of [l] on and after the index [i].
This function is morally equivalent to [(take i l, drop i l)].
*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"since" annotation missing (here and below and in ListLabels as well).

(** [split_at i l] returns a pair of lists [(l1, l2)], where
[l1] is the prefix of [l] before the index [i], and
[l2] is the suffix of [l] on and after the index [i].
This function is morally equivalent to [(take i l, drop i l)].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a more succint wording: [split_at i l] is [(take i l, drop i l)], but more efficient.


val take_while : ('a -> bool) -> 'a list -> 'a list
(** [take_while p l] returns the prefix of [l] until the predicate [p]
is not satisfied anymore.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the english here a bit unclear. Perhaps something along the lines:

[take_while p l] is [(l1, l2)] where [l = l1 @ l2], [p] is [true] for every element of [l1]
and is [false] for the first element of [l2] (if [l2] is non-empty).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this redrafted explanation does not typecheck: take_while p l is not a pair. For variety, here's the one that I was about to use:

[take_while p l] is the longest (possibly empty) prefix of [l] containing only elements that satisfy [p].

@gasche
Copy link
Member

gasche commented Oct 11, 2020

Personally I am a bit suspicious of the take and drop versions: they only marginally more efficient than split, and in my experience we need them rarely, except when we use the two of them at once and split would be better. For the _while versions, there is no split variant (so we only have functions that lose information), and I don't remember having used these functions in the past. In Haskell some of those are useful building blocks for parser-combinator libraries operating on [Char], or other such scripting-style transformations on the wrong datastructure to represent Text, but in OCaml how often are we using lists with those functions? (split_at is useful as one way to implement merge sort on lists).

For Seq, note that the split functions force their first component. Maybe we should use prefix_while : ('a -> bool) -> 'a Seq.t -> 'a list * 'a Seq.t? A more general function is prefix_map : ('a -> ('b, 'e) result) -> 'a Seq.t -> 'b list * ('a * 'e * 'a Seq.t) option.

@craigfe
Copy link
Contributor

craigfe commented Oct 11, 2020

Going back over my code, here's a suggested use-case for each of take and drop (where split_at would be worse):

  • take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).
  • drop n for discarding a header of size n from a file. (& drop_while used for trimming leading empty lines)

I encourage others to give their own examples.

Aside from the inefficiency, using split_at when only e.g. drop is needed results in code that I find much less readable:

drop n l

(* vs. *)

split_at (n + 1) l |> snd

Regarding Seq.split forcing the first component of its result, I think it's worth keeping this as a sequence so as to enable further transformations to be lazy. I think users can be expected to realise that this sequence is forced (particularly once it is documented), and we don't have any convenient functions for combining lists and sequences (so returning one of each seems likely to require a coercion).

If we wanted to efficiently support the case of take-ing a list from a sequence, I'd suggest a fold_until instead (although short-circuiting folds are their own barrel of worms).

edit (12-01-2021): After some thought & interactions with language newcomers / students, I've seen the error of my ways and would now advise returning lists wherever a sequence is necessarily forced. The better solution to my original concerns seems to be to offer more coercions between lists and other stdlib types (rather than just via. sequences).

@gasche
Copy link
Member

gasche commented Oct 11, 2020

take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).

But then you may want to know whether you should print the ellipsis or not, by checking whether the "rest" of the list/sequence is empty.

drop n for discarding a header of size n from a file

Often it is useful to read the header first, to at least check that it corresponds to what you expect. Besides, when you do this kind of work, list are almost never the right data-structure, and even Seq.t rarely is, you want a representation of the input buffer with more arbitrary-access operations.

drop_while used for trimming leading empty lines

That sounds reasonable.

@craigfe
Copy link
Contributor

craigfe commented Oct 11, 2020

Often it is useful to read the header first, to at least check that it corresponds to what you expect. Besides, when you do this kind of work, list are almost never the right data-structure, and even Seq.t rarely is, you want a representation of the input buffer with more arbitrary-access operations.

I was thinking of Seq.drop, and – with all due respect – I think string Seq.t is a perfectly valid and frequently-useful mechanism for reading files (especially config and data files), and having access to a lower-level representation like in_channel with arbitrary-access operations often ends up being unwieldy. Not all code needs to extract every iota of performance or consider every edge-case.

Regardless, I'm reminded why I don't like presenting use-cases of utility functions and will stop derailing the discussion 🙂

@gasche
Copy link
Member

gasche commented Oct 11, 2020

I hope you didn't find my pushback-comments taxing; presenting use-cases as you do is very useful, thanks!

@undu
Copy link

undu commented Oct 11, 2020

  • take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).

In xapi it's used to limit the size of an error cache: https://github.com/xapi-project/xen-api/blob/b26de9798d2ac9c6e9fc4156906fa6e82ddb96dc/ocaml/xapi/storage_impl.ml#L322, to pad a list of up to a maximum number:
https://github.com/xapi-project/xenopsd/blob/98b7f818a655f655cddf5676ae271765e591fa71/xc/xenops_server_xen.ml#L1183, and to limit the amount of emulated NICs:
https://github.com/xapi-project/xenopsd/blob/98b7f818a655f655cddf5676ae271765e591fa71/xc/device.ml#L3888

For more usages, see https://github.com/search?p=1&q=org%3Axapi-project+%22List.take%22&type=Code

@yallop
Copy link
Member

yallop commented Oct 12, 2020

@gasche:

Personally I am a bit suspicious of the take and drop versions: they only marginally more efficient than split

drop seems quite a bit more efficient than split_at: drop n l doesn't allocate, while split_at n l allocates two lists of length n.

(Perhaps take will also be significantly more efficient than split_at once we have TRMC.)

@gasche
Copy link
Member

gasche commented Oct 12, 2020

(It's possible to write a trmc split_at with... a reference! In general I've discussed multi-output trmc with @chambart but we don't know how to do it nicely.)

@gasche
Copy link
Member

gasche commented Jan 28, 2021

Looking at this PR again (thanks @dra27 for the reminder):

  • I think it is fine if we have take and drop and split_at, even though I think the last one would probably be enough. I would support this part of the PR.
  • take_while and drop_while lose information and lack generality. I think we must have split_while in addition, and we should have the more general split_map : ('a -> ('b, 'e) result) -> 'a list -> 'b list * ('e * 'a list) option.

@craigfe
Copy link
Contributor

craigfe commented Jan 28, 2021

Note that the word "split" is already used by List.split for what Haskell's Base calls "unzip". It's perhaps surprising that List.split_while and List.split_map are not generalisations of List.split, but personally I'd take the inconsistency if it comes with take(_while) and drop(_while) 🙂

I forgot this earlier, because I always forget that the Stdlib doesn't have List.unzip. Jane Street's Base includes both split_while and split_n, but they also use unzip rather than split.


I find the name and type of "split_map" a bit odd because

  • not all of the resulting elements are "mapped" from the input,
  • it "splits" into three segments, unlike its siblings, and
  • the case names of the result type don't match their interpretations by this function.

I worry that call-sites of this function would be difficult to understand for those reasons. In another library, this might be a good use of a "continue-or-stop" type:

type ('a, 'b) t =
| Continue of 'a
| Stop of 'b

although personally I'm not sure that split_while needs to be further generalised for the set of functions to make sense.

@gasche
Copy link
Member

gasche commented Jan 29, 2021

Thinking about this more, I agree that split_map is too complex. Let's settle for split_while.

(I don't think that the traversal behavior is a problem, it is clear from the type and find_map already behaves similarly, and I think that using result for continue-or-stop is rather natural, we accumulate while the function succeeds and stop when it fails. But continue-or-stop also makes sense indeed.)

@0scarB
Copy link

0scarB commented Dec 29, 2023

Similar to @kit-ty-kate I recently was in need of List.take for a summary use-case as @craigfe mentioned. Given the addition of Seq.take, Seq.drop, Seq.take_while and Seq.drop_while in #10583 it seems appropriate to also add them to the List module. As a newcomer to the language the lack of API parity between the two modules is unexpected. Additionally, this addition would make refactoring between these two data structures more intuitive. Finally, from my personal experience I mostly like to choose the conceptually simpler data structure first until I hit a performance wall -- Seq is slightly harder to work with because head is "consumed" leaving prior elements inaccessible without a small code rewrite.

The implementation chosen in Seq.take and Seq.drop now deviate from the original implementations proposed by @kit-ty-kate in that they raise Invalid_argument for negative values of n https://github.com/ocaml/ocaml/blame/b72d99b3dbf3a593f9ff38acc4a375a2e2fd8749/stdlib/seq.ml#L353-L366 and that they just use plain recursion instead of prepending values to an accumulator and then reversing it, which may be more performant. These inconsistencies need to be solved before the PR can be accepted.

This PR is a bit old. Would the maintainers (@gasche) like a new one to be made, closing this one? In which case @kit-ty-kate can work on it if they want or I'd be happy to submit one instead. I can commit to implement List.take, List.drop, List.take_while and List.drop_while and any similar functions from Seq that the maintainers would also like, do a bit of profiling and (very) basic checking of disassembled assembler if required.

I'm liking the language a lot so far. Thanks for all the hard work, it's greatly appreciated :)


tl;dr: List.take, List.drop, List.take_while and List.drop_while should probably be implemented for parity with Seq. Maybe in a new PR because this one is old.

@gasche
Copy link
Member

gasche commented Dec 29, 2023

I agree with the idea of implementing List functions that mirror Seq functions. I think that proposing {take,drop}{,_while} is a good idea. @0scarB I think that submitting a new PR for just that is a good move unless @kit-ty-kate volunteers to rework this one.

I expect the same behavior on invalid inputs between List and Seq. Regarding recursion vs. acc-and-reverse, note that Seq and List have different stack-consumption behaviors due to Seq being an on-demand data structure, so different approaches may make sense. We also have the option of using a [@tail_mod_cons] implementation.

@craigfe is making a good point that there is a name conflict between List.split and split_{at,while}. Should we consider the name cut instead of split? I think that this part of the API discussion could be continued here, independently of what happens to {take,drop}{,_while} for which there is naming consensus.

@gasche
Copy link
Member

gasche commented Dec 29, 2023

(Re. split, we could also add List.{un,}zip as aliases of List.{split,combine}, and de-emphasize List.split over time.)

@0scarB
Copy link

0scarB commented Dec 29, 2023

Hi @gasche, thanks for the quick reply. I agree that List.{take,drop}{,_while} should be a separate PR. split/cut/combine seem to require a bit more discussion. I'll give kit-ty-kate a couple of days to reply before starting work on the PR.

@kit-ty-kate
Copy link
Member Author

@0scarB done in #12869

@0scarB
Copy link

0scarB commented Dec 30, 2023

split_at or cut make sense as a more optimised alternative to (take n l, drop n l). My suspicion is that this would be useful for cases where you have a list, ordered by priority, and want to handle elements with top priority differently than those with lower priority, e.g. search results.

My vote is for @kit-ty-kate original suggestion of split_at because String implements split_on_char.

@gasche's suggestion for zip/unzip as aliases for combine/split make sense because Seq already has both of those functions.

If List.split_at is implemented, Seq.split_at should probably also be implemented and String.split_at and Array.split_at should be considered.

@gasche would it be helpful if I documented the current inconsistencies in the functions implemented by Seq, List, Array and String here or in a new issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants