Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

kit-ty-kate · 2020-10-09T23:37:10Z

I recently had the need for a List.take function and no such function was available in the standard library.
This isn't the first time I needed such functions and I feel like many other people had the need for some of those at some point.
I hope this PR is helpful for those people.

Credit for the function names, and more importantly the test cases, goes to Haskell's Data.List documentation: https://hackage.haskell.org/package/base-4.14.0.0/docs/Data-List.html#g:11

…st module

craigfe · 2020-10-10T16:48:51Z

Spotted this PR just as I was about to create an identical one; thanks! 🙂

I think there's a strong case for adding these functions to Seq too (although you may wish to wait to see how the List ones are received first). I'm also happy to make that PR myself.

nojb

Personally I support this PR, but as usual with stdlib changes, let's wait for other developers to chime in.

nojb · 2020-10-11T06:13:13Z

stdlib/listLabels.mli

+     is not satisfied anymore.
+*)
+
+val drop_while : ('a -> bool) -> 'a list -> 'a list


Suggested change

val drop_while : ('a -> bool) -> 'a list -> 'a list

val drop_while : f:('a -> bool) -> 'a list -> 'a list

nojb · 2020-10-11T06:13:26Z

stdlib/listLabels.mli

+    This function is morally equivalent to [(take i l, drop i l)].
+*)
+
+val take_while : ('a -> bool) -> 'a list -> 'a list


Suggested change

val take_while : ('a -> bool) -> 'a list -> 'a list

val take_while : f:('a -> bool) -> 'a list -> 'a list

nojb · 2020-10-11T06:29:24Z

stdlib/list.mli

+    [l1] is the prefix of [l] before the index [i], and
+    [l2] is the suffix of [l] on and after the index [i].
+    This function is morally equivalent to [(take i l, drop i l)].
+*)


"since" annotation missing (here and below and in ListLabels as well).

nojb · 2020-10-11T06:30:50Z

stdlib/list.mli

+(** [split_at i l] returns a pair of lists [(l1, l2)], where
+    [l1] is the prefix of [l] before the index [i], and
+    [l2] is the suffix of [l] on and after the index [i].
+    This function is morally equivalent to [(take i l, drop i l)].


I would prefer a more succint wording: [split_at i l] is [(take i l, drop i l)], but more efficient.

nojb · 2020-10-11T06:34:52Z

stdlib/list.mli

+
+val take_while : ('a -> bool) -> 'a list -> 'a list
+(** [take_while p l] returns the prefix of [l] until the predicate [p]
+     is not satisfied anymore.


I find the english here a bit unclear. Perhaps something along the lines:

[take_while p l] is [(l1, l2)] where [l = l1 @ l2], [p] is [true] for every element of [l1] and is [false] for the first element of [l2] (if [l2] is non-empty).

I think this redrafted explanation does not typecheck: take_while p l is not a pair. For variety, here's the one that I was about to use:

[take_while p l] is the longest (possibly empty) prefix of [l] containing only elements that satisfy [p].

gasche · 2020-10-11T08:37:58Z

Personally I am a bit suspicious of the take and drop versions: they only marginally more efficient than split, and in my experience we need them rarely, except when we use the two of them at once and split would be better. For the _while versions, there is no split variant (so we only have functions that lose information), and I don't remember having used these functions in the past. In Haskell some of those are useful building blocks for parser-combinator libraries operating on [Char], or other such scripting-style transformations on the wrong datastructure to represent Text, but in OCaml how often are we using lists with those functions? (split_at is useful as one way to implement merge sort on lists).

For Seq, note that the split functions force their first component. Maybe we should use prefix_while : ('a -> bool) -> 'a Seq.t -> 'a list * 'a Seq.t? A more general function is prefix_map : ('a -> ('b, 'e) result) -> 'a Seq.t -> 'b list * ('a * 'e * 'a Seq.t) option.

craigfe · 2020-10-11T09:52:12Z

Going back over my code, here's a suggested use-case for each of take and drop (where split_at would be worse):

take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).
drop n for discarding a header of size n from a file. (& drop_while used for trimming leading empty lines)

I encourage others to give their own examples.

Aside from the inefficiency, using split_at when only e.g. drop is needed results in code that I find much less readable:

drop n l

(* vs. *)

split_at (n + 1) l |> snd

Regarding Seq.split forcing the first component of its result, I think it's worth keeping this as a sequence so as to enable further transformations to be lazy. I think users can be expected to realise that this sequence is forced (particularly once it is documented), and we don't have any convenient functions for combining lists and sequences (so returning one of each seems likely to require a coercion).

~~If we wanted to efficiently support the case of take-ing a list from a sequence, I'd suggest a fold_until instead (although short-circuiting folds are their own barrel of worms).~~

edit (12-01-2021): After some thought & interactions with language newcomers / students, I've seen the error of my ways and would now advise returning lists wherever a sequence is necessarily forced. The better solution to my original concerns seems to be to offer more coercions between lists and other stdlib types (rather than just via. sequences).

gasche · 2020-10-11T10:06:12Z

take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).

But then you may want to know whether you should print the ellipsis or not, by checking whether the "rest" of the list/sequence is empty.

drop n for discarding a header of size n from a file

Often it is useful to read the header first, to at least check that it corresponds to what you expect. Besides, when you do this kind of work, list are almost never the right data-structure, and even Seq.t rarely is, you want a representation of the input buffer with more arbitrary-access operations.

drop_while used for trimming leading empty lines

That sounds reasonable.

craigfe · 2020-10-11T10:21:50Z

Often it is useful to read the header first, to at least check that it corresponds to what you expect. Besides, when you do this kind of work, list are almost never the right data-structure, and even Seq.t rarely is, you want a representation of the input buffer with more arbitrary-access operations.

I was thinking of Seq.drop, and – with all due respect – I think string Seq.t is a perfectly valid and frequently-useful mechanism for reading files (especially config and data files), and having access to a lower-level representation like in_channel with arbitrary-access operations often ends up being unwieldy. Not all code needs to extract every iota of performance or consider every edge-case.

Regardless, I'm reminded why I don't like presenting use-cases of utility functions and will stop derailing the discussion 🙂

gasche · 2020-10-11T12:37:52Z

I hope you didn't find my pushback-comments taxing; presenting use-cases as you do is very useful, thanks!

undu · 2020-10-11T17:28:00Z

take n used for getting a "summary" of a long sequence (e.g. pretty-printing [1; 2; 3; ...], truncating log files, etc.).

In xapi it's used to limit the size of an error cache: https://github.com/xapi-project/xen-api/blob/b26de9798d2ac9c6e9fc4156906fa6e82ddb96dc/ocaml/xapi/storage_impl.ml#L322, to pad a list of up to a maximum number:
https://github.com/xapi-project/xenopsd/blob/98b7f818a655f655cddf5676ae271765e591fa71/xc/xenops_server_xen.ml#L1183, and to limit the amount of emulated NICs:
https://github.com/xapi-project/xenopsd/blob/98b7f818a655f655cddf5676ae271765e591fa71/xc/device.ml#L3888

For more usages, see https://github.com/search?p=1&q=org%3Axapi-project+%22List.take%22&type=Code

yallop · 2020-10-12T09:10:15Z

@gasche:

Personally I am a bit suspicious of the take and drop versions: they only marginally more efficient than split

drop seems quite a bit more efficient than split_at: drop n l doesn't allocate, while split_at n l allocates two lists of length n.

(Perhaps take will also be significantly more efficient than split_at once we have TRMC.)

gasche · 2020-10-12T09:12:57Z

(It's possible to write a trmc split_at with... a reference! In general I've discussed multi-output trmc with @chambart but we don't know how to do it nicely.)

gasche · 2021-01-28T05:20:44Z

Looking at this PR again (thanks @dra27 for the reminder):

I think it is fine if we have take and drop and split_at, even though I think the last one would probably be enough. I would support this part of the PR.
take_while and drop_while lose information and lack generality. I think we must have split_while in addition, and we should have the more general split_map : ('a -> ('b, 'e) result) -> 'a list -> 'b list * ('e * 'a list) option.

craigfe · 2021-01-28T22:49:33Z

Note that the word "split" is already used by List.split for what Haskell's Base calls "unzip". It's perhaps surprising that List.split_while and List.split_map are not generalisations of List.split, but personally I'd take the inconsistency if it comes with take(_while) and drop(_while) 🙂

I forgot this earlier, because I always forget that the Stdlib doesn't have List.unzip. Jane Street's Base includes both split_while and split_n, but they also use unzip rather than split.

I find the name and type of "split_map" a bit odd because

not all of the resulting elements are "mapped" from the input,
it "splits" into three segments, unlike its siblings, and
the case names of the result type don't match their interpretations by this function.

I worry that call-sites of this function would be difficult to understand for those reasons. In another library, this might be a good use of a "continue-or-stop" type:

type ('a, 'b) t =
| Continue of 'a
| Stop of 'b

although personally I'm not sure that split_while needs to be further generalised for the set of functions to make sense.

gasche · 2021-01-29T04:55:39Z

Thinking about this more, I agree that split_map is too complex. Let's settle for split_while.

(I don't think that the traversal behavior is a problem, it is clear from the type and find_map already behaves similarly, and I think that using result for continue-or-stop is rather natural, we accumulate while the function succeeds and stop when it fails. But continue-or-stop also makes sense indeed.)

0scarB · 2023-12-29T15:19:31Z

Similar to @kit-ty-kate I recently was in need of List.take for a summary use-case as @craigfe mentioned. Given the addition of Seq.take, Seq.drop, Seq.take_while and Seq.drop_while in #10583 it seems appropriate to also add them to the List module. As a newcomer to the language the lack of API parity between the two modules is unexpected. Additionally, this addition would make refactoring between these two data structures more intuitive. Finally, from my personal experience I mostly like to choose the conceptually simpler data structure first until I hit a performance wall -- Seq is slightly harder to work with because head is "consumed" leaving prior elements inaccessible without a small code rewrite.

The implementation chosen in Seq.take and Seq.drop now deviate from the original implementations proposed by @kit-ty-kate in that they raise Invalid_argument for negative values of n https://github.com/ocaml/ocaml/blame/b72d99b3dbf3a593f9ff38acc4a375a2e2fd8749/stdlib/seq.ml#L353-L366 and that they just use plain recursion instead of prepending values to an accumulator and then reversing it, which may be more performant. These inconsistencies need to be solved before the PR can be accepted.

This PR is a bit old. Would the maintainers (@gasche) like a new one to be made, closing this one? In which case @kit-ty-kate can work on it if they want or I'd be happy to submit one instead. I can commit to implement List.take, List.drop, List.take_while and List.drop_while and any similar functions from Seq that the maintainers would also like, do a bit of profiling and (very) basic checking of disassembled assembler if required.

I'm liking the language a lot so far. Thanks for all the hard work, it's greatly appreciated :)

tl;dr: List.take, List.drop, List.take_while and List.drop_while should probably be implemented for parity with Seq. Maybe in a new PR because this one is old.

gasche · 2023-12-29T15:42:14Z

I agree with the idea of implementing List functions that mirror Seq functions. I think that proposing {take,drop}{,_while} is a good idea. @0scarB I think that submitting a new PR for just that is a good move unless @kit-ty-kate volunteers to rework this one.

I expect the same behavior on invalid inputs between List and Seq. Regarding recursion vs. acc-and-reverse, note that Seq and List have different stack-consumption behaviors due to Seq being an on-demand data structure, so different approaches may make sense. We also have the option of using a [@tail_mod_cons] implementation.

@craigfe is making a good point that there is a name conflict between List.split and split_{at,while}. Should we consider the name cut instead of split? I think that this part of the API discussion could be continued here, independently of what happens to {take,drop}{,_while} for which there is naming consensus.

gasche · 2023-12-29T15:43:45Z

(Re. split, we could also add List.{un,}zip as aliases of List.{split,combine}, and de-emphasize List.split over time.)

0scarB · 2023-12-29T15:51:13Z

Hi @gasche, thanks for the quick reply. I agree that List.{take,drop}{,_while} should be a separate PR. split/cut/combine seem to require a bit more discussion. I'll give kit-ty-kate a couple of days to reply before starting work on the PR.

kit-ty-kate · 2023-12-29T16:30:21Z

@0scarB done in #12869

0scarB · 2023-12-30T23:46:19Z

split_at or cut make sense as a more optimised alternative to (take n l, drop n l). My suspicion is that this would be useful for cases where you have a list, ordered by priority, and want to handle elements with top priority differently than those with lower priority, e.g. search results.

My vote is for @kit-ty-kate original suggestion of split_at because String implements split_on_char.

@gasche's suggestion for zip/unzip as aliases for combine/split make sense because Seq already has both of those functions.

If List.split_at is implemented, Seq.split_at should probably also be implemented and String.split_at and Array.split_at should be considered.

@gasche would it be helpful if I documented the current inconsistencies in the functions implemented by Seq, List, Array and String here or in a new issue?

Stdlib: Add take, drop, split_at, take_while and drop_while to the Li…

3032b5e

…st module

nojb reviewed Oct 11, 2020

View reviewed changes

dra27 added the stdlib label Nov 25, 2020

dra27 mentioned this pull request Jan 28, 2021

Find n elements in list to get a sublist #10178

Closed

kit-ty-kate mentioned this pull request Dec 29, 2023

Stdlib: Add take, drop, take_while and drop_while to the List module #12869

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

kit-ty-kate commented Oct 9, 2020

craigfe commented Oct 10, 2020

nojb left a comment

nojb Oct 11, 2020

nojb Oct 11, 2020

nojb Oct 11, 2020

nojb Oct 11, 2020

nojb Oct 11, 2020

craigfe Oct 11, 2020

gasche commented Oct 11, 2020

craigfe commented Oct 11, 2020 •

edited

gasche commented Oct 11, 2020

craigfe commented Oct 11, 2020

gasche commented Oct 11, 2020

undu commented Oct 11, 2020 •

edited

yallop commented Oct 12, 2020

gasche commented Oct 12, 2020

gasche commented Jan 28, 2021 •

edited

craigfe commented Jan 28, 2021

gasche commented Jan 29, 2021

0scarB commented Dec 29, 2023

gasche commented Dec 29, 2023

gasche commented Dec 29, 2023

0scarB commented Dec 29, 2023

kit-ty-kate commented Dec 29, 2023

0scarB commented Dec 30, 2023

	val drop_while : ('a -> bool) -> 'a list -> 'a list
	val drop_while : f:('a -> bool) -> 'a list -> 'a list

	val take_while : ('a -> bool) -> 'a list -> 'a list
	val take_while : f:('a -> bool) -> 'a list -> 'a list

Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

Are you sure you want to change the base?

Stdlib: Add take, drop, split_at, take_while and drop_while to the List module #9968

Conversation

kit-ty-kate commented Oct 9, 2020

craigfe commented Oct 10, 2020

nojb left a comment

Choose a reason for hiding this comment

nojb Oct 11, 2020

Choose a reason for hiding this comment

nojb Oct 11, 2020

Choose a reason for hiding this comment

nojb Oct 11, 2020

Choose a reason for hiding this comment

nojb Oct 11, 2020

Choose a reason for hiding this comment

nojb Oct 11, 2020

Choose a reason for hiding this comment

craigfe Oct 11, 2020

Choose a reason for hiding this comment

gasche commented Oct 11, 2020

craigfe commented Oct 11, 2020 • edited

gasche commented Oct 11, 2020

craigfe commented Oct 11, 2020

gasche commented Oct 11, 2020

undu commented Oct 11, 2020 • edited

yallop commented Oct 12, 2020

gasche commented Oct 12, 2020

gasche commented Jan 28, 2021 • edited

craigfe commented Jan 28, 2021

gasche commented Jan 29, 2021

0scarB commented Dec 29, 2023

gasche commented Dec 29, 2023

gasche commented Dec 29, 2023

0scarB commented Dec 29, 2023

kit-ty-kate commented Dec 29, 2023

0scarB commented Dec 30, 2023

craigfe commented Oct 11, 2020 •

edited

undu commented Oct 11, 2020 •

edited

gasche commented Jan 28, 2021 •

edited