Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seq.range #10480

Open
wants to merge 10 commits into
base: trunk
Choose a base branch
from
Open

Seq.range #10480

wants to merge 10 commits into from

Conversation

tbrugere
Copy link

Hi! This is my first PR (ever in an open-source project), so please be patient with me.

Adds

Seq.range
?start: int -> ?stop: int -> ?step: int -> int Seq.t

(I am not sure about the interface, see interface)

What does it do?

This is the equivalent of range from python, or iota from lisp / C...

It builds a sequence containing a range of integers, starting at start (default 0), ending at end not included (default: infinite), and doing steps of step.

More formally, it creates a Seq.t representing the sequence $s_i$ so that
$$s_i = \text{start} + \text{step} \times i$$

and so that

  • If stop is given, that sequence is finite, and contains only the elements that are in the segment
    • $[\text{start}, \text{stop} [$ (if step >= 0)
    • $]\text{stop}, \text{start}]$ (if step < 0)
  • If stop is not given, the sequence is finite

Why?

I love the concept of the Seq.t type, I use python a lot, and I would love to be able to do all that python iterators allow me to do in ocaml, but there is a slight problem with that:

The Seq module doesn’t have a lot of primitives (it’s got 14 functions. Compare with List that has 62 (thanks %s/^val//n)

In particular there isn’t any function in that module to create seq. Though there are conversions from other types, that means that there is no function for creating infinite sequences, or to build sequences as we traverse them

I learnt (from the stdlib/CONTRIBUTING file) that anybody could contribute to the stdlib, so here I am.

Interface

Seq.range
?start: int -> ?stop: int -> ?step: int -> int Seq.t

I’m not very sure about this interface, mainly because all arguments are optional, and optional arguments are rare in the stdlib. (though there are functions that have a lot of optional arguments, for example Filename.Unix.quote_command)

The name and arguments are mainly inspired from the python range function. The fact that they’re named makes them self-explanatory.

The Batteries implementation (in BatEnum) is the following

val range : ?until:int -> int -> int t

The differences are that the start is positional, and there is no step. I think removing step has no advantage since it is an optional argument anyway.

The Containers one is

val range : int -> int -> int t

there is no step and all arguments are positional. And stop is included by default (but there is also a --^ operator that doesn’t include it)

Janestreet Base has:

val range : ?⁠stride:int -> ?⁠start:[ `inclusive | `exclusive ] -> ?⁠stop:[ `inclusive | `exclusive ] -> int -> int -> int t

It is very clear and self-explanatory (especially the \inclusive/ `exclusive` make the function very clear), but because of that it has a lot of arguments, so the interface may be a bit harder to remember

In conclusion, other sensible options include

  • making stop a positional argument (like in python, or in BatEnum)
  • not allowing infinite ranges (like in python)
  • adding / replacing it by a -- operator like in Containers
  • using the same interface as Base (or similar)

Remarks

  • Why is this valid even though it only has optional arguments? Because 'a Seq.t is actually a shorthand for unit ->'a Seq.node, this means this function is actually of type
?start: int -> ?stop: int -> ?step: int ->  unit -> int Seq.node

That explains why there is no (technical) need to have a non-optional arguments here.

Other python iterator functions that would be interesting to have in Seq

and that I would like to implement

  • Seq.init (like List.init) though I’m not sure this is necessary with Seq.range
  • an equivalent of the python itertools.product, that calculates the cartesian product betweens two sequences (and returns a sequence of that product in alphabetic order)
  • Either a Seq.map2 (like List.map2), or a Seq.zip : 'a Seq.t -> 'b Seq.t -> 'a*'b Seq.t like python.zip (one would be easily derivable from the other)
  • Other utility functions that are implemented for list but not for seq such as for_all, reduce, mem, find

@nojb
Copy link
Contributor

nojb commented Jun 27, 2021

xref #9386

@nojb nojb added the stdlib label Jun 27, 2021
@nchataing
Copy link
Contributor

nchataing commented Jun 28, 2021

Hi!

While I am in favor of this function addition, I'm not really convinced by the current function signature : I think it is not very much in the OCaml spirit to have optional arguments, and as you brought out, they are rare in the stdlib. What's more the other implementation suggest to have at least start and step as positional arguments.

For the case of stop, my original idea when reading the PR is that we could have two functions instead of one :

  • Seq.range for finite range;
  • Seq.inf_range (or something like this) for infinite range.
    Another idea could be to have a type Finite of int | Infinite and then have the Infinite value by default (or directly an int option which is by default None)

@tbrugere
Copy link
Author

tbrugere commented Jun 28, 2021

Hi !
For the case of stop, I agree that the options you are proposing would be clearer. I think Seq.range / Seq.inf_range would be better, mainly because

  1. It would not involve creating a type just for 1 function (though it’s true we could use [`Finite of int | `Infinite] instead to mitigate that)
  2. In my experience, finite iterators are more frequent than infinite ones. Hence it would be better if the command for creating a finite range was shorter (ie, not having to write Finite / Some )

Regarding start making it a positional argument would be more idiomatic and consistent with the rest of the stdlib, you’re right.

Regarding step though, none of the other implementations put it as a positional argument:

  • In Batteries and Containers, there is no step argument
  • In Base it is an optional argument called ?stride

I think having that argument is important because it allows for backward ranges, but I don’t know if it is frequent enough to justify having it as a positional argument

What do you think about the following signatures (similar to the Base signature, except for the ommited [`included| `excluded] magic)

Seq.range: ?step:int -> int -> int -> int Seq.t
Seq.inf_range: ?step:int -> int -> int Seq.t

@nchataing
Copy link
Contributor

What do you think about the following signatures (similar to the Base signature, except for the ommited [`included| `excluded] magic)

Seq.range: ?step:int -> int -> int -> int Seq.t
Seq.inf_range: ?step:int -> int -> int Seq.t

Looks good :-)

@xavierleroy
Copy link
Contributor

For the record: I'm not a fan of Seq.inf_range. "An infinite range of integers" is not commonly used (less than 4000 Google hits), and "inf" can mean "infimum" as much as "infinite".

Seq.from 0 ?step:2 looks better, but it could be argued that "from" is very generic and could also mean "convert a collection to a sequence".

@tbrugere
Copy link
Author

Seq.from 0 ?step:2 looks better, but it could be argued that "from" is very generic and could also mean "convert a collection to a sequence".

What do you think about Seq.count_from then? It doesn’t look like a converting operator. Though it loses the advantage of having almost the same name as range. It also sounds the same as the function in some other languages

  • python uses count (but that’s not very readable since it could be counting the number of occurences of an element)
  • boost (c++) has counting_iterator

@xavierleroy
Copy link
Contributor

Let me come back to your remark

In particular there isn’t any function in that module [Seq] to create seq

First, this isn't quite right: see Seq.cons for example.

Second, that's because many functions that create sequences are in other modules, e.g. List.to_seq instead of Seq.of_list.

To be consistent, maybe your range function belongs to module Int and not to module Seq. So that you can have Int.range but also Int32.range or maybe even Float.range.

@tbrugere
Copy link
Author

tbrugere commented Jun 30, 2021

In particular there isn’t any function in that module to create seq.
First, this isn't quite right: see Seq.cons for example.

EDIT :This part of my comment is essentially wrong, see yallop's comment below

That’s true. I think this part of my message wasn’t very precise, so let me attempt to clarify it
Right now the only ways to create a Seq.t are:

  • (Using the constructors directly.)
  • Converting one from another type (with, for example List.to_seq)
  • Using Seq.empty or Seq.return (and, afterwards, filling it with e.g. Seq.cons).
  • Creating one from another Seq.t with Seq.cons, Seq.append, Seq.map, Seq.filter...

These are already very useful because they allow to :

  • easily convert between types, by using seq as an intermediate
  • make operations on data without explicitly constructing the intermediate representations.

But, what I wanted to say by pointing that out, is that in my opinion, none of those ways (except of course for the constructor) make use of the full potential of seq:

  • They do not allow one to create an infinite sequence
  • They do not allow one to actually work on data that doesn’t exist in memory: either the data comes from a preexistent data structure, or you are constructing it with cons / empty, in which case you basically have a list in memory.

This is what I meant by

In particular there isn’t any function in that module to create seq. Though there are conversions from other types, that means that there is no function for creating infinite sequences, or to build sequences as we traverse them

I hope it is clearer now ^^


To be consistent, maybe your range function belongs to module Int and not to module Seq. So that you can have Int.range but also Int32.range or maybe even Float.range.

Good point, it is true that this implementation is specific to the int datatype.
Having Int.range, Int32.range, Int64.range and maybe Float.range would solve that problem. For the record, Containers has CCInt.range, CCInt64.range… (as well as the -- operator in the CCSeq module).

CCInt.range: t -> t -> t iter

Personally, I think it would be a bit confusing. The Int module is not the place in which I’d look for such a function… but that might just be me.

Moreover, I am not sure this would make the naming more consistent : the List.to_seq, Array.to_seq are in other modules because they’re conversion functions. Meanwhile range doesn’t really convert an int to a seq

Additionally, in other parts of the stdlib, there are functions for creating datatypes using int, that do not have an equivalent Int32 or Int64 versions, such as

List.init : int -> (int -> 'a) -> 'a list

Another solution would be perhaps to

  • provide range only for int
  • make it easy to create range for other datatypes.

For example we could provide the function (I’m not sure how to name it, repeat is a very bad name)

Seq.repeat: ('a -> 'a) -> 'a -> 'a t

such that
Seq.repeat f x0 yields x0, f x0, f (f x0) f ( f ( f x0)) etc.

This would make it easy to recreate the infinite range for e.g. Int32

let int32_inf_range ?(step=Int32.one) start = Seq.repeat (Int32.add step)  start

But also for, for example, for Zarith integers

let z_inf_range ?(step=Z.one) start = Seq.repeat (Z.add step)  start

For the finite range, we could similarly make a finite version of repeat.

What do you think?

@yallop
Copy link
Member

yallop commented Jun 30, 2021

You should take a look at Seq.unfold, which allows you to write things like this:

# let range ~start ~stop ~step =
     Seq.unfold (fun i -> if i >= stop then None else Some (i, i+step)) start;;
val range : start:int -> stop:int -> step:int -> int Seq.t = <fun>

# List.of_seq (range ~start:10 ~stop:20 ~step:2);;
- : int list = [10; 12; 14; 16; 18]

or like this:

# let repeat f x = Seq.unfold (fun x -> Some (x, f x)) x;;
val repeat : ('a -> 'a) -> 'a -> 'a Seq.t = <fun>

@tbrugere
Copy link
Author

You should take a look at Seq.unfold, which allows you to write things like this:

Seems like I have read the docs too fast, I wasn’t aware of unfold. Thank you, this simplifies it a lot!

Also it makes my earlier statement about the methods for creating seqs pretty wrong, my bad.

@tbrugere
Copy link
Author

tbrugere commented Jul 5, 2021

Well as a recap, before I change the interface, we have two open questions:

  • should I put range in Seq or in Int (and Int32 etc.) ?
  • inf_range vs. count_from (as a name)

(caveats for each are discussed above)

I'm personnally more in favor of inf_range and putting the functions in module Seq, but how do we decide which interface to keep? Some kind of majority vote?

@ulugbekna
Copy link
Contributor

Doesn't it make more sense to

  • make Seq.range bounded (with signature val range : ?step: int -> int -> int -> t) and
  • introduce another function Seq.from for an unbounded range (val from: ?step: int -> int -> t).

I think this way there is very little surface for argumentation.

The end of the range should either be exclusive as often the case with such iterators (https://doc.rust-lang.org/std/ops/struct.Range.html) and for situations like range 0 (List.length l), or be inclusive as the OCaml for-loop (as a mnemonic). Either way is fine, I guess.

I would say this function belongs to either Seq module or Seq.Int. I tend to think people are more likely to look for an iterator initializer within that iterator's module.

@tbrugere
Copy link
Author

tbrugere commented Jul 19, 2021

OK I split the function into

Seq.range : ?step:int -> int -> int -> int Seq.t 
Seq.count_from : ?step:int -> int -> int Seq.t

As per your recommendations.
I’m marking it as ready for review then

@tbrugere tbrugere marked this pull request as ready for review July 19, 2021 14:52
@nchataing
Copy link
Contributor

Your last commit makes change in part of the code that is unrelated to this PR, I think you should revert it.

This reverts commit 9cf0252.
@tbrugere
Copy link
Author

Yes, I ran ocp-indent on the code since it didn’t pass the checks, but I didn’t think about the side effects… I’m reverting it

@hcarty
Copy link
Member

hcarty commented Jul 23, 2021

If #9312 is happening soon then it's worth making sure that the implementation here would work well with what @c-cube and @fpottier have planned for other additions to Seq.

@tbrugere
Copy link
Author

I looked at Sequel and oseq codes,

  • Both of them have init (which is easily constructed from range and map, and easily allows to create range)
  • Sequel has up : int -> int -> int t and down : int -> int -> int t, which respectively act like range ~step:1 and range ~step:-1
  • Sequel has ints which is the same as count_from ~step:1
  • OSeq doesn’t have either

In conclusion, if #9312 is happening soon, we should decide

  • whether to keep up and down or range.
  • Which is the better name between ints or count_from

@c-cube
Copy link
Contributor

c-cube commented Jul 26, 2021

Nitpick: Oseq has val (--) : int -> int -> int t and val (--^) : int -> int -> int t for inclusive and exclusive ranges. I think the default should be exclusive range but it's too late for that in oseq.

@dbuenzli
Copy link
Contributor

I think the default should be exclusive range but it's too late for that in oseq.

Just curious, do you have any good justification for that ?

Two things that come to mind is that exclusive doesn't allow you to enumerate max_int and would mismatch with for loops.

In another context, most people have been arguing aswell for exclusive bounds, mainly from a proof perspective. But from an ergonomic/code reading perspective I have yet to be convinced.

E.g. it seems to me that the +1 sprinkling @stedolan mentioned there largely depends on the context. His equations do look good but personally, in practice, I keep on having to sprinkle +1s because of what String.find gives me and what String.sub asks me for :-)

Besides the Dijsktra note I mentioned there gives basically three arguments in favor of exclusive, one of which is purely esthetical ("that is ugly") and somehow moot if your datatype allows negative numbers anyway, and the two others are these +1 arguments – lack of which or not can also be seen as esthetical rather than ergonomic arguments and again, depend on the usage context.

The PL world seems definitively undecided; and so I am so far.

@c-cube
Copy link
Contributor

c-cube commented Jul 28, 2021 via email

@tbrugere
Copy link
Author

tbrugere commented Jul 28, 2021

Nitpick: Oseq has val (--) : int -> int -> int t and val (--^) : int -> int -> int t

Sorry, I ctrl-f'd too fast

For arguments in favor of exclusive range, I'd add that

  • As Dijkstra states, it is nice that range a b has b - a elements
  • in particular it is nice that range 0 n maps to the indexes of an array of length n, since we are indexing from 0

@dbuenzli
Copy link
Contributor

  • it works well for rust (for both iterators and array slices).

That's not really compelling… I'm sure most programmers using last in this list would tell it "works well" for them.

Ergonomic and nice.

You didn't mention why inclusive is ergonomic.

I don't find it ergonomic because it's not WYSIWYG. You constantly need interpretative labor (in the form of subtracting 1 to the upper bound) when you read code to understand what you really get.

In the particular case of array indexing which is prone to off-by-one errors I would rather shoot for the solution that needs less mental labor from the code reader. Besides, if we need to resort to aesthetics arguments, it puts your upper bound outside the indexing domain of your array which I don't find particularly nice, if not an impedance mismatch.

It feels wrong to not have empty ranges; and something I don't like with ocaml's for loops.

With inclusive you get empty ranges when your upper bound is strictly smaller than the lower one (e.g. for i = 1 to 0 do print_endline "empty" do).

It seems that with inclusive you don't get to express an empty range at min_int, but with exclusive you don't get to enumerate max_int. Both of which are not really important in the particular case of array zero-based indexing.

But one could argue that not being able to enumerate max_int in Seq.range is a bit of a wart (the reason why OCaml for loops are not exclusive).

  • the +1 thing I assume refers to consecutive ranges, where the start of the next needs to be n+1

The +1 thing is that depending on what functions return you and what you want to do you have to adjust indices (or length computations) by one.

I'm yet to be convinced that using inclusive bounds leads to more +1 in your code as @stedolan mentioned in the discussion I linked to. My take on this is that it is context dependent. For example depending if your chopping predicate characterises elements to keep (while) or the cutting point (until), either convention can lead you to adjust ±1 for further operations. And even if does really lead to more adjustments, the code may still end up being easier to understand (see the ergonomics argument above).

So given that I'm mostly presented with what seems to me aesthetics arguments in favor of exclusive, I'm so far still much more drawn to an inclusive bound convention as it needs less interpretative labor – but then I'm a lazy person and pretty bad at indexing puzzles.

@tbrugere
Copy link
Author

tbrugere commented Jul 29, 2021

I don't find it ergonomic because it's not WYSIWYG. You constantly need interpretative labor (in the form of subtracting 1 to the upper bound) when you read code to understand what you really get.

Yes that’s where the tradeoff resides:

  • range first last (inclusive) makes it easy to reason about bounds : the low bound is first and the high bound is last. Conversely it makes it tricky to reason about the length / number of elements, which is last - first + 1 (note the annoying + 1)
  • range start end (exclusive) makes it easy to reason about the length of the range : it is end - start. Conversely it makes it tricky to reason about the higher bound, which is end - 1 (note the annoying - 1)

In conclusion the situation is like that:

inclusive exclusive
bounds first and last range first last range first (last + 1)
start start and length len range first (start + len - 1) range first (start + len)
in particular: length n, start 0

(indexes of an array of length n)
range 0 (len + 1) range 0 len
good for specifying bounds length

It follows, an important question is: is it better to specify the bounds, or the length ?.

Honestly from my experience, I’ve had to specify the length / number of elements way more often. Especially, since I’ve been using them in python, I find it quite rare to have to write range(n + 1).

I’ve made this comparative array of my experience and what was said above. It contains, for each cases, said advantages, as well as in which cases they are better.
I’ll be happy to edit it with your arguments if you add them below

bounds / inclusive length / exclusive

- WYSIWYG
- is more intuitive at the beginning (see this stackoverflow answer )
- some tasks in scientific computing, such as calculating all the values of a function between some bounds
- going up to max_int / down to min_int

- empty ranges
- iterating over an indexed data structure (this is very frequent, which is why I put it as a special case in above table)
- repeating a task a certain number of times
- thinking about consecutive ranges (for example in divide-and-conquer algorithms, think quicksort)
- some scientific computing such as linear algebra (iterating over a n-vector, a n-matrix, etc.)

@c-cube
Copy link
Contributor

c-cube commented Jul 29, 2021 via email

@dbuenzli
Copy link
Contributor

Honestly from my experience, I’ve had to specify the length / number of elements way more often.

Since the start element is not an optional element in your current proposal I don't really see what's the problem with using 1 as start and n as last.

So I find some of the advantages for exclusive in your table quite a bit disputable, notably the one about repeating task. And as for iterating over datastructures you will likely rather invoke an M.to_seq function than use Seq.range.

Regarding indexing/slicing maybe we are a bit OT. But I still don't see what exclusive bounds makes particularly "more consistent" with 0-based indexing/slicing. Personally I find that in my indexed data structures chopping tasks I'm much more interested in specifying bounds, that is which indices are going to be part of my results.

For indexed datastructures it's interesting to ponder a bit on trying to caracterize the datatype of your function arguments. With exclusive bounds you need to introduce a new datatype/concept to be able to talk about the behaviour of functions in a meaningful way: see the discussion the about indices and positions of a string in the String module which in my opinion makes the functions of the whole string module much more complicated to understand than it could be – I like my occam razor too :-)

@c-cube
Copy link
Contributor

c-cube commented Jul 30, 2021 via email

@c-cube
Copy link
Contributor

c-cube commented Jul 30, 2021 via email

@bluddy
Copy link
Contributor

bluddy commented Jul 30, 2021

I think in the absence of a compelling case (and there's not going to be an overly compelling case one way or another), we want to aim for consistency if possible, and if not, then consistency with the majority of cases. It's far more important that the user build up an expectation of how things work than deciding between several fairly similar designs. If ranges are exclusive in the API (and I think they are), then they should stay that way. Of course, the one exception I can think of is baked into the language itself in the for construct, and I do wish that was different, as I hate having to remember to subtract one from length every time.

@dbuenzli
Copy link
Contributor

Consider the arguments of Array.sub, Bytes.blit, etc. They are all length-based,

As I mentioned when I tried to introduce String.subrange I precisely find that to be a defect :-)

In most chopping tasks you chunks get caracterized by indices (results of finding functions), having to convert them to lengths just introduces more opportunities for off-by-one errors.

@c-cube
Copy link
Contributor

c-cube commented Jul 30, 2021

In most chopping tasks you chunks get caracterized by indices (results of finding functions), having to convert them to lengths just introduces more opportunities for off-by-one errors.

That's only one specific use case though. A lot of other tasks are more amenable to offset + length, in my experience, even the most basic reading/writing with bytes with subsequent Bytes.sub_string, or blit.

@xavierleroy
Copy link
Contributor

Many functions over sequences were just introduced into 4.14, including Seq.ints, which does some of the proposed Seq.range. Do we need to keep this PR alive?

@mndrix
Copy link
Contributor

mndrix commented Jan 17, 2022

Many functions over sequences were just introduced into 4.14

For my reference, this was #10583.

Do we need to keep this PR alive?

While using 4.14, I find myself reimplementing Seq.range often enough that I'd still enjoy having it available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet