added fs2.text.linesWithEndings #2418

fizzy33 · 2021-05-23T21:39:39Z

Worth noting that if we wanted to replace fs2.text.lines with

  /** Transforms a stream of `String` such that each emitted `String` is a line from the input. */
  def lines[F[_]]: Pipe[F, String, String] =
    s => linesWithEndings(s).map(_._1)

The performance hit using LinesBenchmark on my mac mini 2021 was down 3K ops/s from 96K ops/s to 93K ops/s. Which seems sensible as there are a few more allocations and one more map.

The unit test is a robotic copy of the test for lines.

So one open question I have is to replace fs2.text.lines or not.

While more LineEnding's could be expressed the current code doesn't support them so the effort was on expressing simply what the existing system does.

diesalbla

The code seems a positive contribution as is, so I am approving.

However, I would like to ask you to consider two possible optimisations:

Reuse existing streams.

At present, your new method is going to allocate a new string, with its new byte array, for each element in the output. However, suppose that your input contains the following strings (one per line):

Once upon a midnight dreary,\r while I pondered, weak and 
weary,\r Over many a quaint and curious volume of forgotten lore\r

Here, the output should look as follows:

Once upon a midnight dreary,
while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore

If you notice, the first and last strings of output are each contained in the strings from input. One possible optimisation, thus, would be to find out which output-lines are entirely contained within one input line, and use the .substring method to get them out, without the extra process of the StringBuilder.

NOTE I also hoped that substring would be reusing the underlying array, but that may not be the case... https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/String.java#l1945

nikiforo

I haven't had a task, where I needed to preserve different linebreaks. However, I believe there can be such a task.
Concerning the PR, I think that EOF is unnecessary. I understand that the intention was to get rid of nasty Option in (String, Option[LineEnding]), but from my POV the introduction of EOF does more harm than good.
I thought a bit about adding line breaking symbols into the StringBuilder, but that would make it harder to implement lines in terms of linesWithEnding(we would have to look at the last and the second to the last symbols and then cut them off with substring).
So, I’d prefer this signature: def linesWithEndings[F[_]]: Pipe[F, String, (String,Option[LineEnding])] to def linesWithEndings[F[_]]: Pipe[F, String, (String, LineEnding)], and the implementation of lines that you’ve proposed

nikiforo · 2021-05-24T11:39:46Z

core/shared/src/main/scala/fs2/text.scala

+        first: Boolean
+    ): Pull[F, (String, LineEnding), Unit] =
+      stream.pull.uncons.flatMap {
+        case None => if (first) Pull.done else Pull.output1(stringBuilder.result() -> LineEnding.EOF)


If the EOF remains, I think we should replace Pull.done with Pull.output1("", LineEnding.EOF). Otherwise there would be no EOF at the end of the stream, which breaks "EOF at the end of the stream" consistency.

nikiforo · 2021-05-24T11:45:25Z

core/shared/src/main/scala/fs2/text.scala

+//    case object CR extends LineEnding("\r")  looking at the existing code this won't be produced so leaving it out for now
+    case object LF extends LineEnding("\n")
+    case object CRLF extends LineEnding("\r\n")
+    case object EOF extends LineEnding("")


I think that EOF is technically not a LineEnding, but I can't come up with the appropriate name for the group(LF, CRLF, EOF).

lines is not only about files, but about Stream[F, String]. Thus socket, pure streams are also consumable by .lines operator.

fizzy33 · 2021-05-24T12:53:41Z

@diesalbla fwiw I spent about 2 hours in morbid curiousity trying to eek more performance including trying what you suggested in various incarnations and couldn't eek out any more performance. Any case where substring's were used were signficantly less performant. Honestly really surprised by this. I also wanted to see if I could get the performance we wanted using scanChunks but no success. Someone already worked really hard to get it to perform. I suspect the StringBuilder allocated on the call stack is very cheap along with the arrays that StringBuilder.toString generates. They show to be significantly cheaper than String.substring in the profiler I was using.

@nikiforo we have the use case for this in http4s where in roundtrip unit tests it is needed to properly reify the input data with the expected and actual data.

Agreed on the EOF I have similar mis-givings. Does EOS (end of string | end of stream) lessen it ? Suggested from the comment that lines is about streams not files. I toss and turn on the Option[LineEnding] versus having EOF as well. Chose EOF because it was less allocations and assuming all else is equal (which it isn't).

mpilquist · 2021-05-24T12:57:54Z

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

nikiforo · 2021-05-24T13:30:04Z

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

😃

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

Consider this line.
If you Stream("one\r\ntwo").linesWithEndings you would get Stream(("one", CRLF), ("two", EOS)). That looks fine.
However, if you Stream.empty[String].linesWithEndings you would get either Stream.empty and no EOS, or Stream("", EOS). Both of them look wrong from where I stand.

fizzy33 · 2021-05-26T14:58:05Z

So oddly enough I spent the morning hacking through ServerSentEvent unit tests in http4s. ServerSentEvent defines end event as an empty line terminated with a \n. The existing protocol decoder was using fs2.text.lines and in the absence of line endings assumed \n which created corner cases when combined with this related behaviour fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2.

@nikiforo great point the little response I have for you there is the existing behaviour I mentioned of fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2 (not quite an exact analogy)

I have no strong opinion on EOS or Option[LineEnding] I can make both work. Being new here I am not sure how we come to consensus. There are more "exotic" ways to encode this (just throwing spaghetti against the wall),

sealed trait TextOrLineEnding
case class Text(value: String) extends TextOrLineEnding
case class LineEnding(value: String) extends TextOrLineEnding
def linesWithEndings[F[_]]: Pipe[F, String, TextOrLineEnding] = {

Noting with this encoding you lose the obvious no two Text instances in a row but it does have legs.

Another alternate is have the text be optional

def linesWithEndings[F[_]]: Pipe[F, String, (Option[String],LineEnding)] = {

You could arguably pass the @nikiforo test with
Stream.empty[String].linesWithEndings == Stream(None -> EOS)

Anyway I am having a bit of fun just to see if any bright ideas pop out. There is a practical need here. One needs the actual line endings if you are decoding protocols where \n \r\n \r and End are not considered equivalent

Anyone else find it fascinating something as simple as this can get so intricate ?

added fs2.text.linesWithEndings

429a98e

diesalbla approved these changes May 23, 2021

View reviewed changes

nikiforo reviewed May 24, 2021

View reviewed changes

fizzy33 mentioned this pull request May 26, 2021

ServerSentEvent support for data being a multi-line string and possible other ServerSentEvent enhancements http4s/http4s#4865

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added fs2.text.linesWithEndings #2418

added fs2.text.linesWithEndings #2418

fizzy33 commented May 23, 2021

diesalbla left a comment

nikiforo left a comment •

edited

nikiforo May 24, 2021

nikiforo May 24, 2021

fizzy33 commented May 24, 2021

mpilquist commented May 24, 2021

nikiforo commented May 24, 2021

fizzy33 commented May 26, 2021 •

edited

added fs2.text.linesWithEndings #2418

Are you sure you want to change the base?

added fs2.text.linesWithEndings #2418

Conversation

fizzy33 commented May 23, 2021

diesalbla left a comment

Choose a reason for hiding this comment

Reuse existing streams.

nikiforo left a comment • edited

Choose a reason for hiding this comment

nikiforo May 24, 2021

Choose a reason for hiding this comment

nikiforo May 24, 2021

Choose a reason for hiding this comment

fizzy33 commented May 24, 2021

mpilquist commented May 24, 2021

nikiforo commented May 24, 2021

fizzy33 commented May 26, 2021 • edited

nikiforo left a comment •

edited

fizzy33 commented May 26, 2021 •

edited