Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array.min in 2.12.12 vs 2.13.3 #12107

Closed
kamilkloch opened this issue Aug 6, 2020 · 4 comments
Closed

Array.min in 2.12.12 vs 2.13.3 #12107

kamilkloch opened this issue Aug 6, 2020 · 4 comments

Comments

@kamilkloch
Copy link

kamilkloch commented Aug 6, 2020

Scala 2.12.12:

> println(Array(2.0, 1.0, Double.NaN).min)
NaN

Scala 2.13.3:

> println(Array(2.0, 1.0, Double.NaN).min)
1.0

Which one is correct? ;)

@Jasper-M
Copy link
Member

Jasper-M commented Aug 6, 2020

There's some history on this topic #11844

@eed3si9n
Copy link
Member

eed3si9n commented Aug 6, 2020

scala/scala#8721 is the PR that came out of it and it should summarize the situation.

@NthPortal
Copy link

Welcome to Scala 2.12.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265).
Type in expressions for evaluation. Or try :help.

scala> Array(2.0, 1.0, Double.NaN, 3.0).min
res0: Double = 3.0

2.13.3 is the one that's correct (or at the very least more correct). I don't believe we can backport a fix (though I could be mistaken)

@eed3si9n
Copy link
Member

eed3si9n commented Aug 9, 2020

Array(2.0, 1.0, Double.NaN, 3.0).min sounds like a clear bug in Scala 2.12.12, which I think we could backport a fix to return NaN.

However, in the original example, I would argue that both are "correct" in the sense that they work as intended.

Scala 2.12.12 conforms to IEEE 754 floating point semantics, like java.lang.Math.min(1.0, Double.NaN) returning NaN.

Scala 2.13.3 on the other hand conforms to total ordering like java.lang.Double.compare being able to sort an array of Double.

HeartSaVioR pushed a commit to apache/spark that referenced this issue Dec 1, 2020
…d if built with Scala 2.13

### What changes were proposed in this pull request?

This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13.

![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png)
![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png)

The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13.

The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`.
Actually, the comparison logic for `Double` type was changed in Scala 2.13.
scala/bug#12107
scala/scala#6410

So this issue happens as of Scala 2.13.

The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164).
This `NaN` seems to be an initial value of `inputTimeSec` so I think `Double.PositiveInfinity` is suitable rather than `NaN` and this change can resolve this issue.

### Why are the changes needed?

To make sure we can use the histogram/timeline with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

First, I built with the following commands.
```
$ /dev/change-scala-version.sh 2.13
$ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package
```

Then, ran the following query (this is brought from #30427 ).
```
import org.apache.spark.sql.streaming.Trigger
val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
  .selectExpr("tsMod", "mod(value, 100) as mod", "value")
  .withWatermark("tsMod", "10 seconds")
  .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()
```

Finally, I confirmed that the timeline and histogram are rendered.
![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png)

```

Closes #30546 from sarutak/ss-nan.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants