Entropy #24

LukeMathWalker · 2019-01-27T17:17:27Z

This PR adds a new extension trait for Information Theory quantities - so far, entropy, cross entropy and Kullback-Leibler divergence.
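For readers skimming the thread, the three quantities can be sketched on plain slices. This is a minimal illustration, not the PR's code: the PR implements an extension trait on ndarray arrays, whereas the free functions, the `&[f64]` inputs and the zero-handling convention below are assumptions made to keep the example self-contained.

```rust
/// Shannon entropy in nats: -sum_i p_i * ln(p_i), taking 0 * ln(0) as 0.
/// Returns `None` for empty input.
fn entropy(p: &[f64]) -> Option<f64> {
    if p.is_empty() {
        return None;
    }
    let s: f64 = p
        .iter()
        .map(|&x| if x == 0.0 { 0.0 } else { x * x.ln() })
        .sum();
    Some(-s)
}

/// Cross entropy: -sum_i p_i * ln(q_i).
/// `None` covers BOTH empty input and a length mismatch here.
fn cross_entropy(p: &[f64], q: &[f64]) -> Option<f64> {
    if p.is_empty() || p.len() != q.len() {
        return None;
    }
    let s: f64 = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi == 0.0 { 0.0 } else { pi * qi.ln() })
        .sum();
    Some(-s)
}

/// Kullback-Leibler divergence: sum_i p_i * ln(p_i / q_i).
/// Same `None` convention as `cross_entropy`.
fn kl_divergence(p: &[f64], q: &[f64]) -> Option<f64> {
    if p.is_empty() || p.len() != q.len() {
        return None;
    }
    let s: f64 = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi == 0.0 { 0.0 } else { pi * (pi / qi).ln() })
        .sum();
    Some(s)
}
```

With this shape, `cross_entropy(&[], &[])` and `cross_entropy(&[1.0], &[0.5, 0.5])` both yield `None`, so the caller cannot tell the two failure modes apart.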

Open question: returning an Option for cross entropy and KL divergence conflates two different "failure" modes - None when the arrays are empty, None when we have a dimension mismatch.
Would it be better to return a Result<Option<A>, E>, with a custom DimensionMismatch error type?

jturner314

This looks like useful functionality. I've added some questions and comments.

One other thought – in general, I like matching the behavior of NumPy/SciPy unless we have a specific reason to do something different, because doing so makes it easier to port code from Python to Rust.

jturner314 · 2019-02-25T22:25:56Z

src/entropy.rs

+    ///             i=1
+    /// ```
+    ///
+    /// If the arrays are empty or their lengths are not equal, `None` is returned.


I agree that it's not good to combine these two cases. A few options:

- Result<Option<A>, DimensionMismatch> would be fine.

- I think Result<A, ShapeError> is a bit cleaner (where ShapeError is an enum with Empty and ShapeMismatch variants).

- Another approach is Option<A>, returning None in the empty case and panicking if q cannot be broadcast to the shape of p. I generally avoid panicking, but I think this is the best option in this case because:

  - It's more consistent with ndarray's methods that operate on two arrays.

  - If the caller passes in arrays of mismatched shapes, it's clearly a bug in their code.

  - If the arrays have mismatched shapes, it's hard to see how the caller would recover from this error case.

The question then is why we should return None in the empty case instead of panicking. I think this makes sense because it's much more likely than a shape mismatch, doesn't really indicate a bug in the caller, and is more likely to be recoverable.
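The second option above might look roughly like the following. It is a sketch under stated assumptions: only the `ShapeError` name and its `Empty` and `ShapeMismatch` variants come from the comment, while the variant fields, the slice-based signature and the order of the checks are invented for illustration.

```rust
/// Hypothetical error enum with the two variants named in the comment.
#[derive(Debug, PartialEq)]
enum ShapeError {
    /// The input contains no elements.
    Empty,
    /// The two inputs have different lengths (fields are an assumption).
    ShapeMismatch { first: usize, second: usize },
}

/// Cross entropy on slices, reporting each failure mode separately.
fn cross_entropy(p: &[f64], q: &[f64]) -> Result<f64, ShapeError> {
    // Check the mismatch first so that (empty, non-empty) is a mismatch.
    if p.len() != q.len() {
        return Err(ShapeError::ShapeMismatch {
            first: p.len(),
            second: q.len(),
        });
    }
    if p.is_empty() {
        return Err(ShapeError::Empty);
    }
    let s: f64 = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi == 0.0 { 0.0 } else { pi * qi.ln() })
        .sum();
    Ok(-s)
}
```

A caller who considers either case a bug can still write `cross_entropy(p, q).unwrap()`, which recovers the panicking behaviour of the third option at the call site.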

LukeMathWalker · 2019-02-25

I'd rather go for Result<A, ShapeError>: in the end, the caller can choose how to recover from errors (and is free to call unwrap or expect).
I can foresee various scenarios for both failure modes:

- empty, because you filtered some values out of another array/vec and now nothing is left;

- shape mismatch, because you read two different samples from files on disk which you expected to have the same number of values but... they don't (real life stories here 😅).

To provide a plausible use case: even when it is impossible to recover (that is, to let the program resume its expected execution flow), there might be some actions one wants to perform before panicking, for example logging or, in a web application context, sending a request somewhere.
One can always catch the panic using stuff like this, but it's less pleasant and it isn't clear at first glance from the function signature.
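To make the contrast concrete, here is a small sketch of both caller-side styles. The link target behind "stuff like this" did not survive extraction; `std::panic::catch_unwind` is used below as an assumption about what was meant. All function names and the `String` error type are hypothetical.

```rust
use std::panic;

/// Hypothetical panicking variant: a length mismatch panics.
fn kl_divergence_or_panic(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len(), "shape mismatch");
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi == 0.0 { 0.0 } else { pi * (pi / qi).ln() })
        .sum()
}

/// Hypothetical `Result` variant: the failure is visible in the signature.
fn kl_divergence_checked(p: &[f64], q: &[f64]) -> Result<f64, String> {
    if p.len() != q.len() {
        return Err(format!("shape mismatch: {} vs {}", p.len(), q.len()));
    }
    Ok(kl_divergence_or_panic(p, q))
}

/// With `Result`, the caller can act on the error (here: build a log line).
fn report_checked(p: &[f64], q: &[f64]) -> String {
    match kl_divergence_checked(p, q) {
        Ok(v) => format!("kl = {}", v),
        Err(e) => format!("logged before giving up: {}", e),
    }
}

/// With a panicking API, the caller has to wrap the call in `catch_unwind`,
/// and nothing in the callee's signature hints that this is necessary.
fn report_caught(p: &[f64], q: &[f64]) -> String {
    match panic::catch_unwind(|| kl_divergence_or_panic(p, q)) {
        Ok(v) => format!("kl = {}", v),
        Err(_) => String::from("logged before giving up: caught a panic"),
    }
}
```

Note that `catch_unwind` only works when panics unwind; with `panic = "abort"` the second style is not available at all.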

jturner314 · 2019-02-26

Okay, that makes sense. We have to return an Option/Result anyway, so we might as well handle all the error cases without panicking. Along the same lines, should we return an error on negative values instead of panicking? I'd expect negative values to be a bigger issue than shape mismatches, and we're already checking if values are zero.

LukeMathWalker · 2019-02-26

On that one I am torn, because it's an additional check whose cost scales with the number of elements. Do you think it's worth it?

Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty. Consistency is probably worth the double wrap (unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API).
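The double-wrapped signature can be sketched as follows. Only the `DimensionMismatch` name comes from the thread; its fields and the slice-based function are assumptions for illustration.

```rust
/// Hypothetical error type; the `expected`/`found` fields are an assumption.
#[derive(Debug, PartialEq)]
struct DimensionMismatch {
    expected: usize,
    found: usize,
}

/// `Err` for a length mismatch, `Ok(None)` for empty input,
/// `Ok(Some(value))` otherwise.
fn cross_entropy(p: &[f64], q: &[f64]) -> Result<Option<f64>, DimensionMismatch> {
    if p.len() != q.len() {
        return Err(DimensionMismatch {
            expected: p.len(),
            found: q.len(),
        });
    }
    if p.is_empty() {
        return Ok(None);
    }
    let s: f64 = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi == 0.0 { 0.0 } else { pi * qi.ln() })
        .sum();
    Ok(Some(-s))
}
```

The inner `Option` keeps the empty case consistent with the other methods that return `Option<A>`, at the cost of callers matching on two layers (or writing `cross_entropy(p, q)?` followed by handling `None`).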

jturner314 · 2019-02-27

I didn't realize earlier that the ln of a negative number is NaN (except for the noisy float types, which panic). That behavior is fine with me.
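The NaN behaviour is easy to confirm with plain `f64` values. The helper below is a hypothetical, unchecked slice-based entropy written only for this illustration; it does not cover the noisy float types mentioned above.

```rust
/// Unchecked entropy on a slice: no test for negative entries.
/// (Hypothetical helper, not the PR's trait method.)
fn entropy_unchecked(p: &[f64]) -> f64 {
    let s: f64 = p
        .iter()
        .map(|&x| if x == 0.0 { 0.0 } else { x * x.ln() })
        .sum();
    -s
}

/// `ln` of a negative `f64` is NaN, and NaN propagates through the sum.
fn negative_input_gives_nan() -> bool {
    (-0.5_f64).ln().is_nan() && entropy_unchecked(&[1.5, -0.5]).is_nan()
}
```

So without an explicit check, a negative entry silently turns the whole result into NaN instead of raising an error.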

Fwiw, I did a quick benchmark of the cost of adding a check for negative numbers, and it's pretty negligible (<2%) (the horizontal axis is arr.len(), and vertical axis is time to compute arr.entropy(); 'entropy2' is the checked implementation):

    fn entropy2(&self) -> Option<A>
    where
        A: Float,
    {
        if self.len() == 0 {
            None
        } else {
            let mut negative = false;
            let entropy = self
                .mapv(|x| {
                    if x < A::zero() {
                        negative = true;
                        A::zero()
                    } else if x == A::zero() {
                        A::zero()
                    } else {
                        x * x.ln()
                    }
                })
                .sum();
            if negative {
                None
            } else {
                Some(-entropy)
            }
        }
    }

(Note that I wouldn't actually return an Option in this case; I'd return a Result instead.)

Of course, if we check for negative numbers, someone might wonder why we don't check for numbers greater than 1 too.

I could go either way on the negative number checks. The explicit check is nice, but it adds complexity and we aren't checking other things such as values greater than 1 or the sum of values not being 1, so I guess I'd lean towards leaving off negative number checks for right now.
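For comparison, a full set of checks (negative values, values greater than 1, and normalisation) could be done in one extra pass, as sketched below. This is purely illustrative and is not what the thread settles on; the enum, the function name and the tolerance parameter are all hypothetical.

```rust
/// Hypothetical reasons why a slice is not a probability distribution.
#[derive(Debug, PartialEq)]
enum InvalidDistribution {
    Negative,
    GreaterThanOne,
    NotNormalized,
}

/// One pass over the data: reject negative entries and entries above 1,
/// then check that the total is within `tol` of 1.
fn validate_distribution(p: &[f64], tol: f64) -> Result<(), InvalidDistribution> {
    let mut sum = 0.0;
    for &x in p {
        if x < 0.0 {
            return Err(InvalidDistribution::Negative);
        }
        if x > 1.0 {
            return Err(InvalidDistribution::GreaterThanOne);
        }
        sum += x;
    }
    if (sum - 1.0).abs() > tol {
        Err(InvalidDistribution::NotNormalized)
    } else {
        Ok(())
    }
}
```

Each added condition is cheap on its own, but the set of checks has no obvious stopping point, which is the complexity concern raised above.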

> Thinking over it again, Result<Option<A>, DimensionMismatch> is probably the best signature, given that most of our methods return an Option<A> with None when arrays are empty.

Okay, that's fine.

> unless we want to use DimensionMismatch::Empty instead of Option in the rest of the API

That wouldn't be too bad, although I don't really like returning an error enum when only one of the variants is possible.

src/entropy.rs

LukeMathWalker · 2019-03-09T10:57:42Z

I have addressed all issues I think - it should be ready for merging @jturner314

jturner314

I've added a few minor comments and one more important one (using Error instead of Fail). Otherwise, everything looks good.

src/entropy.rs

src/errors.rs

LukeMathWalker · 2019-03-10T10:54:26Z

All fixed, I'll squash and merge 👍

LukeMathWalker added 15 commits January 22, 2019 08:55

Add entropy trait

76986a9

Remove unnecessary imports

1395859

Implemented entropy

b741b3d

Return entropy with reversed sign, as per definition

7b4f6d5

Fixed tests

cc221e2

Added signature for cross entropy

7473815

Fixed typo.

ea3e81f

Implemented cross_entropy

2998395

Added tests

bfc3d22

Refined panic condition

a69ed91

Added test vs SciPy

3d7929b

Added test vs SciPy

27dbd00

Added KL divergence

873871b

Added KL tests

dc85e9a

Renamed to kl_divergence

ca95788

LukeMathWalker mentioned this pull request Jan 27, 2019

Roadmap #1

Open

17 tasks

jturner314 reviewed Feb 25, 2019

jturner314 and others added 13 commits February 25, 2019 23:09

Update src/entropy.rs

d21a0bb

Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>

Improved docs on behaviour with not normalised arrays

c127428

Improved docs on behaviour with not normalised arrays

0106d65

Use mapv

b28f461

Styling on closures (avoid dereferencing)

ddf358b

Allow different data ownership to interact in kl_divergence

afdcf06

Allow different data ownership to interact in kl_divergence

28b4efd

Allow different data ownership to interact in cross_entropy

8c04f9c

Add a test

450cfb4

Doc improvement

5d45bdf

Check the whole shape

5c72f55

Merge remote-tracking branch 'origin/entropy' into entropy

168ffa5

Fix docs

bb38763

LukeMathWalker and others added 13 commits February 26, 2019 09:02

Broken usage of Zip

c470a3a

Fixed zip, mistery

e4be9b9

Use Zip for cross_entropy

57537c3

Add failure crate as dependency

80198bc

Errors module

93371f8

Use failure crate

5f6a004

Add ShapeMismatch error

42c3600

Merge branch 'entropy' of github.com:LukeMathWalker/ndarray-stats int…

02a63de

…o entropy

Return Result

05d5c66

Fix test suite

99a391e

Fix docs

3a3d1f6

Fix docs

ca31af8

Add docs to error

e65ef61

jturner314 reviewed Mar 9, 2019

jturner314 and others added 10 commits March 10, 2019 10:24

Update src/entropy.rs

e39025c

Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>

Update src/entropy.rs

99b999f

Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>

Update src/entropy.rs

ac4c159

Co-Authored-By: LukeMathWalker <LukeMathWalker@users.noreply.github.com>

Better semantic

b429ec7

Use Error instead of Fail

e9679fa

Formatting

d2dfe8f

Merge branch 'master' into entropy

d8583d2

Fix TOC

b8ed3ed

Module docstring

c961d9f

Merge branch 'entropy' of github.com:LukeMathWalker/ndarray-stats int…

e7ec4b4

…o entropy

LukeMathWalker merged commit 8e0e1cf into rust-ndarray:master Mar 10, 2019

LukeMathWalker deleted the entropy branch March 10, 2019 10:55

jturner314 mentioned this pull request Apr 1, 2019

Histogram error handling #25

Merged
