
pca skeleton #347 (Draft)

wants to merge 37 commits into master
Conversation

@jaewshin (Contributor):

  • Uses the incremental PCA algorithm (set up to produce deterministic output; however, the result might not be identical to running the original PCA on the entire data)
  • Uses a class-based UDF approach

@sk1p (Member) left a comment:

Looks like it has already taken shape quite nicely! Two general comments:

  1. Links to the paper(s) you are using as a reference would be nice (in the docstring)
  2. I think you mentioned that the incremental PCA comes within some bounds of the conventional method; maybe you could construct a test case that verifies this property? It might also help to explicitly create test data that contains a known number of components the PCA should be able to reconstruct (see the sketch below).

Let me know if you need feedback on any specific part.
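
A minimal sketch of the kind of test meant here, assuming synthetic data with a known number of components and sklearn as the reference implementation (all names and tolerances are illustrative, not from this PR):

import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA


def test_incremental_close_to_full_pca():
    rng = np.random.default_rng(0)
    # synthetic data with 3 ground-truth components in 16 dimensions
    true_components = rng.normal(size=(3, 16))
    weights = rng.normal(size=(500, 3))
    data = weights @ true_components + 0.01 * rng.normal(size=(500, 16))

    full = PCA(n_components=3).fit(data)
    inc = IncrementalPCA(n_components=3, batch_size=100).fit(data)

    # compare the spanned subspaces instead of individual vectors, since
    # component signs and ordering are not guaranteed to match
    cosines = np.linalg.svd(full.components_ @ inc.components_.T, compute_uv=False)
    assert np.allclose(cosines, 1.0, atol=1e-2)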

(six review comments on src/libertem/udf/pca.py; outdated, resolved)
import collections

pca = collections.namedtuple('PCA', ['sum_im', 'var', 'N', 'n_components', 'singular_vals'])


def flip_svd(U, V):
@sk1p (Member):
This function is adapted from sklearn, right? It should be properly credited if that is the case (which is completely fine, as long as we comply with their license). Also, the first part of the docstring in sklearn is kind of important: "Sign correction to ensure deterministic output from SVD" - it helps to understand the "why" of the function.
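
For reference, the core of the sign-correction idea (a minimal sketch along the lines of sklearn's svd_flip, not necessarily the exact code in this PR): force the largest-magnitude entry of each column of U to be positive, and flip the matching row of V to compensate.

import numpy as np

def flip_svd(U, V):
    # index of the largest-magnitude entry in each column of U
    max_abs_rows = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[max_abs_rows, np.arange(U.shape[1])])
    # flipping a column of U and the matching row of V leaves U @ V unchanged
    return U * signs, V * signs[:, np.newaxis]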

@jaewshin (Contributor, Author):
Indeed, yes. As for citation, there is no "official" way of citing code directly, other than in the form of a bibliography entry. Also, I couldn't find a DOI for scikit-learn anywhere... Would providing a link to the specific code file (on GitHub) suffice?

@sk1p (Member):
For citation, there is no "official" way of citing code directly
[...] Would providing a link to the specific code file (on GitHub) suffice?

I personally think a link should be fine (one could argue about the "threshold of originality" for copying 4 lines of code, but I'm not sure we want to go there 😁), but the license says:

[...]
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  a. Redistributions of source code must retain the above copyright notice,
     this list of conditions and the following disclaimer.
[...]

So, following the license to the letter, we need to reproduce the license in full somewhere. A good example of this, actually from sklearn itself: https://github.com/scikit-learn/scikit-learn/blob/a50c03f975c4bbf306c5af1826f46ba5dc5230a0/sklearn/utils/_pprint.py - in that case, the whole file is from an external project. Maybe it makes sense to keep code you borrow/adapt from sklearn in a separate file, which has the whole license as a comment at the top (see the sketch below). What do you think?
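
A sketch of what the top of such a file could look like (the file name is hypothetical, and the license text would be reproduced verbatim from scikit-learn):

# (hypothetical file: src/libertem/udf/_sklearn_utils.py)
#
# Parts of this file are adapted from scikit-learn
# (https://github.com/scikit-learn/scikit-learn), used under the
# BSD 3-clause license:
#
# [full scikit-learn copyright notice and license text reproduced here,
#  including the conditions and disclaimer quoted above]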

@jaewshin (Contributor, Author):

Yeah, that shouldn't be a problem at all. I've been writing code for NNMF on the side, and I was looking into the sklearn code for reference as well (sklearn contains some Cython and extra library code for NNMF, so I haven't copied anything yet, except maybe the general idea). If such an issue of copying arises again in the future, having a separate file would ease the burden of citation.

(another review comment on src/libertem/udf/pca.py; outdated, resolved)
@jaewshin (Contributor, Author) commented on Jun 3, 2019:

Thanks a lot for your review! As for your general comment about incremental PCA, I've added citations for the papers I looked into. As for testing, I've been struggling for the past few days to construct a test case. I will let you know if I succeed in making one, or if I get stuck and don't make much progress. Also, I forgot to mention that I have yet to include the part that returns the actual principal components in the code. I will be adding it soon.

@codecov bot commented on Jun 5, 2019:

Codecov Report

Merging #347 into master will increase coverage by 12.48%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master     #347       +/-   ##
===========================================
+ Coverage   43.04%   55.53%   +12.48%     
===========================================
  Files         173       69      -104     
  Lines        6711     4869     -1842     
  Branches      827      590      -237     
===========================================
- Hits         2889     2704      -185     
+ Misses       3707     2052     -1655     
+ Partials      115      113        -2
Impacted Files Coverage Δ
src/libertem/io/fs.py 39.21% <0%> (-41.18%) ⬇️
src/libertem/io/dataset/ser.py 0% <0%> (-32.93%) ⬇️
src/libertem/io/dataset/blo.py 28% <0%> (-22%) ⬇️
src/libertem/io/dataset/frms6.py 0% <0%> (-21.84%) ⬇️
src/libertem/io/dataset/__init__.py 76.47% <0%> (-17.65%) ⬇️
src/libertem/io/utils.py 85.71% <0%> (-14.29%) ⬇️
src/libertem/io/dataset/empad.py 27.72% <0%> (-11.89%) ⬇️
src/libertem/io/dataset/raw.py 80% <0%> (-3.34%) ⬇️
tests/utils.py 35.21% <0%> (-2.82%) ⬇️
src/libertem/common/slice.py 82.14% <0%> (-2.39%) ⬇️
... and 109 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codecov bot commented on Jun 5, 2019:

Codecov Report

Merging #347 into master will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff           @@
##           master    #347   +/-   ##
======================================
  Coverage    0.87%   0.87%           
======================================
  Files         114     114           
  Lines        2049    2049           
  Branches      237     237           
======================================
  Hits           18      18           
  Misses       2031    2031


@jaewshin closed this Jun 10, 2019
@jaewshin reopened this Jun 14, 2019
@sk1p added the GSOC (Google Summer of Code) label Aug 23, 2019
@jaewshin (Contributor, Author) commented:
The initial literature review on dimension reduction is referenced in Issue #363.

  • Implementation for PCA
  • Testing for PCA
  • Documentation for PCA
  • Jupyter notebook for testing PCA

For now, the NMF code has been removed.


.. image:: ./images/pca_diagram.png

As mentioned above, PCA heavily exploits the distributed nature of the UDF architecture. First, using the partitioned data from the partitioning stage of UDF, we perform standard PCA on each of the partitions. Note that the current implementation utilizes :meth:`~libertem.udf.UDFFrameMixin.process_partition` instead of :meth:`~libertem.udf.UDFPartitionMixin.process_frame` to provide an additional boost in the performance of the algorithm. Then, for each partition, we store the loading matrix (the matrix product of the left singular matrix with the singular values) and the component matrix. The component matrix is of primary interest, since it serves as a projection matrix into the lower-dimensional space, thereby achieving dimension reduction.
@sk1p (Member) commented:
I believe process_frame and process_partition are mixed up?
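
For concreteness, the per-partition step described in the quoted paragraph, stripped of the UDF plumbing (a rough numpy sketch, not the code from this PR; the function name and shapes are illustrative):

import numpy as np

def partition_pca(frames, n_components):
    # flatten each frame of this partition into a row vector
    flat = frames.reshape(frames.shape[0], -1)
    flat = flat - flat.mean(axis=0)             # center before PCA
    u, s, v = np.linalg.svd(flat, full_matrices=False)
    loadings = (u * s)[:, :n_components]        # left singular matrix times singular values
    components = v[:n_components]               # projection into the reduced space
    return loadings, components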



Once PCA has been independently applied to all partitions, we proceed to `merge` these independent PCA results, which is done rather naturally by implementing the :meth:`~libertem.udf.UDF.merge` method of the UDF class. More specifically, given the results of PCA on each partition, we first reconstruct the original matrix from the loading and component matrices, and then reduce the dimension of the data matrix via the `hyperbox` method. The `hyperbox` method reduces the dimension of the loading matrix by finding the maximal `hyperbox` that contains the span of the vectors in the loading matrix. In doing so, we find a simpler description of the loading matrix, which reduces the dimension of the data and thereby improves the overall performance. At each iteration, once the `hyperbox` method has been performed for a partition, we use the Incremental PCA algorithm (IPCA) to incrementally update the component matrix by fitting the post-`hyperbox` data matrix. One major advantage of IPCA is that it can easily handle data too large for the standard PCA algorithm, and the UDF architecture served as a natural basis for implementing it.
@sk1p (Member) commented:
Isn't it the other way around: first we reduce, and then we reconstruct a reduced matrix? In general, you should make clear that we are looking for the components here, i.e. we are not creating a description of the loading matrix, but a matrix that is much smaller yet yields the same components when doing PCA.

Nice use of incremental PCA for the merging, btw! :-)

For me it would be very important to describe precisely how the hyperbox method is validated and what its limits are. Did you work on that?

@jaewshin (Contributor, Author):
Yes, you are totally right. I got mixed up in the wording of this. I will fix it and make it clearer that the component matrix is what we are looking for. As for validating the hyperbox method, I did try a few things, which are in pca.ipynb under the example directory, but I couldn't really find a case where the hyperbox method fails. That worries me, since we don't know when and where this method breaks down... If you have any additional ideas on how to verify this, please let me know.
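
To make the merge strategy discussed in this thread concrete, a minimal sketch of folding per-partition PCA results into a running component estimate with sklearn's IncrementalPCA (synthetic data; partition_pca is the hypothetical helper sketched above, and the hyperbox reduction step is omitted):

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)
ipca = IncrementalPCA(n_components=3)

for _ in range(4):                             # stand-in for four partitions
    frames = rng.normal(size=(100, 8, 8))      # one partition's frames
    # partition_pca as sketched earlier (hypothetical helper)
    loadings, components = partition_pca(frames, n_components=3)
    reconstructed = loadings @ components      # reduced reconstruction
    ipca.partial_fit(reconstructed)            # update the running components

print(ipca.components_.shape)                  # (3, 64)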

Labels: GSOC Google Summer of Code
Projects: none yet
3 participants