
pca skeleton #347 (Draft)

wants to merge 37 commits into master
Conversation

@jaewshin (Contributor):

  • Uses the incremental PCA algorithm (set up to produce deterministic output; however, the result might not be identical to running the original PCA on the entire data)
  • Uses a class-based UDF approach

@sk1p (Member) left a comment:

Looks like it has already taken shape quite nicely! Two general comments:

  1. Links to the paper(s) you are using as a reference would be nice (in the docstring)
  2. I think you mentioned that the incremental PCA comes within some bounds of the conventional method; maybe you could construct a test case that verifies this property? It might also help to explicitly create test data that contains a known number of components the PCA should be able to reconstruct (see the sketch below).

Let me know if you need feedback on any specific part.
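
A minimal sketch of the kind of test meant here, assuming synthetic data with a known number of components and sklearn as the reference implementation (all names and tolerances are illustrative, not from this PR):

import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA


def test_incremental_close_to_full_pca():
    rng = np.random.default_rng(0)
    # synthetic data with 3 ground-truth components in 16 dimensions
    true_components = rng.normal(size=(3, 16))
    weights = rng.normal(size=(500, 3))
    data = weights @ true_components + 0.01 * rng.normal(size=(500, 16))

    full = PCA(n_components=3).fit(data)
    inc = IncrementalPCA(n_components=3, batch_size=100).fit(data)

    # compare the spanned subspaces instead of individual vectors, since
    # component signs and ordering are not guaranteed to match
    cosines = np.linalg.svd(full.components_ @ inc.components_.T, compute_uv=False)
    assert np.allclose(cosines, 1.0, atol=1e-2)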

(six review comments on src/libertem/udf/pca.py; outdated, resolved)
import collections

pca = collections.namedtuple('PCA', ['sum_im', 'var', 'N', 'n_components', 'singular_vals'])


def flip_svd(U, V):
@sk1p (Member):
This function is adapted from sklearn, right? It should be properly credited if that is the case (which is completely fine, as long as we comply with their license). Also, the first part of the docstring in sklearn is kind of important: "Sign correction to ensure deterministic output from SVD" - it helps to understand the "why" of the function.
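
For reference, the core of the sign-correction idea (a minimal sketch along the lines of sklearn's svd_flip, not necessarily the exact code in this PR): force the largest-magnitude entry of each column of U to be positive, and flip the matching row of V to compensate.

import numpy as np

def flip_svd(U, V):
    # index of the largest-magnitude entry in each column of U
    max_abs_rows = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[max_abs_rows, np.arange(U.shape[1])])
    # flipping a column of U and the matching row of V leaves U @ V unchanged
    return U * signs, V * signs[:, np.newaxis]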

@jaewshin (Contributor, Author):
Indeed, yes. As for citation, there is no "official" way of citing code directly, other than in the form of a bibliography entry. Also, I couldn't find a DOI for scikit-learn anywhere... Would providing a link to the specific code file (on GitHub) suffice?

@sk1p (Member):
For citation, there is no "official" way of citing code directly
[...] Would providing a link to the specific code file (on GitHub) suffice?

I personally think a link should be fine (one could argue about the "threshold of originality" for copying 4 lines of code, but I'm not sure we want to go there 😁), but the license says:

[...]
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  a. Redistributions of source code must retain the above copyright notice,
     this list of conditions and the following disclaimer.
[...]

So, following the license to the letter, we need to reproduce the license in full somewhere. A good example of this, actually from sklearn itself: https://github.com/scikit-learn/scikit-learn/blob/a50c03f975c4bbf306c5af1826f46ba5dc5230a0/sklearn/utils/_pprint.py - in that case, the whole file is from an external project. Maybe it makes sense to keep code you borrow/adapt from sklearn in a separate file, which has the whole license as a comment at the top (see the sketch below). What do you think?
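
A sketch of what the top of such a file could look like (the file name is hypothetical, and the license text would be reproduced verbatim from scikit-learn):

# (hypothetical file: src/libertem/udf/_sklearn_utils.py)
#
# Parts of this file are adapted from scikit-learn
# (https://github.com/scikit-learn/scikit-learn), used under the
# BSD 3-clause license:
#
# [full scikit-learn copyright notice and license text reproduced here,
#  including the conditions and disclaimer quoted above]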

@jaewshin (Contributor, Author):

Yeah, that shouldn't be a problem at all. I've been writing code for NNMF on the side, and I was looking into the sklearn code for reference as well (sklearn contains some Cython and extra library code for NNMF, so I haven't copied anything yet, except maybe the general idea). If such an issue of copying arises again in the future, having a separate file would ease the burden of citation.

(another review comment on src/libertem/udf/pca.py; outdated, resolved)
@jaewshin (Contributor, Author) commented on Jun 3, 2019:

Thanks a lot for your review! As for your general comment about incremental PCA, I've added citations for the papers I looked into. As for testing, I've been struggling for the past few days to construct a test case. I will let you know if I succeed in making one, or if I get stuck and don't make much progress. Also, I forgot to mention that I have yet to include the part that returns the actual principal components in the code. I will be adding it soon.

@codecov bot commented on Jun 5, 2019:

Codecov Report

Merging #347 into master will increase coverage by 12.48%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master     #347       +/-   ##
===========================================
+ Coverage   43.04%   55.53%   +12.48%     
===========================================
  Files         173       69      -104     
  Lines        6711     4869     -1842     
  Branches      827      590      -237     
===========================================
- Hits         2889     2704      -185     
+ Misses       3707     2052     -1655     
+ Partials      115      113        -2
Impacted Files Coverage Δ
src/libertem/io/fs.py 39.21% <0%> (-41.18%) ⬇️
src/libertem/io/dataset/ser.py 0% <0%> (-32.93%) ⬇️
src/libertem/io/dataset/blo.py 28% <0%> (-22%) ⬇️
src/libertem/io/dataset/frms6.py 0% <0%> (-21.84%) ⬇️
src/libertem/io/dataset/__init__.py 76.47% <0%> (-17.65%) ⬇️
src/libertem/io/utils.py 85.71% <0%> (-14.29%) ⬇️
src/libertem/io/dataset/empad.py 27.72% <0%> (-11.89%) ⬇️
src/libertem/io/dataset/raw.py 80% <0%> (-3.34%) ⬇️
tests/utils.py 35.21% <0%> (-2.82%) ⬇️
src/libertem/common/slice.py 82.14% <0%> (-2.39%) ⬇️
... and 109 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codecov bot commented on Jun 5, 2019:

Codecov Report

Merging #347 into master will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff           @@
##           master    #347   +/-   ##
======================================
  Coverage    0.87%   0.87%           
======================================
  Files         114     114           
  Lines        2049    2049           
  Branches      237     237           
======================================
  Hits           18      18           
  Misses       2031    2031


@jaewshin closed this Jun 10, 2019
@jaewshin reopened this Jun 14, 2019
@sk1p added the GSOC (Google Summer of Code) label Aug 23, 2019
@jaewshin (Contributor, Author) commented:
The initial literature review on dimension reduction is referenced in Issue #363.

  • Implementation for PCA
  • Testing for PCA
  • Documentation for PCA
  • Jupyter notebook for testing PCA

For now, the NMF code has been removed.


.. image:: ./images/pca_diagram.png

As mentioned above, PCA heavily exploits the distributed nature of the UDF architecture. First, using the partitioned data from the partitioning stage of UDF, we perform standard PCA on each of the partitions. Note that the current implementation utilizes :meth:`~libertem.udf.UDFFrameMixin.process_partition` instead of :meth:`~libertem.udf.UDFPartitionMixin.process_frame` to provide an additional boost in the performance of the algorithm. Then, for each partition, we store the loading matrix (the matrix product of the left singular matrix with the singular values) and the component matrix. The component matrix is of primary interest, since it serves as a projection matrix into the lower-dimensional space, thereby achieving dimension reduction.
@sk1p (Member) commented:
I believe process_frame and process_partition are mixed up?
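
For concreteness, the per-partition step described in the quoted paragraph, stripped of the UDF plumbing (a rough numpy sketch, not the code from this PR; the function name and shapes are illustrative):

import numpy as np

def partition_pca(frames, n_components):
    # flatten each frame of this partition into a row vector
    flat = frames.reshape(frames.shape[0], -1)
    flat = flat - flat.mean(axis=0)             # center before PCA
    u, s, v = np.linalg.svd(flat, full_matrices=False)
    loadings = (u * s)[:, :n_components]        # left singular matrix times singular values
    components = v[:n_components]               # projection into the reduced space
    return loadings, components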



Once PCA has been independently applied to all partitions, we proceed to `merge` these independent PCA results, which is done rather naturally by implementing the :meth:`~libertem.udf.UDF.merge` method of the UDF class. More specifically, given the results of PCA on each partition, we first reconstruct the original matrix from the loading and component matrices, and then reduce the dimension of the data matrix via the `hyperbox` method. The `hyperbox` method reduces the dimension of the loading matrix by finding the maximal `hyperbox` that contains the span of the vectors in the loading matrix. In doing so, we find a simpler description of the loading matrix, which reduces the dimension of the data and thereby improves the overall performance. At each iteration, once the `hyperbox` method has been performed for a partition, we use the Incremental PCA algorithm (IPCA) to incrementally update the component matrix by fitting the post-`hyperbox` data matrix. One major advantage of IPCA is that it can easily handle data too large for the standard PCA algorithm, and the UDF architecture served as a natural basis for implementing it.
@sk1p (Member) commented:
Isn't it the other way around: first we reduce, and then we reconstruct a reduced matrix? In general, you should make clear that we are looking for the components here, i.e. we are not creating a description of the loading matrix, but a matrix that is much smaller yet yields the same components when doing PCA.

Nice use of incremental PCA for the merging, btw! :-)

For me it would be very important to describe precisely how the hyperbox method is validated and what its limits are. Did you work on that?

@jaewshin (Contributor, Author):
Yes, you are totally right. I got mixed up in the wording of this. I will fix it and make it clearer that the component matrix is what we are looking for. As for validating the hyperbox method, I did try a few things, which are in pca.ipynb under the example directory, but I couldn't really find a case where the hyperbox method fails. That worries me, since we don't know when and where this method breaks down... If you have any additional ideas on how to verify this, please let me know.
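
To make the merge strategy discussed in this thread concrete, a minimal sketch of folding per-partition PCA results into a running component estimate with sklearn's IncrementalPCA (synthetic data; partition_pca is the hypothetical helper sketched above, and the hyperbox reduction step is omitted):

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)
ipca = IncrementalPCA(n_components=3)

for _ in range(4):                             # stand-in for four partitions
    frames = rng.normal(size=(100, 8, 8))      # one partition's frames
    # partition_pca as sketched earlier (hypothetical helper)
    loadings, components = partition_pca(frames, n_components=3)
    reconstructed = loadings @ components      # reduced reconstruction
    ipca.partial_fit(reconstructed)            # update the running components

print(ipca.components_.shape)                  # (3, 64)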

Labels: GSOC Google Summer of Code
Projects: none yet
3 participants