pca skeleton #347
base: master
Conversation
jaewshin commented on May 29, 2019:
- Uses the incremental PCA algorithm (set up to produce deterministic output; note that the result might not be identical to running the original PCA on the entire data)
- Uses a class-based UDF approach
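The incremental update mentioned above can be sketched in plain numpy, in the style of sklearn's `IncrementalPCA` update. This is a hedged illustration, not the PR's actual class; the name `StreamingPCA` and its interface are invented here for demonstration:

```python
import numpy as np

class StreamingPCA:
    """Minimal sketch of an incremental PCA update (illustrative, not the PR's code)."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.mean = None
        self.n_seen = 0
        self.singular_values = None
        self.components = None

    def partial_fit(self, batch):
        n_new = batch.shape[0]
        if self.mean is None:
            # First batch: plain centered SVD
            self.mean = batch.mean(axis=0)
            x = batch - self.mean
        else:
            # Subsequent batches: stack the previous factorization,
            # the centered new batch, and a mean-shift correction row
            total = self.n_seen + n_new
            batch_mean = batch.mean(axis=0)
            new_mean = (self.n_seen * self.mean + n_new * batch_mean) / total
            correction = np.sqrt(self.n_seen * n_new / total) * (self.mean - batch_mean)
            x = np.vstack([
                self.singular_values[:, None] * self.components,
                batch - batch_mean,
                correction,
            ])
            self.mean = new_mean
        _, s, vt = np.linalg.svd(x, full_matrices=False)
        self.singular_values = s[:self.n_components]
        self.components = vt[:self.n_components]
        self.n_seen += n_new
```

The truncation to `n_components` after each SVD is what makes the result only approximately equal to a single PCA over all data, as the description notes.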
Looks like it has already taken shape quite nicely! Two general comments:
- Links to the paper(s) you are using as reference would be nice (in the docstring)
- I think you mentioned that the incremental PCA comes within some bounds of the conventional method - maybe you could construct a test case that verifies this property? Maybe also explicitly create test data that contains a known number of components that the PCA should be able to reconstruct.
Let me know if you need feedback on any specific part.
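The second suggestion can be sketched as a self-contained test (numpy only; the function name is hypothetical): build data with a known number of independent components and check that the PCA spectrum reflects exactly that rank.

```python
import numpy as np

def test_pca_recovers_known_components():
    # Synthetic data with exactly 3 independent components plus a mean offset
    rng = np.random.default_rng(42)
    scores = rng.normal(size=(100, 3))
    components = rng.normal(size=(3, 16))
    data = 1.0 + scores @ components
    # Conventional PCA via SVD of the centered data
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    # The first 3 singular values carry all the variance; the rest vanish
    assert s[2] > 1e-6 * s[0]
    assert np.all(s[3:] < 1e-8 * s[0])
```

A bounds check against the incremental variant could then reuse the same synthetic data and compare the spans of the recovered component matrices.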
src/libertem/udf/pca.py
Outdated

pca = collections.namedtuple('PCA', ['sum_im', 'var', 'N', 'n_components', 'singular_vals'])

def flip_svd(U, V):
This function is adapted from sklearn, right? It should be properly credited if this is the case (which is completely fine, as long as we comply with their license). Also, the first part of the docstring in sklearn is kind of important: "Sign correction to ensure deterministic output from SVD" - it helps to understand the "why" of the function.
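The sign correction being discussed can be illustrated in a few lines of numpy (conceptually adapted from sklearn's `svd_flip`; a hedged sketch, not the PR's code): flip each component so that the largest-magnitude entry of its column in U is positive, so the factorization no longer depends on the arbitrary signs returned by the SVD.

```python
import numpy as np

def flip_svd(u, v):
    """Sign correction to ensure deterministic output from SVD (sketch)."""
    # Index of the largest-magnitude entry in each column of u
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    # Flip columns of u and the corresponding rows of v together,
    # so the product u @ diag(s) @ v is unchanged
    return u * signs, v * signs[:, np.newaxis]
```

Because both factors are flipped together, the reconstruction is untouched while repeated runs (which may return either sign) now agree.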
Indeed yes. For citation, there is no "official" way of citing code directly, other than in the form of a bibliography. Also, I couldn't find a DOI for scikit-learn anywhere... Would providing a link to the specific code file (on GitHub) suffice?
For citation, there is no "official" way of citing code directly
[...] Would providing a link to the specific code file (on GitHub) suffice?
I personally think a link should be fine (one could argue about the "threshold of originality" for copying 4 lines of code, but I'm not sure we want to go there 😁) but the license says:
[...]
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
a. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
[...]
So, following the license to the letter, we need to reproduce the license in full somewhere. A good example of this, actually in sklearn: https://github.com/scikit-learn/scikit-learn/blob/a50c03f975c4bbf306c5af1826f46ba5dc5230a0/sklearn/utils/_pprint.py - in that case, the whole file is from an external project. Maybe it makes sense to keep code you borrow/adapt from sklearn into a separate file, which has the whole license as a comment at the top. What do you think?
Yeah, that shouldn't be a problem at all. I've been writing code for NNMF on the side, and I was looking into the sklearn code for reference as well (sklearn contains some Cython and extra library material for NNMF, so I haven't copied anything yet, except maybe the general idea). If such an issue of copying arises again in the future, having a separate file would ease the burden of citation.

Thanks a lot for your review! As for your general comment about incremental PCA, I've added the citations for the papers I looked into. As for testing, I've been struggling for the past few days to construct a test case. I will let you know if I succeed in making one, or if I get stuck and don't make much progress. Also, I forgot to mention that I have yet to include the part about returning the actual principal components in the code. I will be adding it soon.
Codecov Report
@@ Coverage Diff @@
## master #347 +/- ##
===========================================
+ Coverage 43.04% 55.53% +12.48%
===========================================
Files 173 69 -104
Lines 6711 4869 -1842
Branches 827 590 -237
===========================================
- Hits 2889 2704 -185
+ Misses 3707 2052 -1655
+ Partials 115 113 -2
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #347 +/- ##
======================================
Coverage 0.87% 0.87%
======================================
Files 114 114
Lines 2049 2049
Branches 237 237
======================================
Hits 18 18
Misses 2031 2031
Continue to review full report at Codecov.
Initial literature review on dimension reduction is referenced on Issue #363
.. image:: ./images/pca_diagram.png
As mentioned above, PCA heavily exploits the distributed nature of the UDF architecture. First, using the partitioned data from the partitioning stage of UDF, we perform standard PCA on each of the partitions. Note that the current implementation utilizes :meth:`~libertem.udf.UDFFrameMixin.process_partition` instead of :meth:`~libertem.udf.UDFPartitionMixin.process_frame` to provide an additional boost in the performance of the algorithm. Then, for each partition, we store the values for the loading matrix (the matrix product of the left singular matrix with the singular values) and the component matrix, which is of primary interest, since the component matrix serves as a projection matrix into the lower-dimensional space, thereby achieving dimension reduction.
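The per-partition step described in this paragraph can be sketched in plain numpy, independent of the actual LiberTEM UDF API (the function name and shapes are illustrative): each partition of flattened frames yields a loading matrix (left singular vectors scaled by the singular values) and a component matrix.

```python
import numpy as np

def process_partition_pca(partition, n_components):
    # partition: (n_frames, n_pixels) block of flattened frames
    mean = partition.mean(axis=0)
    u, s, vt = np.linalg.svd(partition - mean, full_matrices=False)
    # Loading matrix: left singular vectors scaled by the singular values
    loadings = u[:, :n_components] * s[:n_components]
    # Component matrix: projection into the lower-dimensional space
    components = vt[:n_components]
    return loadings, components, mean
```

By construction, `loadings @ components` reproduces the centered partition exactly whenever `n_components` is at least the partition's intrinsic rank.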
I believe process_frame and process_partition are mixed up?
docs/source/app/pca.rst
Outdated
Once PCA has been independently applied to all partitions, we proceed to `merge` the results of these independent PCA runs, which is done rather naturally by implementing the :meth:`~libertem.udf.UDF.merge` method of the UDF class. More specifically, given the results of the PCA on each partition, we first reconstruct the original matrix using the loading and component matrices, and then reduce the dimension of the data matrix via the `hyperbox` method. The `hyperbox` method reduces the dimension of the loading matrix by finding the maximal `hyperbox` that contains the span of the vectors in the loading matrix. In doing so, we can find a simpler description for the loading matrix, which then reduces the dimension of the data, thereby leading to an increase in overall performance. At each iteration, once the `hyperbox` method has been performed for the partition, we use the Incremental PCA algorithm (IPCA) to incrementally update the component matrix by fitting the post-`hyperbox` data matrix. One major advantage of IPCA is that it can easily handle large data that the standard PCA algorithm cannot, and the UDF architecture served as a basis for implementing Incremental PCA.
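The reconstruct-then-merge idea can be illustrated in numpy. This is a simplified sketch that ignores the means and the `hyperbox` reduction step (the function name is invented here): reconstruct the data blocks from each partition's factors, stack them, and take a fresh SVD, whose right singular vectors are the merged components.

```python
import numpy as np

def merge_pca(loadings_a, components_a, loadings_b, components_b, n_components):
    # Reconstruct the (reduced) data block from each partition's factors
    # and re-run SVD on their vertical stack; the leading right singular
    # vectors are the merged component matrix.
    stacked = np.vstack([loadings_a @ components_a,
                         loadings_b @ components_b])
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:n_components]
```

In the actual pipeline, an incremental update (e.g. feeding each reconstructed block to an IPCA `partial_fit`) replaces the full stacked SVD, which is what keeps the memory footprint bounded.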
Isn't it the other way around, first we reduce and then we reconstruct a reduced matrix? In general, you should make clear that we are looking for the components here, i.e. we are not creating a description for the loading matrix, but a matrix that is much smaller but yields the same components when doing PCA.
Nice use of incremental PCA for the merging, btw! :-)
For me it would be very important to describe precisely how the hyperbox method is validated and what its limits are. Did you work on that?
Yes, you are totally right. I got mixed up in the wording of this. I will fix it and make it clearer that the component matrix is what we are looking for. As for validating the hyperbox method, I did try a few things (see pca.ipynb under the example directory), but couldn't really find a case where the hyperbox method failed, which worries me, since we don't know when and where this method fails... If you have any additional ideas on how to verify this, please let me know.