Aibolit API. Use Cases. ML pipeline. Overview. Discussion. #556

Open
aravij opened this issue Jul 21, 2020 · 9 comments
Labels
discussion There is something to discuss, before code

Comments

@aravij
Contributor

aravij commented Jul 21, 2020

Here we want to collect scenarios for using the aibolit package and discuss the end-user API.
Leave your comments, diagrams and text here.

@aravij aravij added the discussion There is something to discuss, before code label Jul 21, 2020
@acheshkov acheshkov changed the title ML pipline overview discussion. ML pipeline. Overview. Discussion. Jul 24, 2020
@acheshkov acheshkov changed the title ML pipeline. Overview. Discussion. Aibolit API. Use Cases. ML pipeline. Overview. Discussion. Jul 30, 2020
@KatGarmash
Member

KatGarmash commented Jul 30, 2020

First attempt, let me know if it's not the right format:

In general, I'd like feature extraction to be customizable. Right now it is set globally in config.py.

If I may propose a way of implementing it: include the text -> feature_vector extraction in the model training and inference pipeline. A model object would store information about its features, and once a text is passed to it, it would extract them automatically and pass the vector to the ML model.

  • model train function:
from typing import List, Tuple

class Model:
   ....
   def fit_regressor(self, text: List[str], **kwargs) -> None:
     """
         text: a list of strings (programs) for training

          Extracts the features (patterns) specified in **kwargs.
          Fits the regressor.
          Stores the fitted model in self.model and the list of features in self.features (or something like this).
     """
     ....
  • test/rank function (inference):
   def rank(self, text: List[str], **kwargs) -> Tuple[List[int], List[float]]:
       """
            text: a list of strings (programs), the inputs for which to produce recommendations.

            Extracts feature vectors according to self.features.
            Runs prediction on the vectors and calculates the importance of each feature.
            Returns a ranked list of pattern indices and the corresponding importances.
       """

Some use cases where this would be useful/necessary:

  • doing inference with a model pretrained on a non-standard set of features.
  • doing grid search for feature selection with, for example, sklearn.model_selection.GridSearchCV
  • running separate experiments with different feature subsets simultaneously
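
As a rough illustration of the feature-selection use case, the sketch below enumerates feature subsets exhaustively and picks the one with the lowest score. It is plain Python rather than GridSearchCV, and the feature names and the toy_score function are hypothetical; in practice the score would be a validation error from fitting the model on that subset.

```python
from itertools import combinations
from typing import Callable, Iterator, Sequence, Tuple


def feature_subsets(names: Sequence[str], min_size: int = 1) -> Iterator[Tuple[str, ...]]:
    """Yield every subset of `names` with at least `min_size` elements."""
    for size in range(min_size, len(names) + 1):
        yield from combinations(names, size)


def select_best(names: Sequence[str],
                score: Callable[[Tuple[str, ...]], float]) -> Tuple[Tuple[str, ...], float]:
    """Return the subset with the lowest score, together with that score."""
    best = min(feature_subsets(names), key=score)
    return best, score(best)


def toy_score(subset: Tuple[str, ...]) -> float:
    # Hypothetical scoring: pretend "null_checks" is the only useful feature
    # and every extra feature costs one unit.
    return float(len(subset)) - (5.0 if "null_checks" in subset else 0.0)


best_subset, best_value = select_best(["null_checks", "casts", "returns"], toy_score)
```

Each candidate subset is independent of the others, which is also what makes running separate experiments side by side straightforward.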

@lyriccoder
Member

The problem is that we already have functionality tied to the recommend feature, and it does additional work such as filtering suppressWarnings patterns and handling exceptions. Also, we have a certain type of dictionary that is handled by another part of the program (it is returned by the run_recommend_for_file function, which would be the rank function, as I understand it).

So we cannot pass a text or a list of texts. We split this function earlier because it involves a lot of additional actions. Our model is closely tied to the aibolit recommend interface, but @KatGarmash wants a similar function for the model.

@acheshkov Can we just write another function that @KatGarmash needs? It will duplicate some functionality, but it will have the additional features that @KatGarmash needs.

@acheshkov
Member

@lyriccoder yes, you can create a new function with the desired interface

@KatGarmash
Member

KatGarmash commented Aug 4, 2020

Also, we have a certain type of dictionary that is handled by another part of the program (it is returned by the run_recommend_for_file function, which would be the rank function, as I understand it).

@lyriccoder what part of code is this? can you give me the link?
found it

@KatGarmash
Member

Also, we have a certain type of dictionary that is handled by another part of the program (it is returned by the run_recommend_for_file function, which would be the rank function, as I understand it).

@lyriccoder which "certain type of dictionary"?

@KatGarmash
Member

@lyriccoder @acheshkov

I have looked through the methods for the recommend functionality, and if anything, my suggestion will only simplify that code.
For example, lines 74-75 in the predict method would be removed, or the work of the calculate_patterns_and_metrics method would be done internally in the model. I may have overlooked something though.

Actually, now that I have looked at the main.py code, I guess I can use run_recommend_for_file for my purposes, except that I want to pass it kwargs rather than an argparse object with arguments.
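
One way this could look is a thin wrapper that turns keyword arguments into an argparse.Namespace before delegating. This is only a sketch: recommend_stub is a hypothetical stand-in (the real run_recommend_for_file signature is not shown in this thread), and the option names in DEFAULTS are assumed.

```python
import argparse
from typing import Any, Dict

# Assumed option names and defaults, for illustration only.
DEFAULTS: Dict[str, Any] = {"threshold": 1.0, "full": False}


def recommend_stub(file: str, args: argparse.Namespace) -> Dict[str, Any]:
    """Hypothetical stand-in for a function that expects parsed CLI arguments."""
    return {"file": file, "threshold": args.threshold, "full": args.full}


def recommend_with_kwargs(file: str, **kwargs: Any) -> Dict[str, Any]:
    """Accept plain keyword arguments and adapt them to the Namespace-based call."""
    unknown = set(kwargs) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected options: {sorted(unknown)}")
    # Defaults first, then caller overrides.
    args = argparse.Namespace(**{**DEFAULTS, **kwargs})
    return recommend_stub(file, args)
```

With a wrapper like this, the existing Namespace-based function stays untouched while library callers get a kwargs interface.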

@KatGarmash
Member

@acheshkov @lyriccoder updated the specification of the desired functions (see first comment). Is it better?

@lyriccoder
Member

lyriccoder commented Aug 4, 2020

  1. We cannot do def fit_regressor(self, text: List[str], **kwargs) -> None, since we train on a dataset that has already been created.
    You want a different interface.

  2. We also cannot do rank(self, text: List[str], **kwargs) -> Tuple[List[int], List[float]], since a rank function already exists and it takes a list of integers. We have multi-threaded functionality that executes this function in a separate thread, so it is difficult to change its interface.

  3. Just note that if you put everything in fit_regressor, you will lose your calculated dataset. It will be recomputed every time you run fit_regressor, which is very inconvenient and takes a lot of time.

Otherwise, you can save the calculated dataset to a variable and fit as many times as you need with different features.

I will just create 2 different functions for you, @KatGarmash
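
The point about keeping the calculated dataset can be sketched roughly as follows: compute all features once, keep the result in a variable, and select columns per experiment. The extractor names and functions here are hypothetical and only stand in for the real pattern calculation.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical extractors standing in for the (expensive) pattern calculation.
EXTRACTORS: Dict[str, Callable[[str], float]] = {
    "null_checks": lambda text: float(text.count("== null")),
    "casts": lambda text: float(text.count("(int)")),
    "returns": lambda text: float(text.count("return")),
}


def build_dataset(texts: Sequence[str]) -> List[Dict[str, float]]:
    """Run every extractor once per text; this is the costly step."""
    return [{name: fn(text) for name, fn in EXTRACTORS.items()} for text in texts]


def select_columns(dataset: Sequence[Dict[str, float]],
                   features: Sequence[str]) -> List[List[float]]:
    """Cheap per-experiment step: pick only the requested feature columns."""
    return [[row[name] for name in features] for row in dataset]


# Computed once, reused by every experiment below.
dataset = build_dataset(["return a == null;", "return (int) b;"])
experiment_a = select_columns(dataset, ["null_checks"])
experiment_b = select_columns(dataset, ["casts", "returns"])
```

Fitting with a different feature subset then only repeats the column selection, not the extraction.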

@acheshkov
Member

@acheshkov @lyriccoder updated the specification of the desired functions (see first comment). Is it better?

yes, it is

4 participants