
Speed-Up Preprocessing + NLP #162

Open · wants to merge 26 commits into master

Conversation

@henrifroese (Collaborator) commented Aug 23, 2020

We spent a lot of time looking at every possible option to speed up the library. Main findings:

  1. str.replace and s.apply(re.sub) are equally fast (see Speed up preprocessing module #124) -> no need to change this
  2. dask requires users to switch to different data structures themselves, which adds significant complexity for users. We believe only the few users whose datasets do not fit into RAM would profit from this, so it does not make sense to change all of texthero for it.
  3. modin as a drop-in replacement for pandas performed very badly for small to medium sized DataFrames
  4. pandarallel did not work for pd.apply inside the library
  5. Implementing multiprocessing through a decorator does not work in Python: the pool has to pickle the function it maps, and a wrapper defined inside a decorator cannot be pickled (see every result when googling "python multiprocessing decorator", and the sketch below this list)
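
To make point 5 concrete, here is a minimal sketch of the failure mode (our own illustration, not code from this PR):

import multiprocessing as mp

def parallelize(func):
    # The wrapper is a local closure -> pickle cannot find it by name.
    def wrapper(s):
        return func(s)
    return wrapper

@parallelize
def lowercase(texts):
    return [t.lower() for t in texts]

if __name__ == "__main__":
    pool = mp.Pool(2)
    # Fails with "Can't pickle local object 'parallelize.<locals>.wrapper'"
    # because mp.Pool has to pickle the mapped function.
    pool.map(lowercase, [["A"], ["B"]])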

The only way we see to parallelize things while still keeping a great User Experience and not slowing down users with small and medium sized datasets is the following:

Implement a function helper.parallel that handles parallelization for Pandas Series and use that to wrap our functions. Example:

# Example old approach:
@InputSeries(TextSeries)
def remove_round_brackets(s: TextSeries) -> TextSeries:
    """
    some nice docstring
    """
    return s.str.replace(r"\([^()]*\)", "")     

# Example new approach: same for the user from the outside,
# a little more complicated (but fully parallelized!) on the inside.

def _remove_round_brackets(s: TextSeries) -> TextSeries:
    return s.str.replace(r"\([^()]*\)", "")


@InputSeries(TextSeries)
def remove_round_brackets(s: TextSeries) -> TextSeries:
    """
    some nice docstring
    """
    return parallel(s, _remove_round_brackets)

So we keep all functionality that is not parallelizable (like initializing patterns, downloading a spaCy model, ...) in a function f with a nice docstring that the user can see. At the end, this function calls helper.parallel(s, _f, other arguments), where _f houses all the parallelizable functionality.
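
For instance, a hedged sketch of what this could look like for a function with setup work (the function name and pattern here are our illustration, not code from this PR):

def _remove_digits(s: TextSeries, pattern: str) -> TextSeries:
    # Parallelizable part: runs on each chunk in a separate process.
    return s.str.replace(pattern, "")


@InputSeries(TextSeries)
def remove_digits(s: TextSeries, only_blocks=True) -> TextSeries:
    """
    some nice docstring
    """
    # Non-parallelizable setup: build the pattern once, up front.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    return parallel(s, _remove_digits, pattern)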

Here's the helper.parallel implementation; it should be rather self-explanatory:

import functools
import multiprocessing as mp

import numpy as np
import pandas as pd

cores = mp.cpu_count()
partitions = cores

# Min. number of rows in a Series to parallelize -> no slowdown for small
# datasets due to parallelization overhead.
MIN_LINES_FOR_PARALLELIZATION = 10000
# Allows users to fully turn off parallelization for all dataset sizes.
PARALLELIZE = True


def _apply_to_chunk(func, args, kwargs, s_chunk):
    # Module-level helper so multiprocessing can pickle it. Keeps the
    # Series chunk as the first argument of func, exactly like the
    # sequential call below (functools.partial alone would put *args
    # in front of the chunk).
    return func(s_chunk, *args, **kwargs)


def parallel(s, func, *args, **kwargs):

    if len(s) < MIN_LINES_FOR_PARALLELIZATION or not PARALLELIZE:
        # Execute as usual.
        return func(s, *args, **kwargs)

    else:
        # Execute in parallel.

        # Split the data up into batches.
        s_split = np.array_split(s, partitions)

        # Open a process pool (processes, not threads, so the CPU-bound
        # work is not serialized by the GIL).
        pool = mp.Pool(cores)

        # Execute in parallel and concat results (order is kept).
        s_result = pd.concat(
            pool.map(functools.partial(_apply_to_chunk, func, args, kwargs), s_split)
        )

        pool.close()
        pool.join()

        return s_result
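
From the user's side, nothing changes; a quick usage sketch (our own illustration):

import pandas as pd

s_small = pd.Series(["some text (with brackets)"] * 100)
s_big = pd.Series(["some text (with brackets)"] * 50000)

remove_round_brackets(s_small)  # below the threshold: runs sequentially
s_clean = remove_round_brackets(s_big)  # above it: split across all cores

# Order and index are preserved by np.array_split + pd.concat.
assert s_clean.index.equals(s_big.index)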

We know that this introduces some added complexity for developers in the nlp and preprocessing modules, but we believe it's the best solution for users (automatic parallelization and the same user experience as before), and with some more developer documentation, other contributors should not have any problems.

MultiProcessing Implementation in Texthero.pdf

Of course, lots of tests are added to test_helper.py.
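
For reference, such a test might look roughly as follows (a sketch of the shape of such a test, not the actual code from test_helper.py; note the mapped function must live at module level so it can be pickled):

import pandas as pd
from texthero import helper


def _remove_brackets(s):
    return s.str.replace(r"\([^()]*\)", "")


def test_parallel_matches_sequential():
    # Long enough to trigger the parallel code path.
    s = pd.Series(["a (b) c"] * (helper.MIN_LINES_FOR_PARALLELIZATION + 1))
    pd.testing.assert_series_equal(
        helper.parallel(s, _remove_brackets),
        _remove_brackets(s),
    )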

Note: only so many lines changed because this builds upon #157

@henrifroese henrifroese marked this pull request as ready for review August 24, 2020 16:28
@mk2510 mk2510 mentioned this pull request Aug 25, 2020
@henrifroese (Collaborator, Author) commented:

@jbesomi, we have now tested more with modin and thought about different implementations. The problem is that there is no modin.apply or something similar that we can just use. If we have a modin Series s, we profit from the speedup of doing s.apply(...). So we would need users to do import modin.pandas as pd, and we would also need to change our library everywhere to return modin-pandas objects. We see the following options:

  1. Use our implementation from above for parallelization that is extremely simple for users and relatively simple for developers. Describe in a tutorial that if users want to work with huge datasets that don't fit into RAM, they need a different solution (e.g. modin) that does much more than just parallelizing (partitioning DataFrames etc.)

  2. Change our library to always check at the beginning whether the input is a modin object (see the sketch below this list). If it is, only use code that's optimized with modin (that's not too restrictive, as modin covers a lot of the pandas API) and return modin objects. Otherwise, just work as usual. Then just tell users "if you work with big datasets, just import modin.pandas as pd and we take care of the rest".

  3. Fully switch to modin. That's possible and the easiest way, but very slow for everything below ~ a few 100k rows.

  4. Change our library to check whether the input is over MIN_LINES_FOR_PARALLELIZATION. If it is, transform the input to a modin object, profit from the modin parallelization, and re-transform the output from modin to pandas. This adds a big overhead for the conversions, so we're not a fan.
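
For option 2, the dispatch could look roughly like this (a sketch under the assumption that modin's Series supports the same str.replace call; not code from this PR):

def remove_round_brackets(s):
    try:
        import modin.pandas as mpd

        if isinstance(s, mpd.Series):
            # modin parallelizes the pandas API across partitions itself,
            # so we skip our own multiprocessing and return a modin object.
            return s.str.replace(r"\([^()]*\)", "")
    except ImportError:
        # modin not installed -> plain pandas path.
        pass

    return parallel(s, _remove_round_brackets)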

Options 3+4 seem like a bad idea. Option 1 is the easiest as it's already ready to merge and gives users a great speed boost for data that fits into RAM (probably the case for most users). Option 2 is probably the best for users with big datasets, as they will profit from the whole modin framework (partitioning etc.) and can use our library fully parallelized; of course, it's a little more for us to code, but not that much 🥴 🥈 .

Interested in what you think :octocat:

@jbesomi (Owner) commented Sep 1, 2020

Thank you for the detailed comments!

  1. I'm a fan of this approach! 🎉
  2. That's not for now. We can start with 1, see how it goes, and eventually do 2.
  3. Not a big fan.
  4. It would probably be the best approach in case the transformation isn't expensive, but that's not the case, right? Can you quantify it?

@mk2510 (Collaborator) commented Sep 1, 2020

We have now created a notebook that compares the two implementations (1 and 4). The conversion, as you can see at the bottom of the notebook, is quite cheap, but the computation part takes about twice as long as what we achieve with our implementation. 🥇
See this pdf as an example. :octocat:

Zipped Notebook

@jbesomi jbesomi marked this pull request as draft September 14, 2020 15:45
@mk2510 (Collaborator) commented Sep 22, 2020

We have now merged #157 (i.e. the current master) into this PR and are ready for review/merge 🐤 🙏

@henrifroese (Collaborator, Author) commented:

TODO:

  • resolve conflict
  • refine explanation of parallelization for developers, with examples
  • make sure parallelization is explained in getting-started
