
python3.6 How to share A = ahocorasick.Automaton() in multiprocess ? #114

Open
feiyangzhang opened this issue Jul 26, 2019 · 14 comments

@feiyangzhang

No description provided.

@WojciechMula
Owner

That's a good question. The module was not designed to be thread-safe.

What's your use case?

@Salma7amed

Salma7amed commented Feb 5, 2020

I have a multiprocessing program that spawns multiple processes to split a huge file among them; each process's role is then to check whether the long string it is responsible for contains any of the possible substrings (the words added to the automaton).

So, in this case, is it safe to create the automaton at server start-up and then pass the instance to each process?

@WojciechMula
Owner

@Salma7amed I have no experience with multiprocess Python software. The Aho-Corasick searching procedure does not modify the data, but fetching the value associated with a key is not thread-safe, as it bumps up reference counters.

@WojciechMula
Owner

Guys, could you please provide some simple proof-of-concept multi-process script that models your use case? As I mentioned, I have no experience with that kind of program, but with an example I could check how the module behaves.

@OblackatO

OblackatO commented Jul 27, 2020

There is a large difference between multi-processing and multi-threading in Python. @feiyangzhang it will be hard to literally share the Automaton object between processes. When Python creates a new process, it launches a new Python interpreter, which means the memory address space of a child process is not shared with that of its parent. The only reason the parent's variables can still be accessed is that the parent's memory is copied to the child, and hence the variables too. If you change some variable in the child and then read that same variable in the parent, you will notice that the parent process still has the old value.
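A quick stand-alone sketch of this copy-on-process-creation behaviour (the variable names here are made up for illustration): the child mutates its own copy and reports what it sees through a Queue, while the parent's copy stays untouched.

```python
from multiprocessing import Process, Queue

shared_list = ["original"]  # lives in the parent process

def child(q):
    # The child got its own copy of the parent's memory, so this
    # mutation is invisible to the parent.
    shared_list.append("changed in child")
    q.put(len(shared_list))

if __name__ == "__main__":
    q = Queue()
    p = Process(target=child, args=(q,))
    p.start()
    p.join()
    print(q.get())           # 2 -- the child saw its own modified copy
    print(len(shared_list))  # 1 -- the parent's copy is unchanged
```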

When it comes to threads, they do indeed share the memory address space of their parent. Hence, you can interchange variables' values between child threads and the parent (perhaps with synchronization primitives, but that is another topic).

In most situations there is no point in using pyahocorasick with threads, because of the Python GIL. The situation @Salma7amed mentioned is a perfect example of this. The GIL ensures that only one thread can run at a time: threads run concurrently, but not in parallel, because each thread needs to acquire the interpreter lock before its code can be interpreted (and run). Aho-Corasick searching is CPU-bound, so if you have several threads searching with pyahocorasick, one thread will always be waiting on another (putting aside CPU scheduling algorithms that may grant different CPU time to each thread).
However, multi-processing will indeed run in parallel, because each process has its own Python interpreter.

What an irony, eh? Threads share memory but do not run in parallel; processes run in parallel but do not share memory.

Since Python 3.8, there is the multiprocessing.shared_memory module. Yes, it is what you are thinking: the memory is literally shared among processes. However, in order to share an object, data type, or whatever you want, it needs to be convertible to bytes.
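A minimal sketch of that constraint (the block name "demo_block" and the payload are made up): the payload must first become bytes, and then a second handle opened by name sees the same buffer.

```python
import pickle
from multiprocessing import shared_memory

# Anything you want to share must first be converted to bytes.
payload = pickle.dumps({"hello": "world"})

# Create a named block and write the bytes into it.
shm = shared_memory.SharedMemory(create=True, size=len(payload), name="demo_block")
shm.buf[:len(payload)] = payload

# Another handle (here in the same process, but it could be any process)
# attaches to the same block by name and reads the bytes back.
shm2 = shared_memory.SharedMemory(name="demo_block")
restored = pickle.loads(bytes(shm2.buf[:len(payload)]))
print(restored)  # {'hello': 'world'}

shm2.close()
shm.close()
shm.unlink()  # free the named block
```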

Now the question is, @WojciechMula: is there any way you can make the Automaton convertible to bytes, and from bytes back to an Automaton again? This would be the real stuff.
If you can do this and solve the problem you mentioned above: "The Aho-Corasick searching procedure does not modify the data, but fetching the value associated with a key is not thread-safe, as it bumps up reference counters.", I will be down to implement the shared-memory part in Python 3.8 and make a pull request.
Also, if you need help with something else concerning this subject, I can help you.

@OblackatO

OblackatO commented Jul 29, 2020

Here is an example of what it looks like to share an Automaton among several processes in Python 3.8, with the new multiprocessing.shared_memory features.

import ahocorasick
import io
import pickle
from multiprocessing import Process, shared_memory


trie_patterns = ["THISISAMALICOUSIOC", "TEST_email@gmail.com", "another_email@hotmail.com", "27.185.77.23", "190.7.139.81", "8934:9ABC:F6A0:19F7:E8EC:A65F:43B3:AB56"]

# Build the automaton and serialize it to bytes.
A = ahocorasick.Automaton()
for x in trie_patterns:
    A.add_word(x, x)
A.make_automaton()
automaton_to_bytes = io.BytesIO()
pickle.dump(A, automaton_to_bytes)
automaton_bytes = automaton_to_bytes.getvalue()

# Put the serialized automaton into a named shared memory block.
name = "pyaho_test"
automaton_len = len(automaton_bytes)
shm_a = shared_memory.SharedMemory(create=True, size=automaton_len, name=name)
shm_a.buf[:automaton_len] = automaton_bytes

def perform_search(to_search, automaton_len):
    # Attach to the parent's shared memory block and deserialize the automaton.
    shared_mem = shared_memory.SharedMemory(name)
    automaton_in_bytes = io.BytesIO(shared_mem.buf[:automaton_len])
    A = pickle.load(automaton_in_bytes)
    while True:
        for match in A.iter(to_search):
            r = match[1]

if __name__ == "__main__":
    all_p = []
    try:
        for x in range(20):
            p = Process(target=perform_search, args=("TEST_email@gmail.com", automaton_len))
            p.start()
            all_p.append(p)
        for p in all_p:
            p.join()
    except KeyboardInterrupt:
        shm_a.unlink()

I was actually able to convert an Automaton to bytes and back.
Each process has a while True loop and the memory keeps increasing... Is this supposed to happen? Could it be because of the reference counters? Anyway, the match is found and no errors occur.

@Dobatymo

@OblackatO Doesn't this just make a copy from the shared memory for each process? Passing a copy of the automaton to each process would have the same effect, no? I would expect proper multiprocessing support to keep the automaton in memory only once.

@OblackatO

OblackatO commented Jul 31, 2020

@Dobatymo Unfortunately, it seems it does copy the shared memory object, which is not what I understood from the documentation. I edited the code above into a more practical example, with a bounded while loop in the children. Also, I built an Automaton with 30k patterns that I load from an external file, in these lines:

import json

trie_patterns = []
to_search_in = []
total_added_to_search_in = 0
total_to_search_in = 100  # example value; the original snippet leaves this undefined

with open("strings_file_30k.json", "r") as all_strings:
    strings = json.loads(all_strings.read())
for string_category in strings:
    for string_sample in strings[string_category]:
        trie_patterns.append(string_sample)
        if total_added_to_search_in != total_to_search_in:
            to_search_in.append(string_sample)
            total_added_to_search_in += 1

Just load the strings variable from some other source, or randomly generate 30k words and add them to strings, and the code should run fine for you too.

If we look at the memory analysis of the code with memory_profiler, using the command:

mprof run --multiprocess pyahocorasick_example.py

(Where pyahocorasick_example.py is the file with the code above)

We see that:

Filename: pyahocorasick_example.py

Line #    Mem usage    Increment   Line Contents
================================================
    44     69.6 MiB     69.6 MiB   @profile
    45                             def perform_search(to_search_in, process_number, automaton_len):
    46     69.6 MiB      0.0 MiB       print(f'On process {process_number}')
    47     69.6 MiB      0.0 MiB       done_iterations = 0
    48     69.6 MiB      0.0 MiB       shared_mem = shared_memory.SharedMemory(name)
    49     69.7 MiB      0.1 MiB       automaton_in_bytes = shared_mem.buf[:automaton_len]
    50    122.5 MiB     52.8 MiB       automaton_in_bytes = io.BytesIO(automaton_in_bytes)
    51    191.9 MiB     69.4 MiB       A = pickle.load(automaton_in_bytes)
    52    191.9 MiB      0.0 MiB       while done_iterations != total_iterations:
    53    191.9 MiB      0.0 MiB           for match in A.iter(to_search_in[random.randint(0, len(to_search_in)-1)]):
    54    191.9 MiB      0.0 MiB               r = match[1]
    55    191.9 MiB      0.0 MiB           done_iterations += 1
    56    191.9 MiB      0.0 MiB       print(f'Done searching on process {process_number}')
    57    165.3 MiB      0.0 MiB       shared_mem.close()

This is the function where the child processes are created. It starts directly at 69.6 MiB, which is the memory of the parent. Already here there is something wrong: the Automaton should be around 30 MiB, because there are 30k patterns, and I don't know where the remaining ~39 MiB come from. We might argue that it is the shared_memory object plus the Automaton object, which doubles the amount of memory, but I don't think so. If you look at the code, I set A = None so the Automaton is hopefully destroyed (I cannot control garbage collection). If I do the same memory analysis without setting A = None, this happens:

Filename: pyahocorasick_example.py

Line #    Mem usage    Increment   Line Contents
================================================
    44    104.4 MiB    104.4 MiB   @profile
    45                             def perform_search(to_search_in, process_number, automaton_len):
    46    104.4 MiB      0.0 MiB       print(f'On process {process_number}')
    47    104.4 MiB      0.0 MiB       done_iterations = 0
    48    104.4 MiB      0.0 MiB       shared_mem = shared_memory.SharedMemory(name)
    49    104.5 MiB      0.1 MiB       automaton_in_bytes = shared_mem.buf[:automaton_len]
    50    157.3 MiB     52.8 MiB       automaton_in_bytes = io.BytesIO(automaton_in_bytes)
    51    226.8 MiB     69.5 MiB       A = pickle.load(automaton_in_bytes)
    52    226.8 MiB      0.0 MiB       while done_iterations != total_iterations:
    53    226.8 MiB      0.0 MiB           for match in A.iter(to_search_in[random.randint(0, len(to_search_in)-1)]):
    54    226.8 MiB      0.0 MiB               r = match[1]
    55    226.8 MiB      0.0 MiB           done_iterations += 1
    56    226.8 MiB      0.0 MiB       print(f'Done searching on process {process_number}')
    57    200.1 MiB      0.0 MiB       shared_mem.close()

So I can only assume that the Automaton is indeed garbage collected when I set it to None.

We can also see in both memory analyses that at the line automaton_in_bytes = shared_mem.buf[:automaton_len] the memory does not increase, which shows that the piece of the parent's shared memory holding the Automaton is indeed shared.
There is a ramp-up at the lines automaton_in_bytes = io.BytesIO(automaton_in_bytes) and A = pickle.load(automaton_in_bytes), but this is expected: BytesIO copies the bytes of automaton_in_bytes, and I guess the same thing happens when the Automaton is unpickled.
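The copy can be demonstrated without pyahocorasick at all (a small stand-alone sketch; the buffer contents are made up): a memoryview slice still aliases the underlying buffer, while io.BytesIO materialises its own copy the moment it is constructed.

```python
import io

backing = bytearray(b"automaton-bytes")  # stands in for shm.buf
view = memoryview(backing)[:9]           # like shm.buf[:automaton_len] -- no copy
copied = io.BytesIO(view)                # BytesIO copies the bytes here

# Mutate the underlying buffer (same length, so no resize).
backing[0:9] = b"XXXXXXXXX"

print(bytes(view))        # b'XXXXXXXXX' -- the view aliases the buffer
print(copied.getvalue())  # b'automaton' -- the BytesIO copy is unaffected
```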
For ctypes structures that are convertible to bytes, there is a way around this with the function:

from_buffer(source[, offset])
This method returns a ctypes instance that shares the buffer of the source object. The source object must support the writeable buffer interface. The optional offset parameter specifies an offset into the source buffer in bytes; the default is zero. If the source buffer is not large enough a ValueError is raised.

What is happening in io.BytesIO and in the Automaton unpickling is something similar to this:

from_buffer_copy(source[, offset])
This method creates a ctypes instance, copying the buffer from the source object buffer which must be readable. The optional offset parameter specifies an offset into the source buffer in bytes; the default is zero. If the source buffer is not large enough a ValueError is raised.

I wonder if there is a way to make the Automaton class usable with the from_buffer function. After all, pyahocorasick is written in C... I looked at the code already, but I cannot find an easy answer.

Finally, I have one last remark. I created an Automaton in one Python file X and put it in shared memory. Then I created another Python script Y and attached to the shared memory created by X. The code of X is similar to the code in my comment above; I just did not launch the processes, called input() after creating the shared memory, and printed the shared memory name.
The code of Y looks like this:

import sys 
import ahocorasick
import io
import pickle
import random
from multiprocessing import shared_memory
from memory_profiler import profile


@profile
def run_child_process():
    to_search_in = ["christopherbaker@hotmail.com", "mossjoseph@thompson-simpson.info"]
    shared_mem = shared_memory.SharedMemory(sys.argv[1])
    automaton_in_bytes = shared_mem.buf[:int(sys.argv[2])]
    automaton_in_bytes = io.BytesIO(automaton_in_bytes)
    A = pickle.load(automaton_in_bytes)
    total_iterations = 10000
    done_iterations = 0
    while done_iterations != total_iterations:
        for match in A.iter(to_search_in[random.randint(0, len(to_search_in)-1)]):
            r = match[1]
            #print(r)
        done_iterations += 1
    shared_mem.close()
    shared_mem.unlink()
    print(f'Done searching on process {sys.argv[3]}')
    sys.exit(0)

if __name__ == '__main__':
    run_child_process()

To call Y, I used:

python3.8 -m memory_profiler shared_mem_executor.py <shared_mem_object_name> <len_of_Automaton_in_bytes> <process_number>
python3.8 -m memory_profiler shared_mem_executor.py psm_2440913a 28029738 1

The memory analysis of shared_mem_executer.py looked like this:

Filename: shared_mem_executor.py

Line #    Mem usage    Increment   Line Contents
================================================
    16     35.3 MiB     35.3 MiB   @profile(stream=fp)
    17                             def run_child_process():
    18     35.3 MiB      0.0 MiB       to_search_in = ["christopherbaker@hotmail.com", "mossjoseph@thompson-simpson.info"]
    19     35.4 MiB      0.1 MiB       shared_mem = shared_memory.SharedMemory(sys.argv[1])
    20     35.4 MiB      0.0 MiB       automaton_in_bytes = shared_mem.buf[:int(sys.argv[2])]
    21     88.7 MiB     53.3 MiB       automaton_in_bytes = io.BytesIO(automaton_in_bytes)
    22    126.9 MiB     38.2 MiB       A = pickle.load(automaton_in_bytes)
    23    126.9 MiB      0.0 MiB       total_iterations = 10000
    24    126.9 MiB      0.0 MiB       done_iterations = 0
    25    126.9 MiB      0.0 MiB       while done_iterations != total_iterations:
    26    126.9 MiB      0.0 MiB           for match in A.iter(to_search_in[random.randint(0, len(to_search_in)-1)]):
    27    126.9 MiB      0.0 MiB               r = match[1]
    28                                         #print(r)
    29    126.9 MiB      0.0 MiB           done_iterations += 1
    30    100.2 MiB      0.0 MiB       shared_mem.close()
    31    100.2 MiB      0.0 MiB       shared_mem.unlink()
    32    100.2 MiB      0.0 MiB       print(f'Done searching on process {sys.argv[3]}')

The usual ramp-ups in BytesIO and pickle appear. However, we start at 35 MiB, which is roughly the size of the Automaton. If the memory were really shared, should the child process start at 35 MiB? Maybe memory_profiler does not recognize the shared_memory object and counts it as non-shared memory. I am not really sure.

@Dobatymo What are your thoughts on this? Do you perhaps have another example of sharing the Automaton using the Python 3.8 shared memory?

@Dobatymo

Dobatymo commented Aug 3, 2020

I don't think the shared memory feature is of any help here. I don't know how the ahocorasick automaton works internally, but I know it must be a complex graph structure which cannot be read directly from a blob of bytes. Pickle does serialization (which involves a copy) of the data, and its result can be shared, but it cannot be used without deserializing (another copy) back into the format the library uses internally.
This is not like a numpy array, which can be used from a bunch of bytes easily. Supporting this functionality would probably require lots of changes from the pyahocorasick developer.

@OblackatO

OblackatO commented Aug 5, 2020

Supporting this functionality would probably require lots of changes from the pyahocorasick developer.

That is what I was expecting, but any other way of doing it is just not worth it IMHO.

The Automaton can still be shared between threads, which significantly reduces the programming complexity, but in CPU-bound scenarios that is inefficient and does not make sense.

There are of course other ways to share the Automaton between processes without using shared memory, for instance pipes. But those are too slow, the Automaton needs to be convertible to bytes anyway, and we would face the same serialization/deserialization problem I showed in the memory analysis above.

To have a real CPU-bound shared memory solution that is efficient and fast, multiprocessing.shared_memory is the way to go. Any other solution seems to me like a "hack" not worth doing.

@Dobatymo

Dobatymo commented Aug 6, 2020

a real CPU-bound shared memory solution

To be honest, you should be using a different language, not Python. Python is notoriously bad for this kind of stuff.

However, I can think of possible solutions, though both are pure speculation, as I don't actually know how pyahocorasick is implemented.
I assume it uses a pointer-based tree/graph structure in memory, so when loading/saving from/to disk it is converted to another kind of data structure which doesn't use pointers. That data structure, however, cannot in its current form implement the needed automaton methods. This means the memory model would have to be changed: instead of the pointer-based approach, pyahocorasick would need to work with an index-based layout (probably a linear data structure like a vector/list) and do "graph path finding" using relative indices instead of absolute pointers. This way it could operate on a byte blob of data which could be shared from Python. It would also enable pyahocorasick to mmap automaton files. I assume this would require a complete rewrite of the library.
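To make the index-based idea concrete, here is a toy sketch (not pyahocorasick's actual layout, and a plain trie rather than a full Aho-Corasick automaton): the whole structure is flattened into one array of integers where child links are offsets instead of pointers, so it serializes to a byte blob that could be traversed in place from a shared buffer.

```python
from array import array

# Each node is a run of ints in one flat array:
#   [is_terminal, n_children, char_0, child_offset_0, char_1, child_offset_1, ...]
# Child links are offsets into the same array -- no pointers anywhere.
def build_flat_trie(words):
    nodes = [{"term": False, "kids": {}}]
    for w in words:
        i = 0
        for ch in w:
            if ch not in nodes[i]["kids"]:
                nodes.append({"term": False, "kids": {}})
                nodes[i]["kids"][ch] = len(nodes) - 1
            i = nodes[i]["kids"][ch]
        nodes[i]["term"] = True
    # First pass: compute each node's offset in the flat array.
    offsets, pos = [], 0
    for n in nodes:
        offsets.append(pos)
        pos += 2 + 2 * len(n["kids"])
    flat = array("i")
    for n in nodes:
        flat.append(1 if n["term"] else 0)
        flat.append(len(n["kids"]))
        for ch, child in sorted(n["kids"].items()):
            flat.extend([ord(ch), offsets[child]])
    return flat

def contains(flat, word):
    off = 0  # the root node lives at offset 0
    for ch in word:
        n_kids = flat[off + 1]
        for k in range(n_kids):
            if flat[off + 2 + 2 * k] == ord(ch):
                off = flat[off + 3 + 2 * k]
                break
        else:
            return False
    return bool(flat[off])

flat = build_flat_trie(["he", "she"])
blob = flat.tobytes()  # the whole trie is now just bytes -- shareable as-is
restored = array("i")
restored.frombytes(blob)
print(contains(restored, "she"), contains(restored, "hers"))  # True False
```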

For the second approach, one could implement a thread pool in C code and share the automaton at the C level. Then a method could be exposed to Python to find words, which would use a queue to distribute the tasks to different threads (which can all access the same shared automaton).

But even then (for both approaches) there is still the issue with the reference counts. For that, probably a different data ownership model is needed (with explicit frees, perhaps?).

Again, I don't know if any of that makes sense, because I have not looked at the actual implementation (but I am familiar with the Aho-Corasick theory itself).

@WojciechMula maybe you can comment if this is accurate

@WojciechMula
Owner

First of all, thanks a lot for such a great explanation.

I fully agree with @Dobatymo. To make the data structure able to live in shared memory, we would need to rewrite everything from scratch. This is a new memory model. Likely we would end up with two libraries: a plain C one (unaware of Python), and a Python wrapper that interprets indices as Python objects.

TBH I don't see a good solution for @Salma7amed's problem using the existing module. I mean, there are two options: 1) true multiprocessing, but then we have data duplication, or 2) a single data structure residing inside a server which is queried from multiple processes.

@pombredanne
Collaborator

I use extensive multiprocessing with fairly large automatons (this is all public code). These are pickled/unpickled, and I surely could benefit from using less memory. :)

Could a memory-mapped file be another approach? That would require the on-disk and in-memory representations to be the same, which is likely not easy to achieve and would mean a completely different implementation.

@WojciechMula
Owner

While making the data structure portable (index-based rather than address-based) might not be that complex, making the stored values portable seems difficult.

Now, when you retrieve an associated value, only its reference counter gets incremented. Everything lives inside the same Python instance; this is why it's that simple. But what if we refer to a value from multiple processes? Then the trie would have to keep pickled values, and each use would re-materialize the Python object with unpickle. Sounds like a possible performance problem.
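A rough sketch of that trade-off (plain Python with made-up names, standing in for the trie's value storage): keep only pickled bytes in the shared structure and re-materialize the value on every lookup.

```python
import pickle

class PickledValueStore:
    """Stores values as pickle blobs, as a cross-process trie would have to."""
    def __init__(self):
        self._blobs = {}

    def add(self, key, value):
        self._blobs[key] = pickle.dumps(value)  # one-time serialization cost

    def get(self, key):
        # Every retrieval re-materializes a fresh object from the blob --
        # no shared reference counts, but repeated unpickling overhead.
        return pickle.loads(self._blobs[key])

store = PickledValueStore()
store.add("cat", {"species": "felis", "legs": 4})
a = store.get("cat")
b = store.get("cat")
print(a == b, a is b)  # True False -- equal values, but distinct objects
```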
