temp_folder and mmap_mode parameters in Parallel #1373

Open
PJPRoche opened this issue Dec 20, 2022 · 2 comments

Comments

@PJPRoche

I encountered an error as I increased the number of rows in my training set and have isolated the issue to the triggering of the automatic memory mapping in the joblib Parallel class. I think the issue is caused by the write permissions on the folder where Parallel is looking.

I can see from the documentation (https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html) that setting JOBLIB_TEMP_FOLDER as an environment variable (I am using JOBLIB_TEMP_FOLDER=/tmp) is one way to specify where to share memory with worker processes. However, setting this environment variable alone does not seem to be sufficient. I have also tried passing temp_folder="/tmp" directly when instantiating Parallel, but that gives the same error.

However, what does seem to work is to specify both the temp_folder and mmap_mode options, but only with mmap_mode set to "r+" or "w".

THIS DOES NOT WORK
Parallel(n_jobs=-1, temp_folder="/tmp")(processes)

THIS DOES WORK
Parallel(n_jobs=-1, temp_folder="/tmp", mmap_mode="r+")(processes)

Questions:

  1. Is it correct that both the temp_folder and mmap_mode need to be set together when there is no existing file?
  2. If so, is this not problematic when joblib Parallel is a dependency of another third-party module? This is the situation I have and means I cannot edit the input parameters going into Parallel(). I can only set the environment variable, but if that is insufficient by itself to set what is required, then I don't see how to make this work ...

Original value error output

ValueError: assignment destination is read-only
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
    r = call_item()
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-2d8ce7d7-4206-43c2-9fd3-0aad944e643a/lib/python3.8/site-packages/ctgan/data_transformer.py", line 112, in _transform_continuous
    data[column_name] = data[column_name].to_numpy().flatten()
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3163, in __setitem__
    self._set_item(key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3243, in _set_item
    NDFrame._set_item(self, key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3832, in _set_item
    NDFrame._iset_item(self, loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3821, in _iset_item
    self._mgr.iset(loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1110, in iset
    blk.set_inplace(blk_locs, value_getitem(val_locs))
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 363, in set_inplace
    self.values[locs] = values
ValueError: assignment destination is read-only

@ogrisel
Contributor

ogrisel commented Feb 18, 2023

Is it correct that both the temp_folder and mmap_mode need to be set together when there is no existing file?

No, they have different purposes. temp_folder makes it possible to choose where the temporary memory-mapped files will be created. By default it uses a shared memory folder (I think it's /dev/shm on Linux, for instance).

mmap_mode sets the mode in which the memory-mapped arguments are opened in the workers. You should probably never use mmap_mode="r+" because it allows one worker to corrupt the input data of another worker processing the same argument concurrently.
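
To make the difference between the modes concrete, here is a minimal sketch that uses plain NumPy memory maps rather than joblib itself. The file name and the tiny array are made up for illustration; the point is only that "r" rejects writes with the same message as in the traceback above, "c" keeps writes private to the process, and "r+" writes through to the shared file.

```python
import os
import shutil
import tempfile

import numpy as np

# Illustrative scratch file; joblib would create something similar in temp_folder.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.npy")
np.save(path, np.zeros(4))

# "r": read-only mapping, any in-place write raises ValueError.
ro = np.load(path, mmap_mode="r")
try:
    ro[0] = 1.0
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# "c": copy-on-write, the write succeeds but stays private to this mapping.
cow = np.load(path, mmap_mode="c")
cow[0] = 1.0
on_disk_after_c = np.load(path)

# "r+": the write goes to the file that every other reader shares.
rw = np.load(path, mmap_mode="r+")
rw[0] = 2.0
rw.flush()
on_disk_after_rplus = np.load(path)

# Drop the mappings before removing the scratch directory.
del ro, cow, rw
shutil.rmtree(tmpdir, ignore_errors=True)
```

With "r+", the last write is visible to anything else that maps the same file, which is why concurrent workers can corrupt each other's inputs.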

Instead, I would advise one of the following:

  • use mmap_mode="c" (copy-on-write shared memory), but this is known to crash from time to time on Windows
  • change your code to never do in-place modification of the input arguments of the parallel function.
  • disable memory mapping entirely (max_nbytes=None) if you do not expect to pass the same large data arguments to different iterations of the parallel function (in which case you won't save any memory by using the automatic memory mapping feature of joblib).
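
The three options above can be sketched as follows. This is an illustrative example, not code from the issue: the function names and the random array are hypothetical, and the array is sized at about 1.6 MB so that it exceeds joblib's default max_nbytes threshold of 1M and would otherwise be passed to workers as a read-only memory map.

```python
import numpy as np
from joblib import Parallel, delayed


def center_inplace(x):
    # Modifies its argument in place; fails on a read-only memory map.
    x -= x.mean()
    return float(abs(x.sum()))


def center_copy(x):
    # Option 2: work on a private, writable copy instead of the input.
    x = np.array(x, copy=True)
    x -= x.mean()
    return float(abs(x.sum()))


data = np.random.default_rng(0).normal(size=200_000)
original = data.copy()

# Option 1: copy-on-write memory mapping, writes stay local to each worker.
r1 = Parallel(n_jobs=2, mmap_mode="c")(
    delayed(center_inplace)(data) for _ in range(2)
)

# Option 2: keep the default read-only mapping, but never write to the input.
r2 = Parallel(n_jobs=2)(delayed(center_copy)(data) for _ in range(2))

# Option 3: disable automatic memory mapping, arguments are pickled instead.
r3 = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(center_inplace)(data) for _ in range(2)
)
```

In all three cases the array held by the parent process is left untouched; only the worker-side view or copy is modified.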

@ogrisel
Contributor

ogrisel commented Feb 18, 2023

If so, is this not problematic when joblib Parallel is a dependency of another third-party module? This is the situation I have and means I cannot edit the input parameters going into Parallel(). I can only set the environment variable, but if that is insufficient by itself to set what is required, then I don't see how to make this work ...

Yes, it is, and this is being fixed in #1392.
