
Ax is not starting as many workers as I'd like to; sometimes, get_next_trials returns 0 new trials #2301

Open
NormanTUD opened this issue Mar 25, 2024 · 6 comments
Labels: fixready (Fix has landed on master), in progress, question (Further information is requested)

@NormanTUD (Contributor) commented Mar 25, 2024

Hi,

I really like Ax for optimizing hyperparameters. Based on it, I have written a tool for hyperparameter optimization, but I have stumbled upon a problem.

We use Slurm and submitit on our cluster and it all works fine, except for one thing: the number of parallel "workers" (i.e. the number of jobs running in parallel) hardly ever reaches the maximum specified in my script.

The problem lies in the ax_client.get_next_trials function. I run a loop like this:

new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
for m in range(0, new_jobs_needed):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)

I've tried max_trials=args.max_trials (coming from argparse) as well, but the behaviour is the same.

Sometimes, trial_index_to_param comes back empty, with 0 entries in it.
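
For context, the surrounding submission loop looks roughly like the sketch below (not my exact script; the executor settings and the evaluate function are placeholders for whatever runs one trial):

import submitit

# Sketch of the submission loop; "evaluate" stands in for the function that runs one trial.
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=args.worker_timeout, slurm_partition=args.partition)

new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
for m in range(new_jobs_needed):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
    for trial_index, parameters in trial_index_to_param.items():
        job = executor.submit(evaluate, parameters)
        jobs.append((job, trial_index))
        submitted_jobs += 1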

I've tried the following:

experiment_args = {
    "name": experiment_name,
    "parameters": experiment_parameters,
    "objectives": {"result": ObjectiveProperties(minimize=minimize_or_maximize)},
    "choose_generation_strategy_kwargs": {
        "num_trials": max_eval,
        "num_initialization_trials": args.num_parallel_jobs,
        "use_batch_trials": True,
        "max_parallelism_override": args.num_parallel_jobs
    },
}

experiment = ax_client.create_experiment(**experiment_args)

But still, the result coming from get_next_trials is sometimes empty, with 0 entries. As far as I can tell, setting use_batch_trials or not makes no difference here.
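
One thing that might help in diagnosing this is inspecting the parallelism limits of the chosen generation strategy, roughly like this (a sketch; I have not confirmed that these numbers explain the behaviour):

# Sketch: inspect the generation strategy and its parallelism limits.
print(ax_client.generation_strategy)      # which steps/models were chosen
print(ax_client.get_max_parallelism())    # e.g. [(num_trials, max_parallelism), ...]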

[Screenshot: number of completed jobs per 10-minute slot over time]

This is binned into 10-minute slots, and as you can see, at the beginning there are many completed jobs, almost 90 per 10-minute slot. But later on there are fewer and fewer, each time because the length of trial_index_to_param is 0.

Is there anything more I can do about this? How can I use the full number of parallel evaluations I specified?

Thanks!

Edit: I tried adding enforce_sequential_optimization=False to the choose_generation_strategy_kwargs, but that doesn't change anything either.

@mgarrard mgarrard self-assigned this Mar 29, 2024
@mgarrard mgarrard added question Further information is requested in progress labels Mar 29, 2024
@NormanTUD (Contributor, Author) commented Mar 31, 2024

https://github.com/NormanTUD/OmniOpt/tree/main/ax

Main script:

https://github.com/NormanTUD/OmniOpt/blob/main/ax/.omniopt.py

Maybe helpful for anyone looking into the environment the problem appears in: my general plan is to allow this:

./omniopt --partition=alpha --experiment_name=example --mem_gb=1 --time=60 --worker_timeout=60 --max_eval=500 --num_parallel_jobs=500 --gpus=1 --follow --run_program=ZWNobyAiUkVTVUxUOiAlKHBhcmFtKSI= --parameter param range 0 1000 float

and to run that optimization on our clusters, using Ax/BoTorch internally for hyperparameter optimization (the base64-encoded --run_program value decodes to echo "RESULT: %(param)"). We have basically unlimited resources for free (university) and want as many parallel workers as possible, to get as much out of the HPC system as possible when searching for good hyperparameters for any type of problem, or simply exploring those search spaces (depending on what your program does).

At the top of the code there is a large comment showing some of the things I have tried, though that list is anything but complete.

We would really appreciate your help with this.

Yours sincerely,

NormanTUD

@mgarrard (Contributor) commented Apr 1, 2024

Hi @NormanTUD! Thanks so much for engaging with our tool - happy to help. Could you provide the logs from AxClient for your experiment? These logs usually contain information about trial generation and the generation strategy that will help us debug the issue.

Also, good catch on "use_batch_trials" not having an effect. This code hasn't been open-sourced yet (hopefully soon!), so it isn't doing anything at this time. Let me add an error to make that clearer.

mgarrard added a commit to mgarrard/Ax that referenced this issue Apr 12, 2024
mgarrard added a commit to mgarrard/Ax that referenced this issue Apr 12, 2024
mgarrard added a commit to mgarrard/Ax that referenced this issue Apr 13, 2024
facebook-github-bot pushed a commit that referenced this issue Apr 15, 2024
Summary:
Pull Request resolved: #2355

This is a follow-up to #2301

The user was trying to use batch trials, but we don't currently expose this via AxClient, so we want to add an error to let users know this isn't having any effect.

Reviewed By: saitcakmak

Differential Revision: D56048665

fbshipit-source-id: 7dff08492e5cf52ab71579d9dcaac24beded4ff9
@mgarrard (Contributor)

@NormanTUD -- added a PR that raises an error when use_batch_trials is passed; it'll be live once we cut a new release :)

Let me know if you have the logs from AxClient for additional support. Thanks!

@NormanTUD (Contributor, Author) commented Apr 22, 2024

Hi,

thanks for your reply. I was on vacation and so didn't code anything, but I am now trying to gather all the logs. Thanks for your patience; I will update this post when I have them.

First a bit of my own debugging code:

Update #1:

trial_index_to_param, _ = ax_client.get_next_trials(
    max_trials=1
)

print_debug(f"Got {len(trial_index_to_param.items())} new items (m = {m}, in range(0, {calculated_max_trials})).")

These lines are only executed when new jobs need to be generated. For further testing, max_trials is set to 1 and the call runs inside a for loop, once per new job, instead of passing the number of new trials via max_trials directly. But sometimes I get this:

2024-03-26 11:14:13: Got 0 new items (m = 0, in range(0, 33)).

So it just returns 0 jobs.

These are the number of workers over time:

17
7
5
8

(No timestamps are given there; the count is taken in each generation loop.)

It should be around ~20, so 17 is fine as a snapshot while the jobs are starting up, but over time it drops much lower.

The only message from Ax that seems relevant is this:

ax.models.torch.botorch_modular.acquisition: 
Encountered Xs pending for some Surrogates but observed for others. Considering 
these points to be pending.

@mgarrard mgarrard added the fixready Fix has landed on master. label Apr 22, 2024
@NormanTUD (Contributor, Author) commented Apr 29, 2024

I've seen the "fixready" tag and installed the latest version (via pip/GitHub). I cannot see any change in behaviour; it looks exactly like before. I am not entirely sure whether this tag means the fix is already on master, but if it is, it hasn't changed anything for me.

The problem seems to be that generation_node.generator_run_limit() returns 0, even though it shouldn't. I am not sure why yet, though.

Edit: I debugged it a bit more. With 30 workers in parallel, I get the following, and the function therefore returns 0:

generation_node.generator_run_limit: criterion = MaxTrials({'threshold': 30, 'only_in_statuses': None, 'not_in_statuses': [<TrialStatus.FAILED: 2>, <TrialStatus.ABANDONED: 5>], 'transition_to': 'GenerationStep_1', 'block_transition_if_unmet': True, 'block_gen_if_met': True}), this_threshold: 0
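
My reading of this (an assumption on my part, not verified against the Ax source in detail): the MaxTrials criterion counts the trials produced by this node whose status is not FAILED or ABANDONED, and num_till_threshold works out to roughly

    num_till_threshold = threshold - count(matching trials) = 30 - 30 = 0

so once 30 non-failed, non-abandoned trials exist, the node refuses to generate more.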

I changed the function to this in modelbridge/generation_node.py:

    def generator_run_limit(self, supress_generation_errors: bool = True) -> int:
        """How many generator runs can this generation strategy generate right now,
        assuming each one of them becomes its own trial. Only considers
        `transition_criteria` that are TrialBasedCriterion.

        Returns:
              - the number of generator runs that can currently be produced, with -1
                meaning unlimited generator runs,
        """
        # TODO @mgarrard remove filter when legacy usecases are updated
        valid_criterion = []
        for criterion in self.transition_criteria:
            if criterion.criterion_class not in {
                "MinAsks",
                "RunIndefinitely",
            }:
                myprint(f"generator_run_limit: adding class {criterion.criterion_class} to criterion")
                valid_criterion.append(criterion)

        myprint(f"generator_run_limit: valid_criterion: {valid_criterion}")
        # TODO: @mgarrard Should we consider returning `None` if there is no limit?
        # TODO:@mgarrard Should we instead have `raise_generation_error`? The name
        # of this method doesn't suggest that it would raise errors by default, since
        # it's just finding out the limit according to the name. I know we want the
        # errors in some cases, so we could call the flag `raise_error_if_cannot_gen` or
        # something like that : )
        trial_based_gen_blocking_criteria = [
            criterion
            for criterion in valid_criterion
            if criterion.block_gen_if_met and isinstance(criterion, TrialBasedCriterion)
        ]
        """
        gen_blocking_criterion_delta_from_threshold = [
            criterion.num_till_threshold(
                experiment=self.experiment, trials_from_node=self.trials_from_node
            )
            for criterion in trial_based_gen_blocking_criteria
        ]
        """

        gen_blocking_criterion_delta_from_threshold = []

        for criterion in trial_based_gen_blocking_criteria:
            this_threshold = criterion.num_till_threshold(
                experiment=self.experiment, trials_from_node=self.trials_from_node
            )

            myprint(f"generator_run_limit: criterion = {criterion}, this_threshold: {this_threshold}")

            gen_blocking_criterion_delta_from_threshold.append(this_threshold)

        myprint(f"generator_run_limit: gen_blocking_criterion_delta_from_threshold: {gen_blocking_criterion_delta_from_threshold}")

        # Raise any necessary generation errors: for any met criterion,
        # call its `block_continued_generation_error` method The method might not
        # raise an error, depending on its implementation on given criterion, so the
        # error from the first met one that does block continued generation, will be
        # raised.
        if not supress_generation_errors:
            for criterion in trial_based_gen_blocking_criteria:
                # TODO[mgarrard]: Raise a group of all the errors, from each gen-
                # blocking transition criterion.
                if criterion.is_met(
                    self.experiment, trials_from_node=self.trials_from_node
                ):
                    criterion.block_continued_generation_error(
                        node_name=self.node_name,
                        model_name=self.model_to_gen_from_name,
                        experiment=self.experiment,
                        trials_from_node=self.trials_from_node,
                    )
        if len(gen_blocking_criterion_delta_from_threshold) == 0:
            if not self.gen_unlimited_trials:
                logger.warning(
                    "Even though this node is not flagged for generation of unlimited "
                    "trials, there are no generation blocking criterion, therefore, "
                    "unlimited trials will be generated."
                )
            myprint("generator_run_limit: returning -1 (no limit)")
            return -1
        res = min(gen_blocking_criterion_delta_from_threshold)
        myprint(f"generator_run_limit: returning res {res}")
        return res

myprint just adds the filename in front of it, so I can debug it more easily.
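
For completeness, a helper like that could look roughly like this (a sketch of what I mean, not necessarily the exact implementation):

import inspect
import os

def myprint(msg: str) -> None:
    # Prepend the calling file's name so debug output is easy to trace.
    caller = inspect.stack()[1]
    print(f"{os.path.basename(caller.filename)}: {msg}")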

I am not sure why some trials have failed, nor why some are abandoned, but in the end the this_threshold variable gives me 0, and that gets chosen as the number of new parameter sets to be created.

I also tried monkey patching it:

from unittest.mock import patch

def patched_generator_run_limit(*args, **kwargs):
    return 1

with patch('ax.modelbridge.generation_node.GenerationNode.generator_run_limit', new=patched_generator_run_limit):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)

around the get_next_trials(max_trials=1) call, but then I get this exception:

All trials for current model have been generated, but not enough data has been
observed to fit next model. Try again when more data are available.

Adding min_trials_observed=1 to the model=Models.BOTORCH_MODULAR step in the GenerationStrategy didn't help; the error didn't go away.
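
For reference, the kind of explicit generation strategy step I mean looks roughly like the sketch below (the concrete numbers are placeholders, not the values from my script):

from ax.service.ax_client import AxClient
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models

# Sketch: an explicit generation strategy with relaxed parallelism limits.
gs = GenerationStrategy(
    steps=[
        GenerationStep(
            model=Models.SOBOL,
            num_trials=30,           # quasi-random initialization trials
            max_parallelism=30,      # allow all of them to run at once
        ),
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,           # no limit on model-based trials
            max_parallelism=30,      # cap on concurrently running BO trials
            min_trials_observed=1,   # require one observed trial before fitting
        ),
    ]
)

ax_client = AxClient(generation_strategy=gs)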

Is there anything else I may provide?

Yours sincerely,

NormanTUD

@NormanTUD (Contributor, Author) commented May 8, 2024

I have made a breakthrough regarding the reason why I wasn't getting as many workers!

When a job failed, I needed to do:

_trial = ax_client.get_trial(trial_index)
_trial.mark_failed()
ax_client.log_trial_failure(trial_index=trial_index)

and when it succeeded I needed to do:

_trial.mark_completed(unsafe=True)

This way, Ax knows which jobs have finished (or failed), and it no longer blocks the generation of new points because of max_parallelism.

This was, admittedly, previously unclear to me.
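
For comparison, reporting outcomes via the AxClient-level calls would look roughly like the sketch below (an assumption on my side: submitit job objects held in a list of (job, trial_index) pairs, and an objective named "result"):

for job, trial_index in list(jobs):
    if not job.done():
        continue
    try:
        result = job.result()  # raises if the Slurm job failed
        ax_client.complete_trial(trial_index=trial_index, raw_data={"result": result})
    except Exception:
        ax_client.log_trial_failure(trial_index=trial_index)
    jobs.remove((job, trial_index))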

Now it finally works pretty much as I like it :)
