
Ambiguity of xfel balance worker #856

Open · Baharis opened this issue Mar 13, 2023 · 6 comments

Baharis (Contributor) commented Mar 13, 2023

Introduction

While working with a large dataset I discovered that the balance worker can act in very unexpected and undesired ways. I'd like to discuss some of my issues and suggest potential improvements. Given how self-contained the code is, I would gladly take care of the "problems" myself, but first I would like to discuss which of my issues are actual problems.

Issue 1: Balance worker, version global1, can be grossly inaccurate

The default balancing algorithm, selected via the phil parameter input.parallel_file_load.balance = global1, does not require any information to be shared between ranks. Every rank stripes its own experiment list and assigns the n-th stripe to the n-th rank:

26 Expts: [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
 (stripe)  ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
 5 Ranks: [01234012340123401234012340]

This works well as long as #ranks << #expts. In the example above, with 5 ranks each holding 26 experiments, rank 0 would receive 6 experiments from every source rank for a total of 30, while every other rank would receive 25. It gets worse as the #expts/#ranks ratio decreases. In particular, if a rank does not have enough experiments for all other ranks, it reshuffles the sliced lists. As a result, whether every rank receives any reflections at all is left to sheer luck; at least I was not able to find any other safety mechanism.

 5 Expts: [ABCDE]
 (stripe)  ↓↓↓↓↓
10 Ranks: [01234]
 (shuffle) ↓↓↓↓↓
10 Ranks: [72841]

Lately I have been using very low expt/rank ratios and the problem really shows. Even if none of the ranks has been assigned 0 expts (which can happen!), one can end up with something like this:

Rebalancing input load -- global1 method...
Experiments: (total,min,max): 951, 2, 19
Images:      (total,min,max): 951, 2, 19
Reflections: (total,min,max): 4900877, 3769, 103899

The simplest way to solve this issue is to let rank 0 collect len(experiments) from every rank. Using the cumulative sums of these lengths, every rank could take into account which rank it should assign to first. The end product would be essentially identical to striping a single concatenated list of all experiments from all ranks, and min and max would never differ by more than 1.
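
A minimal sketch of the idea, using allgather so every rank learns all lengths directly (equivalent to gathering at rank 0 and redistributing); the communicator `comm`, the list `experiments`, and the helper name `assign_stripe_destinations` are assumptions, not the current worker's API:

```python
# Hypothetical sketch of globally consistent striping (Issue 1); not the
# current worker code.
def assign_stripe_destinations(comm, experiments):
    """Return a destination rank for each local experiment, as if the
    experiments of all ranks were concatenated and striped globally."""
    n_ranks = comm.Get_size()
    lengths = comm.allgather(len(experiments))   # every rank learns every length
    offset = sum(lengths[:comm.Get_rank()])      # cumulative sum = my global start index
    # Experiment with global index i goes to rank i % n_ranks, so per-rank
    # counts can never differ by more than 1.
    return [(offset + i) % n_ranks for i in range(len(experiments))]
```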

Issue 2: Balance worker, version global1, alltoall slicing could be automated

Due to implementation reasons, mpi4py's alltoall method raises an OverflowError when the data to be redistributed exceeds 2,147,483,647 in size. Although neither my experiment list nor my reflection list is that long, this error is raised. The worker has a mechanism to protect against it: it can execute alltoall in several "slices". The number of slices is controlled via input.parallel_file_load.balance_mpi_alltoall_slices. However, I believe the need for this parameter could easily be eliminated by calling exchange_reflections_by_alltoall_sliced recursively with 1, 10, 100, ... slices whenever the previous attempt raises OverflowError.
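
A minimal sketch of that retry loop; the wrapper name is hypothetical, the call signature of exchange_reflections_by_alltoall_sliced is assumed for illustration (the real method's signature may differ), and in practice all ranks would have to agree on the slice count (e.g. via an allreduce) before retrying:

```python
# Hypothetical wrapper for Issue 2: retry with 1, 10, 100, ... slices until
# alltoall fits under the size limit.
def exchange_with_auto_slicing(comm, reflections, max_slices=1000):
    number_of_slices = 1
    while True:
        try:
            return exchange_reflections_by_alltoall_sliced(
                comm, reflections, number_of_slices)
        except OverflowError:
            if number_of_slices >= max_slices:
                raise                    # give up instead of looping forever
            number_of_slices *= 10       # retry with more, smaller slices
```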

Issue 3: Balance worker, version global2, can be grossly inefficient

Since global1 is so inaccurate, I switched to global2. This version limits rank-to-rank communication by sending only those experiments that actually need to be transferred. It balances 100% accurately, but uses an essentially recursive algorithm of high complexity: move experiments one by one until everything is balanced. The same level of balance could be achieved more simply by creating two lists, "[rank for _ in experiments]" and "np.linspace(0, #ranks, #expts, dtype=int)" (example), and mapping one onto the other to get the result, as illustrated in the diagram and the sketch below. The end product might be slightly different and require marginally more communication, but the complexity would be essentially O(1). The question is: is it better to wait 2 minutes for rank 0 to finish, or to allow up to #ranks more experiments to be passed around? Update: the slow step is mpi4py's alltoall, not the balancing algorithm; see the comments below.

List 1: [0000012333]
         ↓↓↓↓↓↓↓↓↓↓
List 2: [0001122233]
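
A minimal sketch of this mapping for the toy case in the diagram above (4 ranks holding 5, 1, 1 and 3 experiments); variable names are illustrative, and endpoint=False is needed for np.linspace to stay within valid rank numbers:

```python
import numpy as np

n_ranks = 4
counts = [5, 1, 1, 3]              # experiments currently held by each rank
n_expts = sum(counts)

# List 1: current owner of each experiment; List 2: balanced target owner.
current_owner = np.repeat(np.arange(n_ranks), counts)      # [0 0 0 0 0 1 2 3 3 3]
target_owner = np.linspace(0, n_ranks, n_expts,
                           endpoint=False, dtype=int)      # [0 0 0 1 1 2 2 2 3 3]

# Mapping one list onto the other tells every experiment where it should move.
transfers = [(int(src), int(dst))
             for src, dst in zip(current_owner, target_owner) if src != dst]
print(transfers)                   # [(0, 1), (0, 1), (1, 2), (3, 2)]
```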

Issue 4: Balance worker, version global2, does not accept alltoall slicing at all

For an introduction, see Issues 2 and 3. While the global1 balance worker has a safety mechanism to protect against long experiment and reflection lists, global2's behavior is currently completely unaffected by the input.parallel_file_load.balance_mpi_alltoall_slices parameter. Even though global2 is designed to minimize data transfer between nodes, it still uses alltoall to pass this information around and raises OverflowError when the input is too large. It would therefore be convenient to introduce the alltoall slicing capability to global2 as well.
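
A generic sketch of what sliced alltoall could look like, independent of the worker's internal data structures; it assumes `send` is a plain Python list with one picklable, sliceable payload per destination rank, whereas the real implementation would operate on flex reflection tables and experiment lists:

```python
# Hypothetical generic sliced alltoall (Issue 4).
def alltoall_sliced(comm, send, number_of_slices):
    """Exchange send[i] with rank i in number_of_slices rounds so that no
    single alltoall call has to move the whole payload at once."""
    received = [[] for _ in range(comm.Get_size())]
    for s in range(number_of_slices):
        # In round s, send only every number_of_slices-th element of each payload.
        chunk = [payload[s::number_of_slices] for payload in send]
        for source, part in enumerate(comm.alltoall(chunk)):
            received[source].extend(part)
    return received
```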

Issue 5: Balance worker, version per_node, is... undefined

The balance worker, version per_node, uses a method distribute_over_ranks to pass around experiments and reflections. The problem is that in the current implementation this method is completely undefined, so this version does not work at all. This might be a typo, but the file was last modified half a year ago, and the fact that this has remained unnoticed for so long suggests the version isn't really used.

Issue 6: Balance worker has two names

I keep referring to the worker in question as balance, but in fact I am unsure whether it is called balance or load_balancer. The filenames are ambiguous, the Python code leans towards load_balancer, but the phil files exclusively use the term balance. I think it would be nice to settle on one name for clarity and maintainability.

Issue 7: Does anyone else need sorting capability in balance?

For my project I need to sort my experiments by experiment.imageset.path so that each rank receives as few imagesets as possible. I can achieve this by writing a custom sort worker dispatched before balance version global2, but that would require the expts to be passed around twice. What is your opinion about adding this functionality to balance instead?
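
Roughly, the sorting step I have in mind amounts to the sketch below; the function name and the imageset-path accessor are assumptions (I referred to experiment.imageset.path above, and the exact dxtbx accessor may differ), and the reflections would have to be reordered consistently, which the sketch does not show:

```python
# Hypothetical sorting step for Issue 7: order experiments by imageset path so
# that consecutive stripes share as few imagesets as possible.
def sort_by_imageset_path(experiments, reflections):
    order = sorted(range(len(experiments)),
                   key=lambda i: experiments[i].imageset.get_path(0))
    experiments_sorted = [experiments[i] for i in order]
    # The reflections would have to be reordered consistently with the
    # experiments; how exactly depends on the reflection table layout, so it
    # is left out of this sketch.
    return experiments_sorted, reflections
```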

Baharis (Contributor, Author) commented Mar 14, 2023

This is all the data I have on global2 execution times right now. Unfortunately it looks like I almost never ran timing; however, since the total time seems to depend mostly on the uniformity of the input expts and not on the size or number of the transferred files, I came to assume that the execution time is mostly due to the distribution algorithm. All jobs below were performed on Perlmutter's AMD EPYC 7763 nodes with at least 2 processors per rank.

Job ID  | # Expts (total / min / max) | # Ranks | LB_SPLIT_LIST step time | balance total time
3641230 | 555 / 0 / 10                | 512     | 6 ms                    | 1 s *
5971998 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5972076 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5973167 | 3044 / 0 / 29               | 192     | N/A                     | 1 s **
5975458 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5975175 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5962589 | 3945 / 0 / 564              | 320     | 5 ms                    | N/A
5690290 | 36077 / 256 / 849           | 64      | N/A                     | 24 s
5310385 | 41356 / 0 / 2663            | 64      | N/A                     | 812 s
5312217 | 41356 / 0 / 2663            | 64      | N/A                     | 813 s
5652693 | 109469 / 6 / 7326           | 64      | N/A                     | 950 s
5668346 | 109469 / 6 / 7326           | 64      | N/A                     | 945 s
5658463 | 109469 / 6 / 7326           | 64      | N/A                     | 950 s

  • Marked with *: oddly enough, global2 did not balance correctly, i.e. max – min = 2 post-balance;
  • Marked with **: in these jobs each refl takes ~20x more memory, potentially increasing transfer times.

Please don't ask why I ran a job with 555 expts on 512 ranks – it seems my goals are beyond anyone's understanding, including my own (・・。)ゞ

Baharis (Contributor, Author) commented Mar 14, 2023

@irisdyoung I tested the hypothesis I put forward in Issue 3 and realized that the algorithm is by no means recursive, and is much smarter and faster than I had initially understood. Since the fault is not in this part, why is global2 so slow?

irisdyoung (Contributor) commented:

That's good news from my perspective! I will add some timing info and hopefully get more information. It could be how the experiment and reflection sets are being sliced up (before passing them around, but after all ranks have received their instructions), or how they're aggregated after. I can try removing either of those steps (producing a nonfunctional algorithm; just for testing) to find out.

Baharis (Contributor, Author) commented Mar 14, 2023

@irisdyoung Thanks, that's a thing to check. I started some test jobs with timing, so I might look into it.

phyy-nx (Contributor) commented Mar 14, 2023

If it helps, you can run it with debug.cProfile=True, then use a profiler on one of the per-rank .prof files to see what is going on.
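
For example, the per-rank .prof files can be loaded with the standard-library pstats module; the filename below is just a placeholder:

```python
import pstats

# Load one of the per-rank profiles written with debug.cProfile=True
# (the actual filename will differ).
stats = pstats.Stats("rank_0.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 calls by cumulative time
```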

Baharis (Contributor, Author) commented Mar 14, 2023

OK, after some timing I have reached some interesting conclusions:

  • I cannot reproduce the execution times of the 14-minute jobs; each now takes 30 seconds instead of 15 minutes. I will attribute this difference to Perlmutter instability;
  • I can consistently reproduce the execution times of the 16-minute jobs:
    • The majority of the balance execution time (over 935 seconds) is spent in the LB_EXPTS_AND_REFLS_ALLTOALL step; the LB_SPLIT_LIST step which I initially blamed takes below 1 ms;
    • Out of these 935 seconds in LB_EXPTS_AND_REFLS_ALLTOALL, 934 are spent in mpi4py.MPI.Comm.alltoall.
  • The same 16-minute balance job with global1 instead of global2 takes... 13 seconds.

So it looks like the grossly inefficient part here is alltoall, and for some reason the alltoall call in global1 does not have the same issue as in global2. The only difference between them, as far as I'm aware, is distributing expts and refls separately, so maybe I'll try that. Nevertheless, it seems this problem could easily be solved by performing alltoall in steps (see Issue 4), as alltoall seemingly has issues with very long input.

EDIT: Distributing expts and refls separately does not solve the issue; both steps in total take exactly the same time.
