
Ambiguity of xfel balance worker #856

Open · Baharis opened this issue Mar 13, 2023 · 6 comments

Baharis (Contributor) commented Mar 13, 2023

Introduction

While working with a large dataset I discovered that the balance worker can act in very unexpected and undesired ways. I'd like to discuss some of my issues and suggest potential improvements. Given how self-contained the code is, I would gladly take care of the "problems" myself, but first I would like to discuss which of my issues are actual problems.

Issue 1: Balance worker, version global1, can be grossly inaccurate

The default balancing algorithm, selected via the phil parameter input.parallel_file_load.balance = global1, does not require any information to be shared between ranks. Every rank stripes its own experiment list and assigns the n-th stripe to the n-th rank:

26 Expts: [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
 (stripe)  ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
 5 Ranks: [01234012340123401234012340]

This works well as long as #ranks << #expts. In the example above, with 5 ranks each holding 26 experiments, rank 0 would receive 6 experiments from every source rank for a total of 30, while every other rank would receive 25. It gets worse as the #expts/#ranks ratio decreases. In particular, if a rank does not have enough experiments for all other ranks, it reshuffles the sliced lists. As a result, whether every rank receives any reflections at all is left to sheer luck; at least I was not able to find any other safety mechanism.

 5 Expts: [ABCDE]
 (stripe)  ↓↓↓↓↓
10 Ranks: [01234]
 (shuffle) ↓↓↓↓↓
10 Ranks: [72841]

Lately I have been using very low expt/rank ratios and the problem really shows. Even if none of the ranks has been assigned 0 expts (which can happen!), one can end up with something like this:

Rebalancing input load -- global1 method...
Experiments: (total,min,max): 951, 2, 19
Images:      (total,min,max): 951, 2, 19
Reflections: (total,min,max): 4900877, 3769, 103899

The simplest way to solve this issue is to let rank 0 collect len(experiments) from every rank. Using the cumulative sums of these lengths, every rank could take into account which rank it should assign to first. The end product would be essentially identical to striping a single concatenated list of all experiments from all ranks, and min and max would never differ by more than 1.
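
A minimal sketch of the idea, using allgather so every rank learns all lengths directly (equivalent to gathering at rank 0 and redistributing); the communicator `comm`, the list `experiments`, and the helper name `assign_stripe_destinations` are assumptions, not the current worker's API:

```python
# Hypothetical sketch of globally consistent striping (Issue 1); not the
# current worker code.
def assign_stripe_destinations(comm, experiments):
    """Return a destination rank for each local experiment, as if the
    experiments of all ranks were concatenated and striped globally."""
    n_ranks = comm.Get_size()
    lengths = comm.allgather(len(experiments))   # every rank learns every length
    offset = sum(lengths[:comm.Get_rank()])      # cumulative sum = my global start index
    # Experiment with global index i goes to rank i % n_ranks, so per-rank
    # counts can never differ by more than 1.
    return [(offset + i) % n_ranks for i in range(len(experiments))]
```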

Issue 2: Balance worker, version global1, alltoall slicing could be automated

Due to implementation reasons, mpi4py's alltoall method raises an OverflowError when the data to be redistributed exceeds 2,147,483,647 in size. Although neither my experiment list nor my reflection list is that long, this error is raised. The worker has a mechanism to protect against it: it can execute alltoall in several "slices". The number of slices is controlled via input.parallel_file_load.balance_mpi_alltoall_slices. However, I believe the need for this parameter could easily be eliminated by calling exchange_reflections_by_alltoall_sliced recursively with 1, 10, 100, ... slices whenever the previous attempt raises OverflowError.
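
A minimal sketch of that retry loop; the wrapper name is hypothetical, the call signature of exchange_reflections_by_alltoall_sliced is assumed for illustration (the real method's signature may differ), and in practice all ranks would have to agree on the slice count (e.g. via an allreduce) before retrying:

```python
# Hypothetical wrapper for Issue 2: retry with 1, 10, 100, ... slices until
# alltoall fits under the size limit.
def exchange_with_auto_slicing(comm, reflections, max_slices=1000):
    number_of_slices = 1
    while True:
        try:
            return exchange_reflections_by_alltoall_sliced(
                comm, reflections, number_of_slices)
        except OverflowError:
            if number_of_slices >= max_slices:
                raise                    # give up instead of looping forever
            number_of_slices *= 10       # retry with more, smaller slices
```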

Issue 3: Balance worker, version global2, can be grossly inefficient

Since global1 is so inaccurate, I switched to global2. This version limits rank-to-rank communication by sending only those experiments that actually need to be transferred. It balances 100% accurately, but uses an essentially recursive algorithm of high complexity: move experiments one by one until everything is balanced. The same level of balance could be achieved more simply by creating two lists, "[rank for _ in experiments]" and "np.linspace(0, #ranks, #expts, dtype=int)" (example), and mapping one onto the other to get the result, as illustrated in the diagram and the sketch below. The end product might be slightly different and require marginally more communication, but the complexity would be essentially O(1). The question is: is it better to wait 2 minutes for rank 0 to finish, or to allow up to #ranks more experiments to be passed around? Update: the slow step is mpi4py's alltoall, not the balancing algorithm; see the comments below.

List 1: [0000012333]
         ↓↓↓↓↓↓↓↓↓↓
List 2: [0001122233]
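
A minimal sketch of this mapping for the toy case in the diagram above (4 ranks holding 5, 1, 1 and 3 experiments); variable names are illustrative, and endpoint=False is needed for np.linspace to stay within valid rank numbers:

```python
import numpy as np

n_ranks = 4
counts = [5, 1, 1, 3]              # experiments currently held by each rank
n_expts = sum(counts)

# List 1: current owner of each experiment; List 2: balanced target owner.
current_owner = np.repeat(np.arange(n_ranks), counts)      # [0 0 0 0 0 1 2 3 3 3]
target_owner = np.linspace(0, n_ranks, n_expts,
                           endpoint=False, dtype=int)      # [0 0 0 1 1 2 2 2 3 3]

# Mapping one list onto the other tells every experiment where it should move.
transfers = [(int(src), int(dst))
             for src, dst in zip(current_owner, target_owner) if src != dst]
print(transfers)                   # [(0, 1), (0, 1), (1, 2), (3, 2)]
```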

Issue 4: Balance worker, version global2, does not accept alltoall slicing at all

For an introduction, see Issues 2 and 3. While the global1 balance worker has a safety mechanism to protect against long experiment and reflection lists, global2's behavior is currently completely unaffected by the input.parallel_file_load.balance_mpi_alltoall_slices parameter. Even though global2 is designed to minimize data transfer between nodes, it still uses alltoall to pass this information around and raises OverflowError when the input is too large. It would therefore be convenient to introduce the alltoall slicing capability to global2 as well.
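
A generic sketch of what sliced alltoall could look like, independent of the worker's internal data structures; it assumes `send` is a plain Python list with one picklable, sliceable payload per destination rank, whereas the real implementation would operate on flex reflection tables and experiment lists:

```python
# Hypothetical generic sliced alltoall (Issue 4).
def alltoall_sliced(comm, send, number_of_slices):
    """Exchange send[i] with rank i in number_of_slices rounds so that no
    single alltoall call has to move the whole payload at once."""
    received = [[] for _ in range(comm.Get_size())]
    for s in range(number_of_slices):
        # In round s, send only every number_of_slices-th element of each payload.
        chunk = [payload[s::number_of_slices] for payload in send]
        for source, part in enumerate(comm.alltoall(chunk)):
            received[source].extend(part)
    return received
```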

Issue 5: Balance worker, version per_node, is... undefined

The balance worker, version per_node, uses a method distribute_over_ranks to pass around experiments and reflections. The problem is that in the current implementation this method is completely undefined, so this version does not work at all. This might be a typo, but the file was last modified half a year ago, and the fact that this has remained unnoticed for so long suggests the version isn't really used.

Issue 6: Balance worker has two names

I keep referring to the worker in question as balance, but in fact I am unsure whether it is called balance or load_balancer. The filenames are ambiguous, the Python code leans towards load_balancer, but the phil files exclusively use the term balance. I think it would be nice to settle on one name for clarity and maintainability.

Issue 7: Does anyone else need sorting capability in balance?

For my project I need to sort my experiments by experiment.imageset.path so that each rank receives as few imagesets as possible. I can achieve this by writing a custom sort worker dispatched before balance version global2, but that would require the expts to be passed around twice. What is your opinion about adding this functionality to balance instead?
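
Roughly, the sorting step I have in mind amounts to the sketch below; the function name and the imageset-path accessor are assumptions (I referred to experiment.imageset.path above, and the exact dxtbx accessor may differ), and the reflections would have to be reordered consistently, which the sketch does not show:

```python
# Hypothetical sorting step for Issue 7: order experiments by imageset path so
# that consecutive stripes share as few imagesets as possible.
def sort_by_imageset_path(experiments, reflections):
    order = sorted(range(len(experiments)),
                   key=lambda i: experiments[i].imageset.get_path(0))
    experiments_sorted = [experiments[i] for i in order]
    # The reflections would have to be reordered consistently with the
    # experiments; how exactly depends on the reflection table layout, so it
    # is left out of this sketch.
    return experiments_sorted, reflections
```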

Baharis (Contributor, Author) commented Mar 14, 2023

This is all the data I have on global2 execution times right now. Unfortunately it looks like I almost never ran timing; however, since the total time seems to depend mostly on the uniformity of the input expts and not on the size or number of the transferred files, I came to assume that the execution time is mostly due to the distribution algorithm. All jobs below were performed on Perlmutter's AMD EPYC 7763 nodes with at least 2 processors per rank.

Job ID  | # Expts (total / min / max) | # Ranks | LB_SPLIT_LIST step time | balance total time
3641230 | 555 / 0 / 10                | 512     | 6 ms                    | 1 s *
5971998 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5972076 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5973167 | 3044 / 0 / 29               | 192     | N/A                     | 1 s **
5975458 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5975175 | 3044 / 0 / 29               | 192     | N/A                     | 0 s **
5962589 | 3945 / 0 / 564              | 320     | 5 ms                    | N/A
5690290 | 36077 / 256 / 849           | 64      | N/A                     | 24 s
5310385 | 41356 / 0 / 2663            | 64      | N/A                     | 812 s
5312217 | 41356 / 0 / 2663            | 64      | N/A                     | 813 s
5652693 | 109469 / 6 / 7326           | 64      | N/A                     | 950 s
5668346 | 109469 / 6 / 7326           | 64      | N/A                     | 945 s
5658463 | 109469 / 6 / 7326           | 64      | N/A                     | 950 s

  • Marked with *: oddly enough, global2 did not balance correctly, i.e. max – min = 2 post-balance;
  • Marked with **: in these jobs each refl takes ~20x more memory, potentially increasing transfer times.

Please don't ask why I ran a job with 555 expts on 512 ranks – it seems my goals are beyond anyone's understanding, including my own (・・。)ゞ

Baharis (Contributor, Author) commented Mar 14, 2023

@irisdyoung I tested the hypothesis I put forward in Issue 3 and realized that the algorithm is by no means recursive, and is much smarter and faster than I had initially understood. Since the fault is not in this part, why is global2 so slow?

irisdyoung (Contributor) commented:

That's good news from my perspective! I will add some timing info and hopefully get more information. It could be how the experiment and reflection sets are being sliced up (before passing them around, but after all ranks have received their instructions), or how they're aggregated after. I can try removing either of those steps (producing a nonfunctional algorithm; just for testing) to find out.

Baharis (Contributor, Author) commented Mar 14, 2023

@irisdyoung Thanks, that's a thing to check. I started some test jobs with timing, so I might look into it.

phyy-nx (Contributor) commented Mar 14, 2023

If it helps, you can run it with debug.cProfile=True, then use a profiler on one of the per-rank .prof files to see what is going on.
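
For example, the per-rank .prof files can be loaded with the standard-library pstats module; the filename below is just a placeholder:

```python
import pstats

# Load one of the per-rank profiles written with debug.cProfile=True
# (the actual filename will differ).
stats = pstats.Stats("rank_0.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 calls by cumulative time
```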

Baharis (Contributor, Author) commented Mar 14, 2023

OK, after some timing I have reached some interesting conclusions:

  • I cannot reproduce the execution times of the 14-minute jobs; each now takes 30 seconds instead of 15 minutes. I will attribute this difference to Perlmutter instability;
  • I can consistently reproduce the execution times of the 16-minute jobs:
    • The majority of the balance execution time (over 935 seconds) is spent in the LB_EXPTS_AND_REFLS_ALLTOALL step; the LB_SPLIT_LIST step which I initially blamed takes below 1 ms;
    • Out of these 935 seconds in LB_EXPTS_AND_REFLS_ALLTOALL, 934 are spent in mpi4py.MPI.Comm.alltoall.
  • The same 16-minute balance job with global1 instead of global2 takes... 13 seconds.

So it looks like the grossly inefficient part here is alltoall, and for some reason the alltoall call in global1 does not have the same issue as in global2. The only difference between them, as far as I'm aware, is distributing expts and refls separately, so maybe I'll try that. Nevertheless, it seems this problem could easily be solved by performing alltoall in steps (see Issue 4), as alltoall seemingly has issues with very long input.

EDIT: Distributing expts and refls separately does not solve the issue; both steps in total take exactly the same time.
