
Suspiciously low User and Sys time #597

Open
moncefmechri opened this issue Dec 29, 2022 · 5 comments

Comments

@moncefmechri

moncefmechri commented Dec 29, 2022

I am benchmarking the mold linker on an in-house application, using hyperfine 1.15.0 (from conda-forge) to drive the benchmarks.

While the wall-clock times reported seem correct, the "User" and "Sys" times seem way too low (and also are not impacted by the number of threads used). Example:

Benchmark 1: mold180 (1 threads)
  Time (mean ± σ):     24.302 s ±  0.477 s    [User: 0.392 s, System: 0.234 s]
  Range (min … max):   23.836 s … 24.789 s    3 runs

Benchmark 2: mold180 (2 threads)
  Time (mean ± σ):     14.871 s ±  0.597 s    [User: 0.385 s, System: 0.246 s]
  Range (min … max):   14.351 s … 15.523 s    3 runs

The benchmark invokes the linker by calling ninja MyBinary (--setup and --prepare are used to make sure ninja does little more than invoke the linker).

Specs of the machine used to run the benchmark:

  • OS: Ubuntu 20.04 with kernel 5.4
  • CPU: Dual socket Intel Xeon Gold 5218
  • RAM: 128GB
@sharkdp
Owner

sharkdp commented Jan 8, 2023

Please see https://stackoverflow.com/a/556411/704831. User and system times only refer to the time that the benchmarked process itself spent in user/system mode. Ninja calls out to subprocesses (the linker), and that is where the majority of the time is spent. So what you see above is only the time that Ninja itself spends in user/system mode.

If you are interested in the user/system times for the linker, maybe you can benchmark it directly?

I realize that this can be confusing. We should probably document it. Now that I'm thinking about it some more, I'm also not sure if that works correctly with and without the intermediate shell (--shell=…), or whether it is consistent across operating systems. This should be investigated, consolidated, and documented.

@moncefmechri
Author

moncefmechri commented Jan 8, 2023

I think there might be something more subtle at play here.

First of all, for some reason, when linking using lld instead of mold, the user and system times look a lot more reasonable, even when going through ninja.

If I skip ninja and invoke g++ explicitly, I see again a similar pattern. The following runs both use 1 thread:

Benchmark 1: lld
  Time (mean ± σ):     10.933 s ±  0.029 s    [User: 9.437 s, System: 1.493 s]
  Range (min … max):   10.912 s … 10.953 s    2 runs
  
Benchmark 1: mold
  Time (mean ± σ):     12.026 s ±  0.028 s    [User: 0.013 s, System: 0.002 s]
  Range (min … max):   12.006 s … 12.045 s    2 runs

The only difference in the command line is how the linker is specified: for lld I pass -fuse-ld=lld to the compiler, while for mold I pass -B/path/to/mold/folder. I can't see how this would impact the results.

I also ran both commands with strace -f. Both traces look almost identical. With both linkers:

  • With a vfork() + execve(), the compiler invokes collect2
  • With another vfork() + execve(), collect2 invokes the linker

Once collect2 has invoked the linker, the main difference I see in the strace -f outputs between lld and mold is that mold ends up calling clone(), and the actual work seems to be done in that clone()-d process, while lld does the work in its main process.

The call to clone() in the strace -f output for mold:

[pid 36238] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f33768d0ad0) = 36239

Maybe that's a clue toward the real answer.
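
If the actual work happens in a process that nobody ever wait()s for, its CPU time would simply never be attributed to mold. Here is a purely hypothetical C sketch of that pattern (not mold's actual code): the parent forks a worker, waits for a "done" byte on a pipe instead of calling wait(), and exits. Running it under time reports real ≈ 2 s but user/sys ≈ 0:

/* fork_demo.c - hypothetical sketch, not mold's actual code */
#include <time.h>
#include <unistd.h>

static void burn_cpu(double seconds)
{
    /* spin in user mode until this process has consumed `seconds` of CPU */
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end)
        ;
}

int main(void)
{
    int fds[2];
    char byte;

    if (pipe(fds) != 0)
        return 1;

    pid_t pid = fork();
    if (pid < 0)
        return 1;

    if (pid == 0) {                    /* worker: does the real work */
        close(fds[0]);
        burn_cpu(2.0);
        (void)write(fds[1], "x", 1);   /* tell the parent we are done */
        _exit(0);
    }

    close(fds[1]);
    (void)read(fds[0], &byte, 1);      /* parent blocks until the worker is done... */
    return 0;                          /* ...but never calls wait(), so the worker's
                                          user/sys time is never collected */
}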

Note that I am not the only one to have seen this behavior. In this blog post, one of the lld devs uses hyperfine to compare lld and mold. You can see the same pattern there.

Finally, I will add that plain time shows the same results as hyperfine for both linkers:

With lld:

real    0m10.922s
user    0m9.469s
sys     0m1.452s

With mold:

real    0m12.083s
user    0m0.008s
sys     0m0.008s

@sharkdp
Owner

sharkdp commented Jan 9, 2023

Thank you for the detailed answer. I sent you that stackoverflow link but didn't read it carefully myself - sorry. This also explains my confusion ("I'm also not sure if that works correctly with and without the intermediate shell").

The reality is a bit more complicated, as you said. Typically, user and sys times of child processes are collected, which is why hyperfine's measurement usually works both with and without the intermediate shell.

However, user and sys times of child processes that had not finished by the time the main process ended are not counted. This is described in this comment:

The time spent by child processes and their descendants count only when the time could have been collected by wait(2) or waitpid(2), and recursively with descendants. This implies that the child processes must have terminated. For instance, compare time sh -c 'foo & sleep 1' and time sh -c 'foo & sleep 2', where foo is a command that takes CPU time between 1 and 2 seconds. The former outputs something around 0.

We can demonstrate this using stress --cpu 1 --timeout 2, which spends most of its time in "user" mode for two seconds:

▶ time sh -c 'stress --cpu 1 --timeout 2 --quiet & sleep 1'
[…]  0,00s user 0,00s system 0% cpu 1,004 total

▶ time sh -c 'stress --cpu 1 --timeout 2 --quiet & sleep 3'
[…]  2,00s user 0,00s system 66% cpu 3,004 total

The same thing can also be demonstrated with hyperfine:

▶ hyperfine \
    'stress --cpu 1 --timeout 2 --quiet & sleep 1' \
    'stress --cpu 1 --timeout 2 --quiet & sleep 3'

Benchmark 1: stress --cpu 1 --timeout 2 --quiet & sleep 1
  Time (mean ± σ):      1.001 s ±  0.000 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.001 s …  1.001 s    10 runs
 
Benchmark 2: stress --cpu 1 --timeout 2 --quiet & sleep 3
  Time (mean ± σ):      3.001 s ±  0.000 s    [User: 1.996 s, System: 0.001 s]
  Range (min … max):    3.001 s …  3.002 s    10 runs

Note that the 2 seconds in user mode are only reported in the lower example, where the waiting time is long enough. In the first example, a user time of 0 is reported.
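
At the system-call level, this is just the POSIX accounting rule: a child's CPU time is folded into the parent's RUSAGE_CHILDREN counters only once the parent has wait()ed for it. A minimal C sketch (for illustration only; this is not how hyperfine is implemented):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double children_user_seconds(void)
{
    /* user time of all terminated *and reaped* children */
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

int main(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return 1;

    if (pid == 0) {                  /* child: burn ~1 s of CPU, then exit */
        clock_t end = clock() + CLOCKS_PER_SEC;
        while (clock() < end)
            ;
        _exit(0);
    }

    sleep(2);                        /* the child has terminated by now... */
    printf("before wait: %.2f s user\n", children_user_seconds());  /* ~0.00 */

    waitpid(pid, NULL, 0);           /* ...but is only accounted once reaped */
    printf("after  wait: %.2f s user\n", children_user_seconds());  /* ~1.00 */
    return 0;
}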

Now the question remains why this happens in your example. Could it be that one of the tools spawns children but then doesn't wait for them?

@moncefmechri
Author

Thank you for your reply. I will try to figure out what mold is doing in the next few days, and will report back.

@sharkdp
Owner

sharkdp commented Apr 17, 2023

@moncefmechri Any update on this?
