-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fivefold faster tempo detection and spectrograms too #5899
base: master
Are you sure you want to change the base?
Fivefold faster tempo detection and spectrograms too #5899
Conversation
... Thus std::advance and std::distance will be O(1)
... Removing the old OpenMP experiment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Judging by the code style you are pulling in some of your older work.
I have only looked on lib-concurrency
. I think there few changes needed, but they are rather minor.
]]# | ||
|
||
set( SOURCES | ||
MessageBuffer.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect this class to be used anywhere except for the current Effects implementation. It makes very strict assumptions about the Reader/Writer and gives a fair amount of problems due to its' double buffering, which made 3.2 such a torture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was first written and used in scrubbing to communicate to the thread that fetches audio in response to mouse movements.
That file is just moved without change and does not need review.
|
||
set( SOURCES | ||
MessageBuffer.h | ||
spinlock.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class uses this_thread::yield
every second loop and doesn't use pause
. It makes sense, but not exactly correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That file too is just moved unchanged and not used in the new files.
@@ -6,6 +6,7 @@ set( LIBRARIES | |||
lib-string-utils | |||
lib-strings | |||
lib-utility | |||
lib-concurrency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This library is surely long needed.
*/ | ||
increment_type operator()(increment_type inc = {}) { | ||
if (abandoned.load(std::memory_order_relaxed)) | ||
throw detail::AbortException{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see absolutely no justification for using exceptions here. They are meant to signal exceptional situations. In the worst case, they can be used to get back deep into the call stack, but that always indicates serious problems with program design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is to complicate how tasks are written, requiring them to test and exit.
It is simpler instead to let them use a possibly non-returning function, which also, in the future C++20 version perhaps, would instead become a coroutine suspension point. Then there could be the possibility of non-resumption or migration of a task to another thread in a more sophisticated task scheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is to complicate how tasks are written
I can have a task where I need to handle complications explicitly. I do have such use cases now, but not using this class, obviously.
I don't see how handling the result complicates anything.
if (abandoned.load(std::memory_order_relaxed)) | ||
throw detail::AbortException{}; | ||
const auto old_total = total.fetch_add(inc, std::memory_order_acq_rel); | ||
return old_total + inc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reporting progress is very complex in MT environments. I'm not sure that this value will be meaningful by the time the progress is reported.
// TODO: This function might be modified instead to allocate tasks to threads | ||
// from a pool, and the Task class might define other policies for | ||
// subdivision of its range | ||
n_threads = std::max(1u, n_threads); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that the assertion is not the only check, but do we need the n_threads > 0
at all?
auto newDistance = distance + step; | ||
fraction += remainder; | ||
if (fraction >= n_threads) | ||
fraction -= n_threads, ++newDistance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very concerned about this code. It must be rewritten without the comma.
// 0 | ||
size_t index = 0; | ||
for (auto &task : tasks) { | ||
std::thread{ ref(channels[index].first), std::ref(task), index + 1 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contrary to the previous place, I see very little reason to split this into multiple lines instead of using ++index right in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong -- index
is used twice in arguments of a constructor and order of evaluation is not specified. I must increment only in another statement.
size_t index = 0; | ||
for (auto &task : tasks) { | ||
std::thread{ ref(channels[index].first), std::ref(task), index + 1 } | ||
.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I like this, but I understand that this is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synchronizing on the get of a future of a packaged_task that runs on the thread should be equivalent to joining the thread.
{ | ||
assert(n_threads >= 1); | ||
// TODO: This function might be modified instead to allocate tasks to threads | ||
// from a pool, and the Task class might define other policies for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not having thread pools limits the usability of this function quite significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to distract myself yet with writing a thread pool too. I wanted to get to a demonstration of some real performance wins and also ease of use of the new utility in a few places.
Do we already have a usable thread pool in a third party library/
You have 12 threads and just a 5x speedup. The good question is why, but I probably need to examine the rest of the PR. |
}; | ||
using namespace Parallel; | ||
ForEach task{ first, last, calcOne }; | ||
OneDimensionalReduce(task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this function may put CPU to 100% load during the playback. There must be more control over the thread priorities and count here.
Two old files were simply moved into lib-concurrency. The rest of the work is very recent. But it abstracts out for reuse the pattern that was applied in this experiment a year ago. #4141 |
Speedup won't be ideal. There should be some serial component. Or less than fully parallel, as in is the final merging step in which not all threads are doing work and there is inter-thread communication. You pointed out some cache ping-pong from false sharing. There is thread creation and destruction overhead because I didn't bother yet to invent or reuse a thread pool. There might be contention for a shared resource like the main memory bus or I/O. The estimated speedup was from @saintmatthieu 's regression test for tempo detection. In fact that test pulls entire files into big memory buffers first, then used to go sequentially through the memory, but now that can be fetched in a random-access way. The real-world tempo detection first creates a WaveClip object, and then this parallelized version through the |
Resolves: (direct link to the issue)
You read that right, fivefold.
At least ... on one computer, with the averages of a few before-and-after trials, with 6 cores and hyperthreading.
Results may vary with hardware but have great promise to scale as the processing power gets cheaper.
In all the recent performance improvements of tempo detection performance, the "obvious" thing wasn't done --
which is, to use multiple cores to exploit the "embarrassing" parallelism inherent in the algorithm.
This PR adds some template classes and functions in new lib-concurrency -- the hard part -- to make transformations of
such data-parallel computations comparatively easy to do.
Tempo detection is the first application -- unit tests pass, when
runLocally = true
in MirTestUtils.cpp, andtimeMeasurement.txt
then contains much smaller numbers.The computation of spectrograms for display is the example of a second, even easier application, in which one simply
supplies an iterator range and a (copyable) lambda, and the speedup follows. Try using the largest spectrogram window
size (327678) and compare before-and-after zooming-in and -out, scrolling, and playing with the pinned head. (But sorry,
no parallelism yet if you use the Reassignment algorithm.)
What are other possible easy applications? I think there are some in Sequence.cpp such as converting sample format, and
finding min/max/rms. Others?
Recommended: