Fivefold faster tempo detection and spectrograms too #5899

Open · wants to merge 18 commits into base: master
Conversation

@Paul-Licameli (Collaborator) commented Jan 28, 2024

Resolves: (direct link to the issue)

You read that right, fivefold.

At least ... on one computer, with the averages of a few before-and-after trials, with 6 cores and hyperthreading.

Results will vary with hardware, but the approach promises to scale as processing power gets cheaper.

Among all the recent improvements to tempo detection performance, the "obvious" thing hadn't been done --
which is, to use multiple cores to exploit the "embarrassing" parallelism inherent in the algorithm.

This PR adds some template classes and functions in a new lib-concurrency library -- the hard part -- to make
transformations of such data-parallel computations comparatively easy to do.

Tempo detection is the first application: unit tests pass when runLocally = true in MirTestUtils.cpp, and timeMeasurement.txt then contains much smaller numbers.

The computation of spectrograms for display is a second, even easier application, in which one simply
supplies an iterator range and a (copyable) lambda, and the speedup follows. Try using the largest spectrogram window
size (32768) and compare before-and-after zooming in and out, scrolling, and playing with the pinned head. (But sorry,
no parallelism yet if you use the Reassignment algorithm.)

What are other possible easy applications? I think there are some in Sequence.cpp such as converting sample format, and
finding min/max/rms. Others?
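To make the "iterator range and a (copyable) lambda" idea concrete, here is a minimal sketch of the pattern, not the PR's actual lib-concurrency API (the diff shows names like `Parallel::ForEach` and `OneDimensionalReduce`, whose real signatures may differ). `ParallelForEach` is a hypothetical simplification that assumes random-access iterators and splits the range into contiguous chunks, one thread per chunk:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical sketch: run fn on every element of [first, last), splitting
// the range into nThreads contiguous chunks whose sizes differ by at most one.
// fn must be copyable, since each worker thread gets its own copy.
template <typename It, typename Fn>
void ParallelForEach(It first, It last, Fn fn,
                     unsigned nThreads = std::thread::hardware_concurrency())
{
   nThreads = std::max(1u, nThreads);
   const auto total = static_cast<std::size_t>(std::distance(first, last));
   std::vector<std::thread> workers;
   std::size_t begin = 0;
   for (unsigned i = 0; i < nThreads; ++i) {
      // Spread the remainder over the first (total % nThreads) chunks
      const std::size_t end = begin + total / nThreads + (i < total % nThreads);
      workers.emplace_back([=] {
         std::for_each(first + begin, first + end, fn);
      });
      begin = end;
   }
   for (auto &t : workers)
      t.join();
}
```

Usage is then one line at the call site, e.g. summing into a `std::atomic` accumulator captured by reference in the lambda; the real PR code additionally handles progress reporting and abandonment, which this sketch omits.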

  • I signed CLA
  • The title of the pull request describes an issue it addresses
  • If changes are extensive, then there is a sequence of easily reviewable commits
  • Each commit's message describes its purpose and effects
  • There are no behavior changes unnecessary for the stated purpose of the PR

Recommended:

  • Each commit compiles and runs on my machine without known undesirable changes of behavior

@Paul-Licameli added the labels "spectral tools" (spectrogram, spectral brush, etc), "performance" (Improve speed of operations), and "tempo detection" on Jan 28, 2024
@Paul-Licameli self-assigned this on Jan 28, 2024
@Paul-Licameli added this to the Audacity 3.5 milestone on Jan 28, 2024
@crsib (Contributor) left a comment:

Judging by the code style, you are pulling in some of your older work.

I have only looked at lib-concurrency. I think there are a few changes needed, but they are rather minor.

]]#

set( SOURCES
MessageBuffer.h
@crsib (Contributor):

I don't expect this class to be used anywhere except the current Effects implementation. It makes very strict assumptions about the reader and writer, and it caused a fair number of problems due to its double buffering, which made 3.2 such a torture.

@Paul-Licameli (Collaborator, Author):

It was first written and used in scrubbing to communicate to the thread that fetches audio in response to mouse movements.

That file is just moved without change and does not need review.


set( SOURCES
MessageBuffer.h
spinlock.h
@crsib (Contributor):

This class calls std::this_thread::yield on every second loop iteration and doesn't use a pause instruction. That makes some sense, but it isn't exactly correct.

@Paul-Licameli (Collaborator, Author):

That file, too, is just moved unchanged, and it is not used in the new files.
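For readers unfamiliar with the spin-then-yield pattern crsib alludes to, here is a generic sketch, not the PR's spinlock.h (which, per the reply, was moved unchanged): spin a bounded number of times with a CPU "pause" hint, which eases pressure on a hyperthreaded sibling core and avoids pipeline flushes, and only then fall back to yielding the time slice. The spin count of 64 is an arbitrary illustrative choice:

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause, x86 only; other targets just skip the hint
#endif

class SpinLock {
   std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
   void lock() {
      int spins = 0;
      while (flag.test_and_set(std::memory_order_acquire)) {
         if (++spins < 64) {
#if defined(__x86_64__) || defined(_M_X64)
            _mm_pause();  // busy-wait hint to the CPU
#endif
         } else {
            std::this_thread::yield();  // give up the time slice
            spins = 0;
         }
      }
   }
   void unlock() { flag.clear(std::memory_order_release); }
};
```

The trade-off is the usual one: pure spinning wastes cycles when the holder is descheduled, while yielding too eagerly adds scheduler latency on short critical sections.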

@@ -6,6 +6,7 @@ set( LIBRARIES
lib-string-utils
lib-strings
lib-utility
lib-concurrency
@crsib (Contributor):
This library has surely been needed for a long time.

*/
increment_type operator()(increment_type inc = {}) {
if (abandoned.load(std::memory_order_relaxed))
throw detail::AbortException{};
@crsib (Contributor):

I see absolutely no justification for using exceptions here. They are meant to signal exceptional situations. In the worst case, they can be used to get back deep into the call stack, but that always indicates serious problems with program design.

@Paul-Licameli (Collaborator, Author):

The alternative is to complicate how tasks are written, requiring them to test a flag and exit.

It is simpler instead to let them call a possibly non-returning function, which, perhaps in a future C++20 version, would instead become a coroutine suspension point. Then there could be the possibility of non-resumption, or of migration of a task to another thread, in a more sophisticated task scheduler.

@crsib (Contributor):

The alternative is to complicate how tasks are written

I can have a task where I need to handle complications explicitly. I do have such use cases now, but not using this class, obviously.

I don't see how handling the result complicates anything.
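The "test and exit" alternative crsib prefers can be sketched as follows. This is an illustration only, not the PR's code: `RunChunks`, `TaskStatus`, and the chunked loop are hypothetical stand-ins for however the real task body is structured. The task polls the abandonment flag and returns a status, and the caller handles cancellation explicitly instead of catching an exception:

```cpp
#include <atomic>
#include <cstddef>

enum class TaskStatus { Done, Cancelled };

// Hypothetical task driver: run `work` on chunkCount chunks, checking a
// shared abandonment flag between chunks and returning early if it is set.
template <typename Work>
TaskStatus RunChunks(std::size_t chunkCount,
                     const std::atomic<bool> &abandoned, Work work)
{
   for (std::size_t i = 0; i < chunkCount; ++i) {
      if (abandoned.load(std::memory_order_relaxed))
         return TaskStatus::Cancelled;  // test and exit -- no exception thrown
      work(i);
   }
   return TaskStatus::Done;
}
```

The cost Paul describes is visible here: every loop that can be abandoned must thread the flag and the early return through its control flow, whereas the exception (or a future coroutine suspension point) keeps the task body oblivious to cancellation.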

if (abandoned.load(std::memory_order_relaxed))
throw detail::AbortException{};
const auto old_total = total.fetch_add(inc, std::memory_order_acq_rel);
return old_total + inc;
@crsib (Contributor):

Reporting progress is very complex in MT environments. I'm not sure that this value will be meaningful by the time the progress is reported.

// TODO: This function might be modified instead to allocate tasks to threads
// from a pool, and the Task class might define other policies for
// subdivision of its range
n_threads = std::max(1u, n_threads);
@crsib (Contributor):

I like that the assertion is not the only check, but do we need the n_threads > 0 check at all?

auto newDistance = distance + step;
fraction += remainder;
if (fraction >= n_threads)
fraction -= n_threads, ++newDistance;
@crsib (Contributor):

I am very concerned about this code. It must be rewritten without the comma.
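For illustration, the quoted carry logic can be written without the comma operator simply by bracing the two effects. Here it is wrapped in a small hypothetical helper (not the PR's code; `ChunkSizes` and its signature are invented for the example) that splits `total` items into `n_threads` chunk sizes differing by at most one, which is what the fraction/remainder arithmetic accomplishes:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: compute n_threads chunk sizes summing to total,
// each equal to total / n_threads or one more. Precondition: n_threads >= 1.
std::vector<std::size_t> ChunkSizes(std::size_t total, std::size_t n_threads)
{
   const std::size_t step = total / n_threads;
   const std::size_t remainder = total % n_threads;
   std::vector<std::size_t> sizes;
   std::size_t fraction = 0;
   for (std::size_t i = 0; i < n_threads; ++i) {
      std::size_t size = step;
      fraction += remainder;
      if (fraction >= n_threads) {  // carry: braces instead of the comma
         fraction -= n_threads;
         ++size;
      }
      sizes.push_back(size);
   }
   return sizes;
}
```

For example, 10 items over 3 threads yields chunk sizes 3, 3, 4, spreading the remainder evenly across the range rather than dumping it all on the last thread.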

// 0
size_t index = 0;
for (auto &task : tasks) {
std::thread{ ref(channels[index].first), std::ref(task), index + 1 }
@crsib (Contributor):

Contrary to the previous place, I see very little reason to split this into multiple lines instead of using ++index right in place.

@Paul-Licameli (Collaborator, Author):

Wrong -- index is used twice in the arguments of a constructor, and the order of evaluation is not specified. I must increment only in a separate statement.

size_t index = 0;
for (auto &task : tasks) {
std::thread{ ref(channels[index].first), std::ref(task), index + 1 }
.detach();
@crsib (Contributor):

I can't say I like this, but I understand that this is safe.

@Paul-Licameli (Collaborator, Author):

Synchronizing on the get of a future of a packaged_task that runs on the thread should be equivalent to joining the thread.
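A minimal sketch of the synchronization Paul is describing (an illustration, not the PR's code): the packaged_task's promise is satisfied when the task body finishes, so `future::get()` happens-after all of the detached thread's work on the task, substituting for `join()` as far as the task's results are concerned:

```cpp
#include <future>
#include <thread>

// Run a task on a detached thread and synchronize via its future.
// get() blocks until the task body has completed on the detached thread.
int RunDetached()
{
   std::packaged_task<int()> task([] { return 42; });
   std::future<int> result = task.get_future();
   std::thread{ std::move(task) }.detach();
   return result.get();
}
```

One caveat, which may be the source of crsib's unease: after `get()` returns, the detached thread itself may still execute a few instructions (e.g. destroying its copy of the task state) before it actually exits, so this is equivalent to joining only with respect to the task's observable results.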

{
assert(n_threads >= 1);
// TODO: This function might be modified instead to allocate tasks to threads
// from a pool, and the Task class might define other policies for
@crsib (Contributor):

Not having thread pools limits the usability of this function quite significantly.

@Paul-Licameli (Collaborator, Author):

I didn't want to distract myself yet with writing a thread pool too. I wanted to get to a demonstration of some real performance wins and also ease of use of the new utility in a few places.

Do we already have a usable thread pool in a third-party library?

@crsib (Contributor) commented Jan 29, 2024:

At least ... on one computer, with the averages of a few before-and-after trials, with 6 cores and hyperthreading.

You have 12 threads and just a 5x speedup. The good question is why, but I probably need to examine the rest of the PR.

};
using namespace Parallel;
ForEach task{ first, last, calcOne };
OneDimensionalReduce(task);
@crsib (Contributor):

If I understand correctly, this function may put the CPU at 100% load during playback. There must be more control over the thread priorities and thread count here.

@Paul-Licameli (Collaborator, Author):

Judging by the code style, you are pulling in some of your older work.

I have only looked at lib-concurrency. I think there are a few changes needed, but they are rather minor.

Two old files were simply moved into lib-concurrency. The rest of the work is very recent. But it abstracts out for reuse the pattern that was applied in this experiment a year ago. #4141

@Paul-Licameli
Copy link
Collaborator Author

At least ... on one computer, with the averages of a few before-and-after trials, with 6 cores and hyperthreading.

You have 12 threads and just a 5x speedup. The good question is why, but I probably need to examine the rest of the PR.

Speedup won't be ideal. There is always some serial, or less than fully parallel, component -- such as the final merging step, in which not all threads are doing work and there is inter-thread communication.

You pointed out some cache ping-pong from false sharing.

There is thread creation and destruction overhead because I didn't bother yet to invent or reuse a thread pool.

There might be contention for a shared resource like the main memory bus or I/O.

The estimated speedup was from @saintmatthieu 's regression test for tempo detection. In fact, that test first pulls entire files into big memory buffers; it used to go through that memory sequentially, but now it can be fetched in a random-access way.

The real-world tempo detection first creates a WaveClip object, and then this parallelized version, through the MirAudioReader subclass, will fetch sample blocks out of it, requiring database access.
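A back-of-the-envelope way to frame the "12 threads, only 5x" observation is Amdahl's law: if a fraction p of the runtime parallelizes perfectly over n threads, the speedup is 1 / ((1 - p) + p / n). The function and the 87% figure below are illustrative arithmetic, not a measurement from this PR:

```cpp
// Amdahl's law: expected speedup when fraction p of the work runs
// perfectly in parallel on n threads and the rest stays serial.
double AmdahlSpeedup(double p, unsigned n)
{
   return 1.0 / ((1.0 - p) + p / n);
}
```

Plugging in n = 12, a 5x speedup corresponds to roughly p ≈ 0.87, i.e. about 87% of the runtime being parallel -- plausible once the serial merge, thread creation overhead, and memory contention above are counted, and the 6 extra hyperthreads rarely behave like full cores anyway.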

@Paul-Licameli Paul-Licameli removed this from the Audacity 3.5 milestone Feb 21, 2024
Labels: performance (Improve speed of operations), spectral tools (spectrogram, spectral brush, etc), tempo detection
Project status: Ready for Review