Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group tasks from a generator #5

Open
gsouf opened this issue Oct 5, 2021 · 5 comments
Open

Group tasks from a generator #5

gsouf opened this issue Oct 5, 2021 · 5 comments

Comments

@gsouf
Copy link

gsouf commented Oct 5, 2021

Is your feature request related to a problem?

I have a large dataset that I want to process as a grouped set of tasks. task.group forces me to create 1 function per task to run, it is slow and no efficient for memory.

Describe the solution you'd like

It would be more efficient to have something like task.map that takes a generator or an array of data, as well as a function to execute for each record.

example:

const data = function*() {
  yield 'a';
  yield 'b';
  yield 'c';
};

// or
// data = ['a', 'b', 'c'];

task.map(data, (task, item)=> task(item, () => 
  // so something
}));

Task.map can also manage concurrency and invoke the next function only when a seat is freed:

const data = function*() {
  yield 'a';
  yield 'b';
  yield 'c';
};
task.map(
  data, 
  (task, item)=> task(item, () => 
  // so something
  }),
  {concurrency: 2}
);

I'm happy to send a PR for this

@privatenumber
Copy link
Owner

Thanks for the suggestion.

I'm not sure I see the benefit of building an API like that though.

I think you can already do the same thing using native API.

With an array of data, you can use Array#map:

const data = ['a', 'b', 'c'];

await task.group(
    task => data.map(
        item => task(`Working on ${item}`, async ({ setTitle }) => {
            await someAsyncTask()
        })
    ),
    {
        concurrency: 2
    }
)

In the case of generators, since the generator doesn't send all the values at once, I don't think a task group can be used. Task Groups are better for when you know all the tasks but they're queued.

With generators, you can do:

const data = function*() {
    yield 'a';
    yield 'b';
    yield 'c';
};

for (const item of data()) {
    task(`Working on ${item}`, async ({ setTitle }) => {
        await someAsyncTask()
    })
}

@gsouf
Copy link
Author

gsouf commented Oct 5, 2021

@privatenumber what you propose is the problem I explained at the beginning of the issue.

I have to create a new callback for each task, instead of reusing the same one for each record. Additionally I don't know in advance how many records I will generate. In my case I have thousands of them and they all run using the same function.

If we can depend on the concurrency of the task system to spawn function as spots are freed then we can write simple code that performs well without the need to create our own concurrency system as that's something already managed by tasuku.

Do you get the idea?

My current code with a big array is a bit slow and that causes issues in the terminal where tasuku spawns thousands of pending line (1 by task). What I want is that he spawns 1 line per task currently running in the map function, when a task completes then it invokes the next one, and so on...

@privatenumber
Copy link
Owner

privatenumber commented Oct 5, 2021

I have to create a new callback for each task, instead of reusing the same one for each record.

This is just a performance optimization, right? I'm curious if this is actually a measurable difference. Behind the scenes, it's using React via Ink, and I know they are quite memory inefficient (high allocate/de-allocate) by design.

If we can depend on the concurrency of the task system to spawn function as spots are freed then we can write simple code that performs well without the need to create our own concurrency system as that's something already managed by tasuku.

I prefer not to add async/concurrency control to keep tasuku minimal and focused on rendering the task. I think that can be done with another module.

import PQueue from 'p-queue'

const queue = new PQueue({ concurrency: 2 })

for (const item of generator()) {
	queue.add(
		() => task(`Working on ${item}`, async ({ setTitle }) => {
        	await someAsyncTask()
    	})
	);
}

Does that meet your needs?

My current code with a big array is a bit slow and that causes issues in the terminal where tasuku spawns thousands of pending line (1 by task). What I want is that he spawns 1 line per task currently running in the map function, when a task completes then it invokes the next one, and so on...

And you want these tasks to be running with a limited concurrency too, correct?

It might make more sense to simply add a hidePendingTasks option to task.group.

Alternatively, you can directly pass into p-map, which tasuku uses under the hood:

import pMap from 'p-map';

const sites = [
	getWebsiteFromUsername('sindresorhus'), //=> Promise
	'https://avajs.dev',
	'https://github.com'
];

const mapper = site => task(`Working on ${site}`, async ({ setTitle }) => {
     await someAsyncTask()
 })

const result = await pMap(sites, mapper, {concurrency: 2});

console.log(result);

Seems p-map also accepts an iterable, which might address your first problem with the generator.

@gsouf
Copy link
Author

gsouf commented Oct 6, 2021

@privatenumber thanks for the suggestions. I'll check them out shortly.

I just think it would be nice to be able to reuse the concurrency feature of tasuku instead of using p-queue. Also if you want to keep tasuku small and focused, then do the features task.group as well as the concurency param make sense at all? Because you can also replace them with other vendors.


One more thing - just so you know, when I run a dataset of 500 records with

task.group(task => data.map(item => task('title', async () => await doSomething(item)));

takes around 50 seconds (including around 4~5 seconds of overhead from the rest of my code).

Then I run the exact same code, with the only exception that I remove tasuku and I map doSomething asyncs calls into an array that I await Promise.all ex:

await Promise.all(data.map(doSomething));

It takes less than 7 seconds (including the 4~5 seconds of prior overhead).

So there is definitively a bottleneck somewhere here.

I only did quick tests for the moment as I have been short in time, I'll try to come back with a reproducible case as soon as I find a moment

@privatenumber
Copy link
Owner

The primarily feature of task.group is that it shows the pending tasks, which isn't accomplishable otherwise. But the functionality is pretty much all p-map, so it doesn't necessarily add code if you're already using it.


Interesting. I haven't had a case where I needed to run that many tasks but that's definitely a problem.

Thanks, looking forward to the reproduction!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants