Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(server): refresh external assets piecemeal #7934

Open
wants to merge 82 commits into
base: main
Choose a base branch
from

Conversation

etnoy
Copy link
Contributor

@etnoy etnoy commented Mar 13, 2024

There have recently been improvements that help refreshing large external libraries. However, it is still capped at some limit, depending on available memory and the size of the library.

This PR lets us remove the upper limit of library refreshes. We previously kept a list of all assets in a library being refreshed because we would track which assets have been removed. This is the list that kept eating all memory. We no longer build a huge list from the filesystem crawl. Instead, we request them 5000 at a time and then discard when the corresponding refresh jobs have been queued.

This, however, means the library refresh jobs can't detect when a library asset goes offline, or when an offline asset goes back online. We move these actions to a separate job (but WIP maybe we can run it inline with normal refresh jobs??)

Quick testing indicated I was able to refresh millions of assets without a sweat on a VM with only 10G of ram. Even larger numbers are possible with even less memory; I just haven't bothered to test more ;)

image

@etnoy etnoy added 🗄️server external-library Issues related to external libraries labels Mar 13, 2024
Copy link

cloudflare-pages bot commented Mar 13, 2024

Deploying immich with  Cloudflare Pages  Cloudflare Pages

Latest commit: b3bb5c5
Status: ✅  Deploy successful!
Preview URL: https://36ac0a6f.immich.pages.dev
Branch Preview URL: https://feat-offline-files-job.immich.pages.dev

View logs

@etnoy etnoy changed the title feat(server): add job to check for offline files feat(server): refresh external assets piecemeal Mar 15, 2024
@etnoy etnoy marked this pull request as ready for review March 15, 2024 23:43
@etnoy etnoy requested review from jrasm91 and mertalev March 15, 2024 23:43
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/e2e/jobs/specs/library.e2e-spec.ts Outdated Show resolved Hide resolved
@etnoy
Copy link
Contributor Author

etnoy commented Mar 20, 2024

Would it be better to still have the offline and online stuff as separate jobs, but just queue both of them? That approach was a lot faster and used less memory when I tested it.

No, because for a large refresh it can potentially take a long time before either offline or online starts running. By interweaving the queue you make it feel snappier to the user. In my opinion.

server/e2e/jobs/specs/library.e2e-spec.ts Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
Comment on lines 657 to 673
const checkIfOnlineAssetsAreOffline = async () => {
const existingAssetPage = await onlineAssets.next();
existingAssetsDone = existingAssetPage.done ?? true;

if (existingAssetPage.value) {
existingAssetCounter += existingAssetPage.value.length;
this.logger.log(
`Queuing online check of ${existingAssetPage.value.length} asset(s) in library ${library.id}...`,
);
await this.jobRepository.queueAll(
existingAssetPage.value.map((asset: AssetEntity) => ({
name: JobName.LIBRARY_CHECK_OFFLINE,
data: { id: asset.id, importPaths: validImportPaths },
})),
);
}
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you already put it in its own function, why keep it in there and not move it out into its own private function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply because we need to use variables in the parent context

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you just pass the values? IMO arrow functions should be (as well as possible) independent and not have any side effects.

server/src/domain/library/library.service.ts Outdated Show resolved Hide resolved
server/src/interfaces/asset.repository.ts Outdated Show resolved Hide resolved
server/src/services/library.service.ts Outdated Show resolved Hide resolved
this.logger.debug(`Found ${assetIdsToMarkOffline.length} offline asset(s) previously marked as online`);
await this.assetRepository.updateAll(assetIdsToMarkOffline, { isOffline: true });
}
while (!crawlDone) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit weird. We should be able to use a normal (async) for loop here instead.

Copy link
Contributor Author

@etnoy etnoy Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally a for loop would be preferable. I switched to a while loop because I wanted to do this in batches of 1000 at a time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Why can't you do

for await (const crawlResult of crawledAssets) {
...
}

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code needs to do two tasks, and to make the system feel more responsive I don't want to wait for the first (potentially very long) job to be worked on before the next jobs start. We therefore interleave the two jobs in between

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reply does not make any sense to me. Doesn't this semantically do the exact same thing?

await this.scanAssets(job.id, batch, library.ownerId, job.refreshAllFiles ?? false);
batch = [];
if (!existingAssetsDone) {
// Interweave the queuing of offline checks with the asset scanning (if any)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of this? If anything it seems like it would be less efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have two things happening in parallel. One queue checks the list of crawled assets on the file system. The other queue checks if files in the db are still online. In order to speed up the percieved quickness of the system I am picking 1000 from each queue at a time. This makes the system feel snappier because it doesn't wait for all files to finish scanning before noticing an existing file being offline, and vice versa. This final loop checks if there are more things to pick up from the "is this asset still online?" queue.

Hope this makes sense

server/src/services/library.service.ts Show resolved Hide resolved
danieldietzler
danieldietzler previously approved these changes Apr 5, 2024
Copy link
Member

@danieldietzler danieldietzler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has already gotten much more readable, thanks!
And sorry that PR got stale lol

server/e2e/jobs/specs/library.e2e-spec.ts Show resolved Hide resolved
server/e2e/jobs/specs/library.e2e-spec.ts Show resolved Hide resolved
server/e2e/jobs/specs/library.e2e-spec.ts Show resolved Hide resolved
server/e2e/jobs/specs/library.e2e-spec.ts Show resolved Hide resolved
server/src/services/library.service.spec.ts Outdated Show resolved Hide resolved
server/src/services/library.service.spec.ts Outdated Show resolved Hide resolved
this.logger.debug(`Found ${assetIdsToMarkOffline.length} offline asset(s) previously marked as online`);
await this.assetRepository.updateAll(assetIdsToMarkOffline, { isOffline: true });
}
while (!crawlDone) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Why can't you do

for await (const crawlResult of crawledAssets) {
...
}

?

let batch = [];
for (const assetPath of crawledAssetPaths) {
batch.push(assetPath);
if (crawledAssetPaths.length % LIBRARY_SCAN_BATCH_SIZE === 0 || crawlDone) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we move that whole if outside the loop? Once we quit that loop we inherently are done with the scanning, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we aren't done with the scanning after the loop, it also needs to check if any existing assets need to be checked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. You're implying that batching does not give you all results?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am also just very confused... This whole interleaving those two jobs may make it more performant but it definitely adds a lot of complexity my brain apparently can't comprehend.

@danieldietzler danieldietzler dismissed their stale review April 5, 2024 14:01

I didn't mean to approve this yet oops

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external-library Issues related to external libraries 🗄️server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants