
[Bug] immich microservice memory leak kills host #9414

Open
1 of 3 tasks
Dunky-Z opened this issue May 13, 2024 · 16 comments

Comments

@Dunky-Z

Dunky-Z commented May 13, 2024

The bug

Today I encountered the same issue as #5283. After bulk-importing around 10,000 photos (via an external library), I suddenly couldn't connect to my NAS remotely. My first suspicion was that the ML service was using too much memory during facial recognition, so I force-restarted the NAS and reconfigured the concurrency settings, setting every job to run on a single thread. Instead of starting all jobs at once, I ran FACE DETECTION first and then GENERATE THUMBNAILS. Face detection completed without any abnormality, which suggests the ML operations were not the cause. After it finished, I started the GENERATE THUMBNAILS job, which ran slowly due to the large image sizes. I wasn't monitoring it continuously, and when I came back the NAS was unreachable remotely, with the system log showing the following errors:

[17452.279485] Out of memory: Killed process 24344 (immich_microser) total-vm:14375660kB, anon-rss:8827768kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:23516kB oom_score_adj:0
[17653.570542] Out of memory: Killed process 24972 (immich_microser) total-vm:13119600kB, anon-rss:8825796kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:20652kB oom_score_adj:0
[17848.345248] Out of memory: Killed process 26162 (immich_microser) total-vm:19086052kB, anon-rss:8709080kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:31688kB oom_score_adj:0
[17932.668353] Out of memory: Killed process 27348 (immich_microser) total-vm:13774836kB, anon-rss:8759228kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:22648kB oom_score_adj:0
[18001.034451] Out of memory: Killed process 27902 (immich_microser) total-vm:14749840kB, anon-rss:9111036kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:25104kB oom_score_adj:0

This suggests a memory leak in the immich_microservices service (logged as immich_microser above). I would greatly appreciate any advice on how to address this problem.
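For reference, the `anon-rss` figure in those OOM-killer lines is the resident memory of the killed process in kB. A small shell sketch (the sample line is copied from the kernel log above) converts it to GiB to make the scale obvious:

```shell
# Extract the anon-rss value (in kB) from an OOM-killer log line and print it in GiB.
line='[17452.279485] Out of memory: Killed process 24344 (immich_microser) total-vm:14375660kB, anon-rss:8827768kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:23516kB oom_score_adj:0'
echo "$line" | awk 'match($0, /anon-rss:[0-9]+kB/) {
  kb = substr($0, RSTART + 9, RLENGTH - 11) + 0   # strip the "anon-rss:" prefix and "kB" suffix
  printf "anon-rss at kill time: %.1f GiB\n", kb / 1048576
}'
# → anon-rss at kill time: 8.4 GiB
```

So each killed worker was holding roughly 8.5 GiB of resident memory on a 16 GB host, which is consistent with the host becoming unreachable.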

The OS that Immich Server is running on

Debian(OMV)

Version of Immich Server

OpenMediaVault 5

Version of Immich Mobile App

v1.103.1

Platform with the issue

  • Server
  • Web
  • Mobile

Your docker-compose.yml content

#
# WARNING: Make sure to use the docker-compose.yml of the current release:
#
# https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
#
# The compose file on main may not be compatible with the latest release.
#

name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    command: ['start.sh', 'immich']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - "/srv/share/album:/mnt/media/share/album:ro"
    env_file:
      - .env
    ports:
      - 2283:3001
    depends_on:
      - redis
      - database
    restart: always

  immich-microservices:
    container_name: immich_microservices
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/hardware-transcoding
    #   file: hwaccel.transcoding.yml
    #   service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    command: ['start.sh', 'microservices']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - "/srv/share/album:/mnt/media/share/album:ro"
    env_file:
      - .env
    depends_on:
      - redis
      - database
    restart: always

  immich-machine-learning:
    container_name: immich_machine_learning
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - "/srv/appdata/immich/immich-cache:/cache"
    env_file:
      - .env
    restart: always

  redis:
    container_name: immich_redis
    image: registry.hub.docker.com/library/redis:6.2-alpine@sha256:51d6c56749a4243096327e3fb964a48ed92254357108449cb6e23999c37773c5
    restart: always

  database:
    container_name: immich_postgres
    image: registry.hub.docker.com/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
    volumes:
      - "/srv/appdata/immich/immich-postgresql/data:/var/lib/postgresql/data"
    restart: always

Your .env content

# You can find documentation for all the supported env variables at https://immich.app/docs/install/environment-variables

# The location where your uploaded files are stored
UPLOAD_LOCATION=./library

# The Immich version to use. You can pin this to a specific version like "v1.71.0"
IMMICH_VERSION=release

# Connection secret for postgres. You should change it to a random password
DB_PASSWORD=postgres

# The values below this line do not need to be changed
###################################################################################
DB_HOSTNAME=immich_postgres
DB_USERNAME=postgres
DB_DATABASE_NAME=immich

REDIS_HOSTNAME=immich_redis

Reproduction steps

1. Add an external library that contains the over 10000 pics
2. Wait for immich to generate thumbnails for the photo

Relevant log output

No response

Additional information

No response

@Dunky-Z Dunky-Z changed the title immich microservice memory leak kills host [Bug] immich microservice memory leak kills host May 13, 2024
@mertalev
Contributor

How much RAM does the server have, and how much is used just after starting thumbnail generation? An increase in memory usage for some period of time during thumbnail generation is normal because of memory fragmentation, but it should plateau after a certain point. If you have any RAW images, they will amplify this effect since they require much more memory.

@Dunky-Z
Author

Dunky-Z commented May 13, 2024

Thank you for your response. The server has 16GB of RAM in total, and typical daily usage hovers around 6GB. After thumbnail generation starts, it doesn't consume all the memory immediately; usage climbs gradually until the process crashes. Your mention of RAW images is a good catch: the majority of my gallery is RAW files. Is there a way to prevent excessive memory usage in this scenario? I tried limiting the microservices with cgroups, but when the limit is exceeded the service restarts. Since I bound the cgroup settings to the PID, the restart gives the process a new PID and the previous cgroup configuration no longer applies.
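One way around the PID-binding problem is to apply the limit at the container level, which follows the container across process restarts. A hedged sketch using the Docker CLI (the container name matches the compose file above; the 4g figure is purely illustrative):

```shell
# Cap memory at the container level; unlike a PID-bound cgroup,
# this limit survives process restarts inside the container.
docker update --memory 4g --memory-swap 4g immich_microservices

# Verify the limit took effect (value is reported in bytes)
docker inspect --format '{{.HostConfig.Memory}}' immich_microservices
```

The equivalent declarative form is the `deploy.resources.limits` section of the compose file, which appears later in this thread.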

@Dunky-Z
Author

Dunky-Z commented May 13, 2024

Given the server's modest performance, I have configured all my jobs to use a single thread, and only one job runs at a time. In theory, processing a single RAW file shouldn't require such an excessive amount of memory.

@mertalev
Contributor

Hmm, that much of an increase is unexpected. It could be related to the issue behind #6542, or possibly #4391. Are there any errors in the microservices logs before the OOM error?

As far as limiting memory usage in the meantime, you can set a limit through Docker, which will force the container to be restarted after reaching a certain usage.

@Dunky-Z
Author

Dunky-Z commented May 13, 2024

Below are some logs from the microservices; perhaps they might provide some insight:

[Nest] 7  - 05/13/2024, 11:34:38 AM     LOG [EventRepository] WebSocket server initialized.
[Nest] 7  - 05/13/2024, 11:35:06 AM   ERROR [JobService] Failed to execute job handler (thumbnailGeneration/generate-preview): Error: The input file contains an unsupported image format
[Nest] 7  - 05/13/2024, 11:35:06 AM   ERROR [JobService] Error: The input file contains an unsupported image format
    at Sharp.toFile (/usr/src/app/node_modules/sharp/lib/output.js:89:19)
    at MediaRepository.resize (/usr/src/app/dist/repositories/media.repository.js:76:14)
    at MediaService.generateThumbnail (/usr/src/app/dist/services/media.service.js:156:48)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async MediaService.handleGeneratePreview (/usr/src/app/dist/services/media.service.js:134:29)
    at async /usr/src/app/dist/services/job.service.js:149:36
    at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
    at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
[Nest] 7  - 05/13/2024, 11:35:06 AM   ERROR [JobService] Affected Job ID:
{
  "id": "5802ddbb-4e01-47fb-b21b-73bc0361f9e8"
}

[Nest] 7  - 05/13/2024, 11:35:07 AM   ERROR [JobService] Failed to execute job handler (thumbnailGeneration/generate-preview): Error: ffprobe exited with code 1
ffprobe version 6.0.1-Jellyfin Copyright (c) 2007-2023 the FFmpeg developers
  Built with gcc 12 (Debian 12.2.0-14)
  Configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-static --disable-libxcb --disable-sdl2 --disable-xlib --enable-lto --enable-gpl --enable-version3 --enable-shared --enable-gmp --enable-gnutls --enable-chromaprint --enable-opencl --enable-libdrm --enable-libass --enable-libfreetype --enable-libfribidi --enable-libfontconfig --enable-libbluray --enable-libmp3lame --enable-libopus --enable-libtheora --enable-libvorbis --enable-libopenmpt --enable-libdav1d --enable-libsvtav1 --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libfdk-aac --arch=amd64 --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-vaapi --enable-amf --enable-libvpl --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x36da0190180] Invalid sample size -1008
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x36da0190180] Error reading header
upload/upload/98a61f7d-3957-465a-b946-79697c36d3b1/59/5d/595d905e-3bde-49ef-a699-17b677622b3c.mp4: Invalid data found when processing input

[Nest] 7  - 05/13/2024, 11:35:07 AM   ERROR [JobService] Error: ffprobe exited with code 1
ffprobe version 6.0.1-Jellyfin Copyright (c) 2007-2023 the FFmpeg developers
  Built with gcc 12 (Debian 12.2.0-14)
  Configuration: --prefix=/usr/lib/jellyfin-ffmpeg --target-os=linux --extra-version=Jellyfin --disable-doc --disable-ffplay --disable-ptx-compression --disable-static --disable-libxcb --disable-sdl2 --disable-xlib --enable-lto --enable-gpl --enable-version3 --enable-shared --enable-gmp --enable-gnutls --enable-chromaprint --enable-opencl --enable-libdrm --enable-libass --enable-libfreetype --enable-libfribidi --enable-libfontconfig --enable-libbluray --enable-libmp3lame --enable-libopus --enable-libtheora --enable-libvorbis --enable-libopenmpt --enable-libdav1d --enable-libsvtav1 --enable-libwebp --enable-libvpx --enable-libx264 --enable-libx265 --enable-libzvbi --enable-libzimg --enable-libfdk-aac --arch=amd64 --enable-libshaderc --enable-libplacebo --enable-vulkan --enable-vaapi --enable-amf --enable-libvpl --enable-ffnvcodec --enable-cuda --enable-cuda-llvm --enable-cuvid --enable-nvdec --enable-nvenc
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x36da0190180] Invalid sample size -1008
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x36da0190180] Error reading header
upload/upload/98a61f7d-3957-465a-b946-79697c36d3b1/59/5d/595d905e-3bde-49ef-a699-17b677622b3c.mp4: Invalid data found when processing input

    at ChildProcess.<anonymous> (/usr/src/app/node_modules/fluent-ffmpeg/lib/ffprobe.js:233:22)
    at ChildProcess.emit (node:events:518:28)
    at ChildProcess._handle.onexit (node:internal/child_process:294:12)
[Nest] 7  - 05/13/2024, 11:35:07 AM   ERROR [JobService] Affected Job ID:
{
  "id": "ee18e5e8-5430-4fcb-aa10-ef4170e2b30e"
}

Limiting the container's resource usage has indeed helped: it prevents my server from crashing. However, the microservices logs show frequent restarts, meaning the service really is exceeding the memory limit. Even a 4GB limit is insufficient.

Here's the relevant part of my Docker Compose configuration:

  immich-microservices:
    container_name: immich_microservices
    image: altran1502/immich-server:${IMMICH_VERSION:-release}
    command: ['start.sh', 'microservices']
    volumes:
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - "/root/sharedfolder/syncthing/Photo_Album:/mnt/media/Photo_Album:ro"
      - /etc/localtime:/etc/localtime:ro
    devices:
      - /dev/dri:/dev/dri
    env_file:
      - .env
    depends_on:
      - redis
      - database
    restart: always
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
The logs below show the container restarting roughly every 13 seconds:

[Nest] 7  - 05/13/2024, 11:35:19 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:35:33 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:35:45 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:35:58 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:36:11 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:36:53 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:37:06 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:37:19 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:37:34 AM     LOG [EventRepository] Initialized websocket server
[Nest] 7  - 05/13/2024, 11:37:47 AM     LOG [EventRepository] Initialized websocket server

I'm still hoping to identify and resolve the root cause entirely. If more information is needed, please let me know.

On a side note, how can I persist the Immich logs? The log path isn't mentioned in the user documentation, so I haven't mapped any volumes, and the logs are lost when the container restarts.

Thank you very much!
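On the logging question: the Immich containers write to stdout/stderr, so the logs live in Docker's logging driver rather than in a file inside the container. A hedged sketch of two ways to keep them around (the file paths are illustrative):

```shell
# One-off: dump everything the container has logged so far to a file
docker logs immich_microservices > /var/log/immich_microservices.log 2>&1

# Continuous: follow the log stream and append to disk in the background
docker logs --follow --timestamps immich_microservices \
  >> /var/log/immich_microservices.log 2>&1 &
```

Compose also accepts a per-service `logging:` key (for example the default `json-file` driver with `max-size`/`max-file` options) to rotate and persist logs without an external process.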

@ceebu

ceebu commented May 17, 2024

I have the same problem (i5 8th gen server, running OMV 6). I am also looking for an easy way to save the Immich logs.

@RazerProof

RazerProof commented May 22, 2024

[screenshot]
I can also confirm this issue affects the installation on my NAS. Limiting memory causes frequent restarts of the microservices container, and setting the Generate Thumbnails concurrency to 1 didn't help either. The NAS started over-utilising the drives (100% utilisation), presumably paging memory out to disk, while CPU utilisation sat around 40%. Increasing the NAS RAM to 32GB has solved the problem for me: the microservices container now runs constantly, with no restarting at all. It stabilised at around 28GB, almost all of which the NAS resource monitor reports as cache memory, even though the container limit is set to 4GB.
After a container restart it takes about 15-20 minutes to stabilise at this level. Disk usage is now around 40%, CPU around 80%, and thumbnails generate at roughly 30/min, with no other jobs running.
I have around 230k assets in the external library, 50% JPG and 50% RAW. I also tried this in Docker on a server (TrueNAS Scale) and experienced the same issue and outcome.
Overall, I love the way Immich is going. It is going to be a great product. Well done to everyone involved, and thanks for the great work.

@mertalev
Contributor

Thanks for the detailed info!

  • How many of those assets are RAW, if any?
  • Do you have /tmp configured to be in-memory?
  • Is the increase completely linear, or are there spikes?
  • Any errors in the logs?
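For the /tmp question, one way to check whether a path is in-memory is to look it up in /proc/mounts. A small sketch (`is_tmpfs` is a hypothetical helper written for this thread, not part of Immich):

```shell
# Return success if the given mount point is backed by tmpfs (i.e. in-memory).
is_tmpfs() {
  awk -v p="$1" '$2 == p && $3 == "tmpfs" { found = 1 } END { exit !found }' /proc/mounts
}

if is_tmpfs /tmp; then
  echo "/tmp is in-memory (tmpfs): large intermediate files count against RAM"
else
  echo "/tmp is disk-backed"
fi
```

This matters because an in-memory /tmp makes large intermediate files (e.g. extracted RAW previews) look like process memory growth.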

@RazerProof

RazerProof commented May 22, 2024

How many of those assets are RAW, if any? 50% of the images are RAW, around 50GB each.
Do you have /tmp configured to be in-memory? No, /tmp is not configured in-memory.
Is the increase completely linear, or are there spikes? Mostly linear, overlaid with a small sawtooth shape.
[memory-usage graph screenshots]

I just restarted the container; that smooth section is unusual, and it will go back to the sawtooth.
Any errors in the logs? No errors, it's running smoothly.

@RazerProof

Below is an example of the errors I was receiving on the NAS when 4GB was installed and getting constant restarts of the microservices container.
[screenshot of the errors]

@mertalev
Contributor

I made a test image for microservices with a possible fix: ghcr.io/immich-app/immich-server:pr-9665. Would you be able to change your image to that and see if it affects RAM usage?

@RazerProof

Not good news, I'm afraid: the immich_microservices container fails.

immich_microservices
date stream content
2024/05/23 01:09:52 stderr Microservices worker exited with code 1

2024/05/23 01:09:52 stderr }

2024/05/23 01:09:52 stderr routine: 'parserOpenTable'

2024/05/23 01:09:52 stderr line: '1381',
2024/05/23 01:09:52 stderr file: 'parse_relation.c',
2024/05/23 01:09:52 stderr constraint: undefined,
2024/05/23 01:09:52 stderr dataType: undefined,
2024/05/23 01:09:52 stderr column: undefined,
2024/05/23 01:09:52 stderr table: undefined,
2024/05/23 01:09:52 stderr schema: undefined,
2024/05/23 01:09:52 stderr where: undefined,
2024/05/23 01:09:52 stderr internalQuery: undefined,
2024/05/23 01:09:52 stderr internalPosition: undefined,
2024/05/23 01:09:52 stderr position: '128',
2024/05/23 01:09:52 stderr hint: undefined,
2024/05/23 01:09:52 stderr detail: undefined,
2024/05/23 01:09:52 stderr code: '42P01',
2024/05/23 01:09:52 stderr severity: 'ERROR',
2024/05/23 01:09:52 stderr length: 113,
2024/05/23 01:09:52 stderr },
2024/05/23 01:09:52 stderr routine: 'parserOpenTable'

2024/05/23 01:09:52 stderr line: '1381',
2024/05/23 01:09:52 stderr file: 'parse_relation.c',
2024/05/23 01:09:52 stderr constraint: undefined,
2024/05/23 01:09:52 stderr dataType: undefined,
2024/05/23 01:09:52 stderr column: undefined,
2024/05/23 01:09:52 stderr table: undefined,
2024/05/23 01:09:52 stderr schema: undefined,
2024/05/23 01:09:52 stderr where: undefined,
2024/05/23 01:09:52 stderr internalQuery: undefined,
2024/05/23 01:09:52 stderr internalPosition: undefined,
2024/05/23 01:09:52 stderr position: '128',
2024/05/23 01:09:52 stderr hint: undefined,
2024/05/23 01:09:52 stderr detail: undefined,
2024/05/23 01:09:52 stderr code: '42P01',
2024/05/23 01:09:52 stderr severity: 'ERROR',
2024/05/23 01:09:52 stderr length: 113,

@mertalev
Contributor

Hmm, that error is about connecting to Postgres, not related to thumbnail generation.

@RazerProof

Yeah... Something may have gone wrong with the deployment. Let me do some more testing, and maybe get some sleep :-).

@mertalev
Contributor

So sorry! I based that branch off of the latest release, but it turns out that main gets merged into it anyway when the image is built. The error is probably because of that.

You can either wait for the next release to get things back up or restore from a backup. (It's also possible to mess with it more to get it back up, but I think these options are safer.)

@RazerProof

Thanks for letting me know; I can confirm I am getting the same result.
[memory-usage screenshot]
It is working and there are no errors in the log; it's just hungry :-)
I will wait for the next release and re-test.
Thanks for all your great work.
