Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible behavior mismatch: Immich resets vector length in smart_search database when model is changed via UI, but does not perform this check if model is specified via config file #9398

Open
1 of 3 tasks
teamdest opened this issue May 11, 2024 · 7 comments

Comments

@teamdest
Copy link

teamdest commented May 11, 2024

The bug

When updating the CLIP model, Immich Server is expected to update the database table "smart_search" to set the data type of the "embedding" column to a vector width which matches the CLIP model selected.

I believe that currently this behavior is only being performed when that change is made IN THE WEB UI, and is not performed at system startup or during the config-file parsing, even if the CLIP model was changed.

It is possible that this behavioral issue is restricted to environments where there is no persistence (e.g. deployed as a pod in Kubernetes without an underlying persistent volume), and does not appear if the immich-server has a stable data source beneath it. I believe that performing this check on startup when using a config file would still be a more robust behavior.

The OS that Immich Server is running on

k8s on Ubuntu 22.04

Version of Immich Server

v1.103.1

Version of Immich Mobile App

N/A

Platform with the issue

  • Server
  • Web
  • Mobile

Your docker-compose.yml content

Deployed from k8s yaml files

Your .env content

Deployed from k8s yaml files

Reproduction steps

1. Deploy immich, using a configuration file pointed to by the "IMMICH_CONFIG_FILE" environment variable
2. Change the "ModelName" configuration parameter in the config file to one which uses a different vector width (e.g. 1024 instead of the default 512)
3. Run a "Smart Search" job on All assets

Relevant log output

immich-microservices-67fbc95c97-9nzg7 microservices [Nest] 1  - 05/11/2024, 1:26:32 PM   ERROR [JobService] Unable to run job handler (smartSearch/smart-search): QueryFailedError: pgvecto.rs: The given vector is invalid for input.
immich-microservices-67fbc95c97-9nzg7 microservices ADVICE: Check if dimensions and scalar type of the vector is matched with the index.
immich-microservices-67fbc95c97-9nzg7 microservices [Nest] 1  - 05/11/2024, 1:26:32 PM   ERROR [JobService] QueryFailedError: pgvecto.rs: The given vector is invalid for input.
immich-microservices-67fbc95c97-9nzg7 microservices ADVICE: Check if dimensions and scalar type of the vector is matched with the index.
immich-microservices-67fbc95c97-9nzg7 microservices     at PostgresQueryRunner.query (/usr/src/app/node_modules/typeorm/driver/postgres/PostgresQueryRunner.js:219:19)
immich-microservices-67fbc95c97-9nzg7 microservices     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
immich-microservices-67fbc95c97-9nzg7 microservices     at async InsertQueryBuilder.execute (/usr/src/app/node_modules/typeorm/query-builder/InsertQueryBuilder.js:106:33)
immich-microservices-67fbc95c97-9nzg7 microservices     at async SearchRepository.upsert (/usr/src/app/dist/repositories/search.repository.js:188:9)
immich-microservices-67fbc95c97-9nzg7 microservices     at async SmartInfoService.handleEncodeClip (/usr/src/app/dist/services/smart-info.service.js:90:9)
immich-microservices-67fbc95c97-9nzg7 microservices     at async /usr/src/app/dist/services/job.service.js:149:36
immich-microservices-67fbc95c97-9nzg7 microservices     at async Worker.processJob (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:394:28)
immich-microservices-67fbc95c97-9nzg7 microservices     at async Worker.retryIfFailed (/usr/src/app/node_modules/bullmq/dist/cjs/classes/worker.js:581:24)
immich-microservices-67fbc95c97-9nzg7 microservices [Nest] 1  - 05/11/2024, 1:26:32 PM   ERROR [JobService] Object:
immich-microservices-67fbc95c97-9nzg7 microservices {
immich-microservices-67fbc95c97-9nzg7 microservices   "id": "303f6f4b-da13-49fb-8b14-d9effc7a8506"
immich-microservices-67fbc95c97-9nzg7 microservices }

Additional information

Background

I stood up a new immich instance using Kubernetes.

In doing so, I initially brought the server pod up without a config file specified.

I changed several config variables and exported the JSON. While doing this, I did NOT change the "Clip Model" under Machine Learning Settings.

I added the exported JSON to a K8s ConfigMap and updated the Pod to mount that ConfigMap as a directory.

I updated the pod environment (for Server and Microservices) to include the "IMMICH_CONFIG_FILE" variable pointing to the mounted config file, and restarted the pods so they would use that config file. At this point the server came up and behaved normally, and exhibited no errors in the logs of any component.

I verified that the Server was indeed using the config file, as the "Settings" page of Administration stated that and locked any changes from being made.

I then edited the config file definition to change the Clip model to "ViT-g-14__laion2b-s12b-b42k".

I applied this change to the k8s configmap and restarted the server instance. it came up correctly, and showed the new model name in the "Settings" of the web administration tool. No errors or unusual messages were present in the log at this time.

Error detected

I then ran the "smart search" job for All assets (since I had changed the clip model). The "waiting" count did not go down, and error logs from the microservices pod indicated "the given vector is invalid for input" (see logs section).

Research

I found this discussion indicating a similar issue: #7425

The error seems to be directly related to a mismatch between the vector size of the "embedding" column in the smart_search table of the database vs. the vector size the CLIP model requires.

Resolving the issue requires making a change to the data type of the column (setting the vector size correctly), and rebuilding some indexes on the table.

A commenter in that discussion stated that the immich server should be performing these database changes when it detects that the size of the model has changed, and included two server log lines indicating such:

immich_server              | [Nest] 7  - 04/27/2024, 4:47:02 PM     LOG [SearchRepository] Dimension size of model ViT-L-14-336__openai is 768, but database expects 512.
immich_server              | [Nest] 7  - 04/27/2024, 4:47:02 PM     LOG [SearchRepository] Updating database CLIP dimension size to 768.

These lines never appeared in my server logs.

I then disabled the IMMICH_CONFIG_FILE environment variable and restarted the immich server pod. This reset the settings to defaults, including the CLIP model.

I then changed the model IN THE ADMIN WEB UI to the desired one and clicked save. The server immediately performed the expected database update.

I re-enabled the IMMICH_CONFIG_FILE environment variable and restarted the pod again. I was then able to perform a smart-search reindex using the desired CLIP model. successfully.

It's possible that this behavior/issue is limited to cases where a config file is being mounted into a pod via a Kubernetes ConfigMap, or cases where there is no persistence for immich-server except the config file.

@bo0tzz
Copy link
Member

bo0tzz commented May 11, 2024

That sounds correct, and would explain some issues we've seen. @mertalev thoughts?

@mertalev
Copy link
Contributor

It does (or should I guess) check if it should update the smart search table at startup. Specifically, it checks if the dimensions of the current model match the dimensions in the table and updates the table if not. It's possible there's a bug somewhere in that logic, though.

@teamdest
Copy link
Author

Apologies if I'm completely off on the wrong path, I don't know typescript at all or Javascript basically at all, but in the codebase I'm seeing the following control flow:

(inside the server section of the repo, and converting from the JS files on my drive to the TS files in the repo)

start.sh runs node on main -> src/main.js runs nestFactory.create() on apiModule -> src/app.module.ts constructor of ApiModule calls the init() of an ApiService -> src/services/api.service.ts calls init() on a databaseService -> src/services/database.service.ts

At this point, the following seems to occur:

  • if the postgres version is below the minimum major version, the code throws an error (14 is hardcoded as the minimum earlier in the file)
  • Attempts to lock the database with a lock of 'Databaselock.Migrations`
    • if this fails for some reason, nothing happens
  • If the lock succeeds:
    • Runs createVectorExtension() -> src/repositories/database.repository.ts runs a query of CREATE EXTENSION IF NOT EXISTS ${extension} where ${extension} is the preferred extension of vectors and vector. if this fails, an error is thrown
    • Runs updateVectorExtension() -> src/repositories/database.repository.ts
      • Compares the vector extension version to the expected one
      • If the version is outdated and the extension is vectors
        • Runs updateVectorSchema if the version is outdated
      • Runs a query setting the version of the extension to the expected one
      • if the version was not outdated, at this point it returns (note: it does not return an explicit value and I do not know what the default is in typescript?)
      • if the version WAS outdated
        • if the extension is vectors, run SELECT pgvectors_upgrade() and set "restartRequired" to TRUE
        • if the extension IS NOT vectors, run a reindex of FACE and CLIP indexes
          • At this point in the codepath, it seems to be updating the 'embeddings' column with a new dimSize
          • to get here, the extension would have to be "vector" (not "vectors") and an update had to happen
    • Runs assertVectorExtension() which just throws an error if the extension version is null, presumably as a final check that the above at least ~somewhat happened correctly.

Taking a brief break from bullet points to refocus: at this point we're at Line 36 of the database.service.ts file. At this point, the following happens:

  • shouldReindex is checked for each of FACE and CLIP -> src/repositories/database.repository.ts
    • if the extension isn't "vectors" it returns false immediately
    • if the extension IS "vectors", it runs SELECT idx_status FROM pg_vector_index_stat WHERE indexname for either FACE or CLIP
      • If the idx_status is "UPGRADE", it returns TRUE
      • if the idx_status is anything else, it returns FALSE
      • on error, if the response includes the words 'index is not existing', it returns TRUE
      • on error if the response includes the words 'relation "pg_vector_index_stat" does not exist', it returns FALSE
      • otherwise it throws the error
      • NOTE: this query returns NORMAL in my database for both items, so it would return FALSE

I am probably missing some sort of invalid state detail here. Maybe it's impossible to have it get to the end without it running this reindex at least once? But as I'm looking at it, it only gets to the point of running the reindex code (which is the code that does the dimSize column update) when an update to the vector extension was needed, which presumably never occurred here because this was a new install of an (idempotently-deployed) current version of the pgvecto.rs tensorchord container image.

@mertalev
Copy link
Contributor

The check happens in the search repository's init, referenced in the smart info service.

@mertalev
Copy link
Contributor

Oh, I guess the init in the smart info service isn't getting called anywhere haha.

@teamdest
Copy link
Author

Ah, so it’s just not calling that unless you change the model while running?

@mertalev
Copy link
Contributor

Yup, so it only updates on a config change event in practice. Making it run at startup should fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants