[Tracking-issue] Settings diff-indexing #4493

Open
1 task
ManyTheFish opened this issue Mar 14, 2024 · 1 comment
Labels
  • performance: Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
  • settings diff-indexing: Issues related to settings diff-indexing
  • tracking issue: Tracks development of a global issue

Comments

@ManyTheFish
Member

ManyTheFish commented Mar 14, 2024

Related product team resources: PRD (internal only)

Motivation

The indexing process is split into several parts, extracting different data from the documents depending on the settings.
Each extraction can be linked precisely to one or several settings, but Meilisearch v1.7 and older versions reindex everything on a settings change, as if indexing from scratch.
So, even if the original documents have not changed, Meilisearch deletes them and rewrites the same data under different internal IDs.
Changing settings should only reindex the data related to the changes.
This issue tracks all the issues related to settings diff-indexing.
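
To illustrate the idea of linking extractions to settings, here is a minimal sketch, not Meilisearch's actual code. The `Settings` struct, the `Extraction` enum, and the mapping between them are all hypothetical and chosen only to show the shape of a settings diff: compare the old and new settings, and return only the extraction steps whose inputs changed.

```rust
use std::collections::BTreeSet;

/// Hypothetical subset of index settings (not the real Meilisearch type).
#[derive(Clone, Default, PartialEq)]
pub struct Settings {
    pub searchable: Vec<String>,
    pub filterable: Vec<String>,
    pub sortable: Vec<String>,
    pub stop_words: BTreeSet<String>,
}

/// Hypothetical extraction steps, each depending on some of the settings.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Extraction {
    WordPositions,
    FacetValues,
}

/// Returns only the extractions whose related settings differ.
/// The setting-to-extraction mapping below is an assumption for illustration.
pub fn extractions_to_rerun(old: &Settings, new: &Settings) -> BTreeSet<Extraction> {
    let mut steps = BTreeSet::new();
    if old.searchable != new.searchable || old.stop_words != new.stop_words {
        steps.insert(Extraction::WordPositions);
    }
    if old.filterable != new.filterable || old.sortable != new.sortable {
        steps.insert(Extraction::FacetValues);
    }
    steps
}
```

With this shape, identical settings yield an empty set, so no document has to be deleted or rewritten.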

TODO

P1:

P2:

P3:

Impacted teams

  • @meilisearch/docs-team: no API changes, but once this issue is solved, changing the settings will no longer trigger a full reindexing, so we may remove any mention of it.
ManyTheFish added the tracking issue, performance, and settings diff-indexing labels Mar 14, 2024
@dureuill
Contributor

Hello, some information on the specific case of embedders.

  1. Changes to one embedder should not trigger the reindexing of all embedders. Only the modified embedder should be reindexed.
    • When an embedder is deleted, the code currently benefits from the "reindex everything" behavior: the indexes associated with the embedders that come after the deleted one are shifted so as not to leave a "hole" in the list of embedders. A free list is required to support incremental deletion of an embedder.
  2. Currently, changes to searchable, displayable, sortable, filterable, etc. do not need to trigger a reindexing. In the future, the properties of a field might be accessible to the document template, making reindexing mandatory in this case.
  3. A change to the apiKey (for OpenAI models) should not trigger a reindexing operation of the modified embedder.
  4. When modifying the documentTemplate, a reindexing is necessary, but it could be kept minimal by comparing the rendered versions of the old and the new template, and only regenerating embeddings for documents whose rendered version actually changed.
  5. Future extensions:
    1. distributionShift: should not trigger a reindexing operation.
    2. distance: needs reindexing
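
The rules in this list can be sketched as a per-embedder decision function. This is only an illustration under stated assumptions: `EmbedderConfig`, `Reindex`, `reindex_for`, `plan`, and `changed_renderings` are hypothetical names, the config holds just three of the fields discussed above, and the template rendering is passed in as a closure rather than using any real template engine.

```rust
use std::collections::BTreeMap;

/// Hypothetical embedder configuration with only the fields discussed above.
#[derive(Clone, Debug, PartialEq)]
pub struct EmbedderConfig {
    pub api_key: Option<String>,
    pub document_template: String,
    pub distance: String,
}

/// What a settings change requires for one embedder.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Reindex {
    /// e.g. only the API key changed.
    Nothing,
    /// Template changed: re-embed only documents whose rendering differs.
    ChangedRenderings,
    /// Full reindexing of this embedder (new embedder, or distance changed).
    Everything,
}

pub fn reindex_for(old: &EmbedderConfig, new: &EmbedderConfig) -> Reindex {
    if old.distance != new.distance {
        Reindex::Everything
    } else if old.document_template != new.document_template {
        Reindex::ChangedRenderings
    } else {
        // Only `api_key` can differ here, which needs no reindexing.
        Reindex::Nothing
    }
}

/// Decides per embedder, so changing one embedder never affects the others.
pub fn plan(
    old: &BTreeMap<String, EmbedderConfig>,
    new: &BTreeMap<String, EmbedderConfig>,
) -> BTreeMap<String, Reindex> {
    new.iter()
        .map(|(name, cfg)| {
            let action = match old.get(name) {
                Some(prev) => reindex_for(prev, cfg),
                None => Reindex::Everything,
            };
            (name.clone(), action)
        })
        .collect()
}

/// Positions of the documents whose rendered text differs between templates.
pub fn changed_renderings<D>(
    docs: &[D],
    render_old: impl Fn(&D) -> String,
    render_new: impl Fn(&D) -> String,
) -> Vec<usize> {
    let mut changed = Vec::new();
    for (i, doc) in docs.iter().enumerate() {
        if render_old(doc) != render_new(doc) {
            changed.push(i);
        }
    }
    changed
}
```

Comparing rendered text rather than template source means a template edit that produces the same output for a document costs no new embedding for that document.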

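The free list mentioned for embedder deletion could look like the following minimal sketch. `EmbedderSlots` is a hypothetical type, not the structure used in Meilisearch: each embedder keeps a stable slot index, deleting an embedder leaves its slot empty and records it as free, and the next insertion reuses a free slot instead of shifting the embedders that come after it.

```rust
/// Hypothetical slot allocator for embedders, backed by a free list.
#[derive(Debug, Default)]
pub struct EmbedderSlots {
    /// Slot index -> embedder name; `None` marks a hole left by a deletion.
    slots: Vec<Option<String>>,
    /// Indexes of vacated slots, reused before the list grows.
    free: Vec<usize>,
}

impl EmbedderSlots {
    /// Assigns a slot to `name`, reusing a freed slot when one exists.
    pub fn insert(&mut self, name: &str) -> usize {
        if let Some(slot) = self.free.pop() {
            self.slots[slot] = Some(name.to_string());
            slot
        } else {
            self.slots.push(Some(name.to_string()));
            self.slots.len() - 1
        }
    }

    /// Frees the slot of `name` without moving any other embedder.
    pub fn remove(&mut self, name: &str) -> Option<usize> {
        let slot = self.slot_of(name)?;
        self.slots[slot] = None;
        self.free.push(slot);
        Some(slot)
    }

    /// Looks up the slot currently assigned to `name`.
    pub fn slot_of(&self, name: &str) -> Option<usize> {
        self.slots.iter().position(|s| s.as_deref() == Some(name))
    }
}
```

Because the other embedders keep their slot indexes, data stored under those indexes would not need to be rewritten when one embedder is removed.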
meili-bors bot added a commit that referenced this issue May 13, 2024
4624: Add "precommands" to benchmark r=dureuill a=dureuill

# Pull Request

## Related issue
Helps for #4493

## What does this PR do?
- Add support for precommands for cargo xtask bench
- update benchmark docs
- update workload files


Co-authored-by: Louis Dureuil <louis@meilisearch.com>