Collection with 60 points has huge size and write ahead log #4159

Open
volkerstampa opened this issue May 2, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@volkerstampa

Current Behavior

We have a collection where Qdrant shows unexpected behavior for no obvious reason:

The following snippets use Qdrant version 1.9.0, build b99d507, started through docker-compose:

services:
  qdrant:
    container_name: "document-index-qdrant"
    image: "qdrant/qdrant"
    ports:
      - "6333:6333" # REST
      - "6334:6334" # gRPC

This is the collection:

$ curl http://localhost:6333/collections | jq
{
  "result": {
    "collections": [
      {
        "name": "demo"
      }
    ]
  },
  "status": "ok",
  "time": 0.000010792
}

These are pretty much the default settings. However, it is already unclear why the collection needs 8 segments and why indexed_vectors_count is 0:

$ curl http://localhost:6333/collections/demo | jq
{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 60,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 128,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null
    },
    "payload_schema": {}
  },
  "status": "ok",
  "time": 0.000218042
}

The 60 points, each with a small payload (same structure for all points), can be confirmed:

$ curl -X POST http://localhost:6333/collections/demo/points/scroll -H 'Content-Type: application/json' -d '{"with_vector": false, "limit": 61}' | jq
{
  "result": {
    "points": [
      {
        "id": 40524188963831808,
        "payload": {
          "chunk_range": {
            "from": {
              "TextReference": {
                "byte_offset": 0,
                "item_index": 0
              }
            },
            "to": {
              "TextReference": {
                "byte_offset": 1790,
                "item_index": 0
              }
            }
          },
          "document_id": 9435273,
          "embedding_version": 0,
          "metatdata_version": "V1",
          "modalities": [
            "text"
          ]
        },
        "vector": null
      },
      ...
      {
        "id": 40524442366902272,
        "payload": {
          "chunk_range": {
            "from": {
              "TextReference": {
                "byte_offset": 0,
                "item_index": 0
              }
            },
            "to": {
              "TextReference": {
                "byte_offset": 415,
                "item_index": 0
              }
            }
          },
          "document_id": 9435332,
          "embedding_version": 0,
          "metatdata_version": "V1",
          "modalities": [
            "text"
          ]
        },
        "vector": null
      }
    ],
    "next_page_offset": null
  },
  "status": "ok",
  "time": 0.002730958
}

Nonetheless the snapshot has a size of 1.6 GB:

$ curl -X POST http://localhost:6333/collections/demo/snapshots | jq
{
  "result": {
    "name": "demo-5104423205986989-2024-05-02-10-38-23.snapshot",
    "creation_time": "2024-05-02T10:38:24",
    "size": 1745611264,
    "checksum": "f5562887c71f97351bc2bde7f672b55afc3fb66c201be2b91c4ec7cd53742a40"
  },
  "status": "ok",
  "time": 5.427237377
}

$ curl http://localhost:6333/collections/demo/snapshots/demo-5104423205986989-2024-05-02-10-38-23.snapshot -o demo.snapshot

$ ls -lah demo.snapshot 
-rw-r--r--  1 volker.stampa  staff   1.6G May  2 12:39 demo.snapshot

And restarting Qdrant shows a write-ahead log of significant size:

$ docker restart document-index-qdrant

$ docker logs --tail 10 --follow document-index-qdrant
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.9.0, build: b99d5074
Access web UI at http://localhost:6333/dashboard

2024-05-02T10:54:08.195923Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2024-05-02T10:54:08.206008Z  INFO storage::content_manager::toc: Loading collection: demo    
2024-05-02T10:54:09.120426Z  INFO collection::shards::local_shard: Recovering collection demo: 0/761424 (0%)    
2024-05-02T10:55:09.122449Z  INFO collection::shards::local_shard: 40177/761424 (5%)    
2024-05-02T10:56:09.124250Z  INFO collection::shards::local_shard: 79449/761424 (10%)    
2024-05-02T10:57:09.124659Z  INFO collection::shards::local_shard: 118434/761424 (15%)    
2024-05-02T10:58:09.113853Z  INFO collection::shards::local_shard: 157517/761424 (20%)    
2024-05-02T10:59:09.101792Z  INFO collection::shards::local_shard: 196525/761424 (25%)    
2024-05-02T11:00:09.101910Z  INFO collection::shards::local_shard: 235484/761424 (30%)    
2024-05-02T11:01:09.103851Z  INFO collection::shards::local_shard: 274200/761424 (36%)    
2024-05-02T11:02:09.104881Z  INFO collection::shards::local_shard: 312838/761424 (41%)    
2024-05-02T11:03:09.105430Z  INFO collection::shards::local_shard: 351533/761424 (46%)    
2024-05-02T11:04:09.105441Z  INFO collection::shards::local_shard: 390248/761424 (51%)    
2024-05-02T11:05:09.105633Z  INFO collection::shards::local_shard: 428793/761424 (56%)    
2024-05-02T11:06:09.106670Z  INFO collection::shards::local_shard: 467340/761424 (61%)    
2024-05-02T11:07:09.106725Z  INFO collection::shards::local_shard: 506002/761424 (66%)    
2024-05-02T11:08:09.106169Z  INFO collection::shards::local_shard: 544867/761424 (71%)    
2024-05-02T11:09:09.108001Z  INFO collection::shards::local_shard: 583663/761424 (76%)    
2024-05-02T11:10:09.109251Z  INFO collection::shards::local_shard: 622434/761424 (81%)    
2024-05-02T11:11:09.109702Z  INFO collection::shards::local_shard: 661062/761424 (86%)    
2024-05-02T11:12:09.109698Z  INFO collection::shards::local_shard: 699765/761424 (91%)    
2024-05-02T11:13:09.111093Z  INFO collection::shards::local_shard: 738410/761424 (96%)    
2024-05-02T11:13:44.819426Z  INFO collection::shards::local_shard: Recovered collection demo: 761424/761424 (100%)    

Steps to Reproduce

Unfortunately I have no idea how to reproduce such a collection. I could provide the snapshot (export/import does not change the behavior), but it's 1.6 GB.

Expected Behavior

  • a small snapshot (~64 MB)
  • fast startup with no write-ahead log to recover

Context (Environment)

The main problem with this collection is that restarting Qdrant takes forever, and during the restart Qdrant is unavailable.

@volkerstampa volkerstampa added the bug Something isn't working label May 2, 2024
@timvisee
Member

timvisee commented May 2, 2024

It seems to have collected an enormous amount of operations in the WAL, which would explain the large size.

Could you elaborate on what operations you've been using lately?

A delete by filter operation is known to cause this. But I don't think you've used 76k of them in sequence, or did you?

@volkerstampa
Author

volkerstampa commented May 2, 2024

Thanks for the really quick response :-)

We do use client.delete_points with a PointsSelectorOneOf containing a filter on the document_id values that are part of the payload. As far as I can see, we only did ~80 updates/deletes in total on the 60 points, but I am not 100% sure about that.
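
For reference, the equivalent delete-by-filter call via the REST API should look roughly like this (using one of the document_id values from the payload above as an example; our actual calls go through the client library):

$ curl -X POST http://localhost:6333/collections/demo/points/delete \
    -H 'Content-Type: application/json' \
    -d '{"filter": {"must": [{"key": "document_id", "match": {"value": 9435273}}]}}'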

Even if we did 76k of them in sequence, is there a chance to get rid of the WAL somehow, or is this just something that must be avoided?

@timvisee
Member

timvisee commented May 2, 2024

Even if we did 76k of them in sequence, is there a chance to get rid of the WAL somehow, or is this just something that must be avoided?

Any update operation affecting points should resolve that (insert or update).
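
For example, a single throwaway upsert through the REST API should do (the id and the constant vector below are just placeholders; the vector only has to match the collection's 128 dimensions, and you can delete the point again afterwards):

# build a dummy 128-dimensional vector with jq, then upsert it as point 0
$ VECTOR=$(jq -cn '[range(128) | 0.1]')
$ curl -X PUT 'http://localhost:6333/collections/demo/points?wait=true' \
    -H 'Content-Type: application/json' \
    -d "{\"points\": [{\"id\": 0, \"vector\": ${VECTOR}}]}"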

Could you verify whether you indeed sent just 80 operations? If you did, something else is going on.

During this whole process, did you see any errors or weird output in the logs? There may be an additional hint in there.

Based on the above, I do assume that you did not make any configuration changes.

Of course, I was just guessing here. It may be a different issue.

@timvisee timvisee closed this as completed May 6, 2024
@volkerstampa
Author

Sorry for the long delay, but I was distracted. I checked our logs and this is what I figured out:

We had 81 inserts on this collection. Unfortunately, I do not know exactly how many delete operations there were, but in the end we have just 60 points in the collection, so somewhere between 1 and 21 I guess. If I am not mistaken, there were no updates.

I also checked the Qdrant logs and there were basically only logs about creating collections, but no errors, warnings, or the like.

The only thing that might be special is that the collection was created and deleted a couple of times.

Next I will check if I can get rid of the WAL by inserting/updating anything.

@volkerstampa
Author

@timvisee Could we re-open this issue please?

@timvisee timvisee reopened this May 7, 2024
@timvisee
Member

timvisee commented May 7, 2024

The only thing that might be special is that the collection was created and deleted a couple of times.

That shouldn't be an issue. Recreating a collection basically lets you start from scratch, wiping all previous WAL data for that collection.

@volkerstampa
Author

Verified that inserting a point into this collection removed the WAL. After that, Qdrant starts up very quickly and the log indicates a WAL size of "1". Creating a snapshot is also very fast. It still uses 96 MB, but that is way less than the previous 1.6 GB. So for me this fixes the issue at hand. The question that remains is how Qdrant got into this state. I could provide you with the 1.6 GB snapshot file if you are interested, or try to analyze it on my side if you could give me hints on how to do this.

@timvisee
Member

timvisee commented May 16, 2024

I could provide you with the 1.6GB snapshot file

Yes please, that would be helpful if you still have it available.
