Collection with 60 points has huge size and write ahead log #4159

Open
volkerstampa opened this issue May 2, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@volkerstampa

Current Behavior

We have a collection where Qdrant shows unexpected behavior for no obvious reason:

The following snippets use Qdrant version 1.9.0, build b99d507, started through docker-compose:

services:
  qdrant:
    container_name: "document-index-qdrant"
    image: "qdrant/qdrant"
    ports:
      - "6333:6333" # REST
      - "6334:6334" # gRPC

This is the collection:

$ curl http://localhost:6333/collections | jq
{
  "result": {
    "collections": [
      {
        "name": "demo"
      }
    ]
  },
  "status": "ok",
  "time": 0.000010792
}

These are pretty much the default settings. However, it is already unclear why the collection needs 8 segments and why indexed_vectors_count is 0:

$ curl http://localhost:6333/collections/demo | jq
{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "indexed_vectors_count": 0,
    "points_count": 60,
    "segments_count": 8,
    "config": {
      "params": {
        "vectors": {
          "size": 128,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": null
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": null
    },
    "payload_schema": {}
  },
  "status": "ok",
  "time": 0.000218042
}

The 60 points, each with a small payload (same structure for all points), can be confirmed:

$ curl -X POST http://localhost:6333/collections/demo/points/scroll -H 'Content-Type: application/json' -d '{"with_vector": false, "limit": 61}' | jq
{
  "result": {
    "points": [
      {
        "id": 40524188963831808,
        "payload": {
          "chunk_range": {
            "from": {
              "TextReference": {
                "byte_offset": 0,
                "item_index": 0
              }
            },
            "to": {
              "TextReference": {
                "byte_offset": 1790,
                "item_index": 0
              }
            }
          },
          "document_id": 9435273,
          "embedding_version": 0,
          "metatdata_version": "V1",
          "modalities": [
            "text"
          ]
        },
        "vector": null
      },
      ...
      {
        "id": 40524442366902272,
        "payload": {
          "chunk_range": {
            "from": {
              "TextReference": {
                "byte_offset": 0,
                "item_index": 0
              }
            },
            "to": {
              "TextReference": {
                "byte_offset": 415,
                "item_index": 0
              }
            }
          },
          "document_id": 9435332,
          "embedding_version": 0,
          "metatdata_version": "V1",
          "modalities": [
            "text"
          ]
        },
        "vector": null
      }
    ],
    "next_page_offset": null
  },
  "status": "ok",
  "time": 0.002730958
}

Nonetheless the snapshot has a size of 1.6 GB:

$ curl -X POST http://localhost:6333/collections/demo/snapshots | jq
{
  "result": {
    "name": "demo-5104423205986989-2024-05-02-10-38-23.snapshot",
    "creation_time": "2024-05-02T10:38:24",
    "size": 1745611264,
    "checksum": "f5562887c71f97351bc2bde7f672b55afc3fb66c201be2b91c4ec7cd53742a40"
  },
  "status": "ok",
  "time": 5.427237377
}

$ curl http://localhost:6333/collections/demo/snapshots/demo-5104423205986989-2024-05-02-10-38-23.snapshot -o demo.snapshot

$ ls -lah demo.snapshot 
-rw-r--r--  1 volker.stampa  staff   1.6G May  2 12:39 demo.snapshot

And restarting Qdrant shows a write-ahead log of significant size:

$ docker restart document-index-qdrant

$ docker logs --tail 10 --follow document-index-qdrant
| (_| | (_| | | | (_| | | | | |_  
 \__, |\__,_|_|  \__,_|_| |_|\__| 
    |_|                           

Version: 1.9.0, build: b99d5074
Access web UI at http://localhost:6333/dashboard

2024-05-02T10:54:08.195923Z  INFO storage::content_manager::consensus::persistent: Loading raft state from ./storage/raft_state.json    
2024-05-02T10:54:08.206008Z  INFO storage::content_manager::toc: Loading collection: demo    
2024-05-02T10:54:09.120426Z  INFO collection::shards::local_shard: Recovering collection demo: 0/761424 (0%)    
2024-05-02T10:55:09.122449Z  INFO collection::shards::local_shard: 40177/761424 (5%)    
2024-05-02T10:56:09.124250Z  INFO collection::shards::local_shard: 79449/761424 (10%)    
2024-05-02T10:57:09.124659Z  INFO collection::shards::local_shard: 118434/761424 (15%)    
2024-05-02T10:58:09.113853Z  INFO collection::shards::local_shard: 157517/761424 (20%)    
2024-05-02T10:59:09.101792Z  INFO collection::shards::local_shard: 196525/761424 (25%)    
2024-05-02T11:00:09.101910Z  INFO collection::shards::local_shard: 235484/761424 (30%)    
2024-05-02T11:01:09.103851Z  INFO collection::shards::local_shard: 274200/761424 (36%)    
2024-05-02T11:02:09.104881Z  INFO collection::shards::local_shard: 312838/761424 (41%)    
2024-05-02T11:03:09.105430Z  INFO collection::shards::local_shard: 351533/761424 (46%)    
2024-05-02T11:04:09.105441Z  INFO collection::shards::local_shard: 390248/761424 (51%)    
2024-05-02T11:05:09.105633Z  INFO collection::shards::local_shard: 428793/761424 (56%)    
2024-05-02T11:06:09.106670Z  INFO collection::shards::local_shard: 467340/761424 (61%)    
2024-05-02T11:07:09.106725Z  INFO collection::shards::local_shard: 506002/761424 (66%)    
2024-05-02T11:08:09.106169Z  INFO collection::shards::local_shard: 544867/761424 (71%)    
2024-05-02T11:09:09.108001Z  INFO collection::shards::local_shard: 583663/761424 (76%)    
2024-05-02T11:10:09.109251Z  INFO collection::shards::local_shard: 622434/761424 (81%)    
2024-05-02T11:11:09.109702Z  INFO collection::shards::local_shard: 661062/761424 (86%)    
2024-05-02T11:12:09.109698Z  INFO collection::shards::local_shard: 699765/761424 (91%)    
2024-05-02T11:13:09.111093Z  INFO collection::shards::local_shard: 738410/761424 (96%)    
2024-05-02T11:13:44.819426Z  INFO collection::shards::local_shard: Recovered collection demo: 761424/761424 (100%)    

Steps to Reproduce

Unfortunately I have no idea how to reproduce such a collection. I could provide the snapshot (export/import does not change the behavior), but it's 1.6 GB.

Expected Behavior

  • a small snapshot (~64 MB)
  • fast startup with no write-ahead log to recover

Context (Environment)

The main problem with this collection is that restarting Qdrant takes forever, and during the restart Qdrant is unavailable.

@volkerstampa volkerstampa added the bug Something isn't working label May 2, 2024
@timvisee
Member

timvisee commented May 2, 2024

It seems to have collected an enormous amount of operations in the WAL, which would explain the large size.

Could you elaborate on what operations you've been using lately?

A delete by filter operation is known to cause this. But I don't think you've used 76k of them in sequence, or did you?

@volkerstampa
Author

volkerstampa commented May 2, 2024

Thanks for the really quick response :-)

We do use client.delete_points with a PointsSelectorOneOf containing a filter on the document_id values that are part of the payload. As far as I can see, we only did ~80 updates/deletes in total on the 60 points, but I am not 100% sure about that.
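
For reference, the equivalent delete-by-filter call via the REST API should look roughly like this (using one of the document_id values from the payload above as an example; our actual calls go through the client library):

$ curl -X POST http://localhost:6333/collections/demo/points/delete \
    -H 'Content-Type: application/json' \
    -d '{"filter": {"must": [{"key": "document_id", "match": {"value": 9435273}}]}}'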

Even if we did 76k of them in sequence, is there a chance to get rid of the WAL somehow, or is this just something that must be avoided?

@timvisee
Member

timvisee commented May 2, 2024

Even if we did 76k of them in sequence, is there a chance to get rid of the WAL somehow, or is this just something that must be avoided?

Any update operation affecting points should resolve that (insert or update).
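
For example, a single throwaway upsert through the REST API should do (the id and the constant vector below are just placeholders; the vector only has to match the collection's 128 dimensions, and you can delete the point again afterwards):

# build a dummy 128-dimensional vector with jq, then upsert it as point 0
$ VECTOR=$(jq -cn '[range(128) | 0.1]')
$ curl -X PUT 'http://localhost:6333/collections/demo/points?wait=true' \
    -H 'Content-Type: application/json' \
    -d "{\"points\": [{\"id\": 0, \"vector\": ${VECTOR}}]}"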

Could you verify whether you indeed sent just 80 operations? If you did, something else is going on.

During this whole process, did you see any errors or weird output in the logs? There may be an additional hint in there.

Based on the above, I do assume that you did not make any configuration changes.

Of course, I was just guessing here. It may be a different issue.

@timvisee timvisee closed this as completed May 6, 2024
@volkerstampa
Author

Sorry for the long delay, but I was distracted. I checked our logs and this is what I figured out:

We had 81 inserts on this collection. Unfortunately, I do not know exactly how many delete operations there were, but in the end we have just 60 points in the collection, so somewhere between 1 and 21 I guess. If I am not mistaken, there were no updates.

I also checked the Qdrant logs and there were basically only logs about creating collections, but no errors, warnings, or the like.

The only thing that might be special is that the collection was created and deleted a couple of times.

Next I will check if I can get rid of the WAL by inserting/updating anything.

@volkerstampa
Author

@timvisee Could we re-open this issue please?

@timvisee timvisee reopened this May 7, 2024
@timvisee
Member

timvisee commented May 7, 2024

The only thing that might be special is that the collection was created and deleted a couple of times.

That shouldn't be an issue. Recreating a collection basically lets you start from scratch, wiping all previous WAL data for that collection.

@volkerstampa
Author

Verified that inserting a point into this collection removed the WAL. After that, Qdrant starts up very quickly and the log indicates a WAL size of "1". Creating a snapshot is also very fast. It still uses 96 MB, but that is way less than the previous 1.6 GB. So for me this fixes the issue at hand. The question that remains is how Qdrant got into this state. I could provide you with the 1.6 GB snapshot file if you are interested, or try to analyze it on my side if you could give me hints on how to do this.

@timvisee
Member

timvisee commented May 16, 2024

I could provide you with the 1.6GB snapshot file

Yes please, that would be helpful if you still have it available.
