wrong facet_counts, missing facet values with Semantic Search #1714

Closed
Ku3mi41 opened this issue May 7, 2024 · 5 comments

Comments

@Ku3mi41

Ku3mi41 commented May 7, 2024

Description

facet_counts are wrong when using vector_query. Sometimes facet values are missing from the response entirely.

Steps to reproduce

  1. Collection schema
  2. Collection data
  3. Search query
{
  "query_by": "title,description,embedding",
  "vector_query": "embedding:([], distance_threshold:0.2)",
  "exclude_fields": "embedding",
  "highlight_affix_num_tokens": 10,
  "collection": "guru_data_TEST",
  "q": "npm",
  "facet_by": "type",
  "max_facet_values": 10,
  "exhaustive_search": true
}

Expected Behavior

The sum of all counts should equal found. Every facet value should appear in the counts section (sometimes one does not, especially on bigger collections).

Actual Behavior

    "facet_counts": [
        {
            "counts": [
                {
                    "count": 4,
                    "highlighted": "Инструкция",
                    "value": "Инструкция"
                },
                {
                    "count": 1,
                    "highlighted": "Шаблон",
                    "value": "Шаблон"
                },
                {
                    "count": 1,
                    "highlighted": "Обучение",
                    "value": "Обучение"
                },
                {
                    "count": 1,
                    "highlighted": "Разработка",
                    "value": "Разработка"
                }
            ],
            "field_name": "type",
            "sampled": false,
            "stats": {
                "total_values": 4
            }
        }
    ],
    "found": 8,
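
The expected invariant can be checked mechanically. Below is a minimal sketch in Python, assuming the response has been parsed into a dict shaped like the excerpt above, that `type` is a single-valued field, and that all of its values fit within max_facet_values. The helper name `facet_count_gap` is made up for illustration and is not part of any Typesense client.

```python
def facet_count_gap(response, field_name):
    """Return found minus the sum of facet counts for one facet field."""
    for facet in response["facet_counts"]:
        if facet["field_name"] == field_name:
            total = sum(entry["count"] for entry in facet["counts"])
            return response["found"] - total
    raise KeyError(f"no facet_counts entry for field {field_name!r}")


# Trimmed copy of the response above (highlighted and stats fields omitted).
response = {
    "facet_counts": [
        {
            "field_name": "type",
            "counts": [
                {"count": 4, "value": "Инструкция"},
                {"count": 1, "value": "Шаблон"},
                {"count": 1, "value": "Обучение"},
                {"count": 1, "value": "Разработка"},
            ],
        }
    ],
    "found": 8,
}

gap = facet_count_gap(response, "type")
print(gap)  # 1: one hit is not accounted for by any facet value
```

A gap of 0 is what the expected behavior describes; here the counts add up to 7 while found is 8.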

Typesense Version: 26

@ozanarmagan
Contributor

ozanarmagan commented May 21, 2024

@Ku3mi41 This issue should not occur with 27.0.rc12. Could you please confirm by testing with this RC version?

@Ku3mi41
Author

Ku3mi41 commented May 22, 2024

Okay, found is now exactly the same as the sum of all counts. But the counts themselves still display incorrect values. Moreover, the values in counts strongly depend on the value of per_page.

per_page: 250
"facet_counts": [
        {
            "counts": [
                {
                    "count": 99,
                    "highlighted": "Обучение",
                    "value": "Обучение"
                },
                {
                    "count": 68,
                    "highlighted": "Инструкция",
                    "value": "Инструкция"
                },
                {
                    "count": 34,
                    "highlighted": "Статья",
                    "value": "Статья"
                },
                {
                    "count": 15,
                    "highlighted": "Сервис",
                    "value": "Сервис"
                },
                {
                    "count": 14,
                    "highlighted": "Техрадар",
                    "value": "Техрадар"
                },
                {
                    "count": 7,
                    "highlighted": "Опыт",
                    "value": "Опыт"
                },
                {
                    "count": 7,
                    "highlighted": "Шаблон",
                    "value": "Шаблон"
                },
                {
                    "count": 7,
                    "highlighted": "Разработка",
                    "value": "Разработка"
                },
                {
                    "count": 5,
                    "highlighted": "Мероприятие",
                    "value": "Мероприятие"
                },
                {
                    "count": 2,
                    "highlighted": "Паспорт",
                    "value": "Паспорт"
                }
            ],
            "field_name": "type",
            "sampled": false,
            "stats": {
                "total_values": 10
            }
        }
    ],
    "found": 258,
per_page: 5
"facet_counts": [
        {
            "counts": [
                {
                    "count": 30,
                    "highlighted": "Обучение",
                    "value": "Обучение"
                },
                {
                    "count": 29,
                    "highlighted": "Инструкция",
                    "value": "Инструкция"
                },
                {
                    "count": 16,
                    "highlighted": "Статья",
                    "value": "Статья"
                },
                {
                    "count": 10,
                    "highlighted": "Техрадар",
                    "value": "Техрадар"
                },
                {
                    "count": 6,
                    "highlighted": "Опыт",
                    "value": "Опыт"
                },
                {
                    "count": 6,
                    "highlighted": "Шаблон",
                    "value": "Шаблон"
                },
                {
                    "count": 5,
                    "highlighted": "Сервис",
                    "value": "Сервис"
                },
                {
                    "count": 4,
                    "highlighted": "Разработка",
                    "value": "Разработка"
                },
                {
                    "count": 3,
                    "highlighted": "Мероприятие",
                    "value": "Мероприятие"
                },
                {
                    "count": 2,
                    "highlighted": "Паспорт",
                    "value": "Паспорт"
                }
            ],
            "field_name": "type",
            "sampled": false,
            "stats": {
                "total_values": 10
            }
        }
    ],
    "found": 111,
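
To make the per_page dependence concrete, here is a small Python sketch that compares the two responses above. The dictionaries are copied by hand from those responses; nothing here calls Typesense.

```python
# Facet counts copied from the two responses above, keyed by facet value.
counts_250 = {
    "Обучение": 99, "Инструкция": 68, "Статья": 34, "Сервис": 15,
    "Техрадар": 14, "Опыт": 7, "Шаблон": 7, "Разработка": 7,
    "Мероприятие": 5, "Паспорт": 2,
}
counts_5 = {
    "Обучение": 30, "Инструкция": 29, "Статья": 16, "Техрадар": 10,
    "Опыт": 6, "Шаблон": 6, "Сервис": 5, "Разработка": 4,
    "Мероприятие": 3, "Паспорт": 2,
}

# Both responses are internally consistent: the counts add up to found.
assert sum(counts_250.values()) == 258
assert sum(counts_5.values()) == 111

# Per-value drop when per_page goes from 250 to 5.
drop = {value: counts_250[value] - counts_5[value] for value in counts_250}
for value, lost in sorted(drop.items(), key=lambda item: -item[1]):
    print(f"{value}: {counts_250[value]} -> {counts_5[value]} (-{lost})")
```

So the sum now matches found in both cases, but the same query reports different counts for almost every value depending on per_page.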

@kishorenc
Member

But the counts themselves still display incorrect values.

I checked, and for the example data you provided earlier the counts are correct and match the document field values returned in the response hits array. Can you please explain which facet count is different?

Moreover, the values in counts strongly depend on the value of per_page.

Yes, this is because, with semantic search, we are doing a top K operation (it's too expensive to exhaustively do vector search on the entire collection). The k value is determined by the pagination depth, and facets are also only calculated for the actual results fetched. This is different from keyword search where we do facets on all matching results.
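
A toy model of this behavior, as a sketch only: it is not Typesense's implementation, the documents and distances are made up, and it takes k directly instead of deriving it from the pagination depth (that derivation is not spelled out in this thread). It only shows why faceting over the top K results gives counts that change with k.

```python
from collections import Counter


def facet_top_k(docs, k):
    """Facet only over the k nearest documents (toy model of top-K faceting)."""
    nearest = sorted(docs, key=lambda d: d["distance"])[:k]
    return Counter(d["type"] for d in nearest)


# Made-up documents: vector distance to the query plus a facet field.
docs = [
    {"distance": 0.01, "type": "guide"},
    {"distance": 0.02, "type": "guide"},
    {"distance": 0.03, "type": "article"},
    {"distance": 0.05, "type": "guide"},
    {"distance": 0.08, "type": "template"},
    {"distance": 0.13, "type": "article"},
]

small = facet_top_k(docs, k=3)
large = facet_top_k(docs, k=6)
print(dict(small))  # {'guide': 2, 'article': 1}
print(dict(large))  # {'guide': 3, 'article': 2, 'template': 1}
```

With the smaller k the "template" value is missing from the facets altogether, which mirrors the missing facet values in the original report.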

@kishorenc
Member

You can set k to a higher value inside of vector_query and then set per_page to a lower value. If a k value is provided explicitly we will use that, and this approach will also preserve pagination (but you will need to limit it to a sane value for performance reasons).
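
For illustration, the search parameters from the original report could be adjusted along these lines. This is a sketch: the function name `build_search_params` is hypothetical, the values k=200 and per_page=5 are arbitrary, and the `k:<n>` form inside vector_query is assumed to be the syntax meant above. The code only builds the parameter dict and does not send a request.

```python
import json


def build_search_params(k, per_page):
    """Search parameters from the original report, with an explicit k added."""
    return {
        "q": "npm",
        "query_by": "title,description,embedding",
        # k is set explicitly, so it no longer follows the pagination depth.
        "vector_query": f"embedding:([], k:{k}, distance_threshold:0.2)",
        "exclude_fields": "embedding",
        "facet_by": "type",
        "max_facet_values": 10,
        # A small page size; facets should then be computed over the k results.
        "per_page": per_page,
    }


params = build_search_params(k=200, per_page=5)
print(json.dumps(params, indent=2))
```

The right k is a trade-off: large enough that the facet counts are useful, small enough to keep the vector search cheap.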

@Ku3mi41
Author

Ku3mi41 commented May 23, 2024

Yes, this is because, with semantic search, we are doing a top K operation

That explains everything. Could you add this clarification to the documentation? Without vector search, all the numbers look correct, at least in the first 10 queries that I checked. Part of my initial problem was that the number of results actually turned out to be greater than the facet counts predicted. Thanks for the fix and the clarification; I think this issue can be closed.

@Ku3mi41 Ku3mi41 closed this as completed May 23, 2024