Duplicate and inconsistent results of BM25 search #4719

Open
1 task done
Jero1970 opened this issue Apr 18, 2024 · 0 comments

How to reproduce this bug?

When storing non-English texts in WCS with the Stopwords Preset of the inverted index (BM25) set to None, queries match many high-occurrence keywords and the keyword search results become unreliable.

We can replicate it for English texts and demonstrate it on the Quickstart Tutorial in the Weaviate Docs (https://weaviate.io/developers/weaviate/quickstart). Using the same code and the same data (10 entries from the TV quiz show "Jeopardy!" with the properties 'category', 'question', 'answer') and searching via BM25 for the query "the science is", we get 6 entries, all with the same score 0.35819104313850403.

Once you set Stopwords Preset to None in the collection definition (thereby allowing searches for high-occurrence words), the search returns more entries than the original dataset contains (i.e. more than 10), meaning it returns duplicate entries with different scores. To make things worse, the result differs every time you make a fresh import of the dataset (this applies to both insert_many() and batch import).

Code used:

import weaviate
import weaviate.classes as wvc
import os
import requests
import json

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
    headers={
        # Replace with your inference API key
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]
    }
)

try:
    # ===== define collection =====
    questions = client.collections.create(
        name="Question",
        # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
        # Ensure the `generative-openai` module is used for generative queries
        generative_config=wvc.config.Configure.Generative.openai(),
        inverted_index_config=wvc.config.Configure.inverted_index(
            stopwords_preset=wvc.config.StopwordsPreset.NONE,
        ),
    )

    # ===== import data =====
    resp = requests.get(
        'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
    data = json.loads(resp.text)  # Load data

    question_objs = list()
    for i, d in enumerate(data):
        question_objs.append({
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        })

    questions = client.collections.get("Question")
    questions.data.insert_many(question_objs)

    response = questions.query.bm25(
        query="the science is",
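        # explain_score must be requested explicitly for obj.metadata.explain_score to be populated
        return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),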
    )

    for obj in response.objects:
        print(obj.metadata.score)
        print(obj.metadata.explain_score)
        print(obj.properties)
        print()

finally:
    client.close()  # Close client gracefully

What is the expected behavior?

Consistent results without duplicates.

What is the actual behavior?

An example result in full (returns 14 objects, 4 duplicates - answers: DNA, Liver, Antelope, species):

0.6708551645278931
, BM25F_is_frequency:1, BM25F_is_propLength:14, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'wire', 'question': 'A metal that is ductile can be pulled into this while cold & under pressure', 'category': 'SCIENCE'}

0.5260506272315979
, BM25F_is_frequency:1, BM25F_is_propLength:10, BM25F_the_frequency:1, BM25F_the_propLength:3
{'answer': 'the diamondback rattler', 'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'category': 'ANIMALS'}

0.4687879681587219
, BM25F_science_propLength:1, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_science_frequency:1
{'answer': 'the atmosphere', 'question': 'Changes in the tropospheric layer of this are what gives us weather', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'Sound barrier', 'question': 'In 70-degree air, a plane traveling at about 1,130 feet per second breaks it', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}

0.35819104313850403
, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1, BM25F_the_propLength:12
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.31266409158706665
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.1350332498550415
, BM25F_the_frequency:2, BM25F_the_propLength:9
{'answer': 'Elephant', 'question': "It's the only living mammal in the order Proboseidea", 'category': 'ANIMALS'}

0.11059693247079849
, BM25F_is_propLength:14, BM25F_the_frequency:2, BM25F_the_propLength:14, BM25F_is_frequency:1
{'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'}

0.10673391073942184
, BM25F_the_frequency:2, BM25F_the_propLength:15, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'DNA', 'question': 'In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance', 'category': 'SCIENCE'}

0.09976458549499512
, BM25F_the_frequency:2, BM25F_the_propLength:17
{'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'}

0.07754258811473846
, BM25F_the_propLength:12, BM25F_science_frequency:1, BM25F_science_propLength:1, BM25F_the_frequency:1
{'answer': 'Liver', 'question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'category': 'SCIENCE'}

0.05944186821579933
, BM25F_the_frequency:1, BM25F_the_propLength:18, BM25F_science_frequency:1, BM25F_science_propLength:1
{'answer': 'species', 'question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'category': 'SCIENCE'}
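
The duplicates in the output above can also be found mechanically. The following is a minimal sketch and not part of the original reproduction script: `find_duplicates` is a hypothetical helper, and it keys on the answer text only because the printed output does not include object UUIDs. Against a live collection, `obj.uuid` would probably be the safer key.

```python
from collections import defaultdict


def find_duplicates(results):
    """Group scores by key and keep only keys that occur more than once.

    `results` is a list of (key, score) pairs in rank order.
    """
    scores_by_key = defaultdict(list)
    for key, score in results:
        scores_by_key[key].append(score)
    return {key: scores for key, scores in scores_by_key.items() if len(scores) > 1}


# (answer, score) pairs copied from the 14-object output above, in rank order.
RESULTS = [
    ("wire", 0.6708551645278931),
    ("the diamondback rattler", 0.5260506272315979),
    ("the atmosphere", 0.4687879681587219),
    ("Sound barrier", 0.35819104313850403),
    ("DNA", 0.35819104313850403),
    ("species", 0.35819104313850403),
    ("Liver", 0.35819104313850403),
    ("Antelope", 0.31266409158706665),
    ("Elephant", 0.1350332498550415),
    ("Antelope", 0.11059693247079849),
    ("DNA", 0.10673391073942184),
    ("the nose or snout", 0.09976458549499512),
    ("Liver", 0.07754258811473846),
    ("species", 0.05944186821579933),
]

duplicates = find_duplicates(RESULTS)
for answer, scores in duplicates.items():
    print(answer, scores)
```

For this output the helper reports the four answers named above (DNA, species, Liver, Antelope), each with two different scores.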

Example results of 3 different imports of the same dataset:

| Import 1: answer - score | Import 2: answer - score | Import 3: answer - score |
|---|---|---|
| 'wire' - 0.6708551645278931 | 'wire' - 0.6708551645278931 | 'wire' - 0.6708551645278931 |
| 'the diamondback rattler' - 0.5260506272315979 | 'the diamondback rattler' - 0.5260506272315979 | 'the diamondback rattler' - 0.5260506272315979 |
| 'the atmosphere' - 0.4687879681587219 | 'the atmosphere' - 0.4687879681587219 | 'the atmosphere' - 0.4687879681587219 |
| 'Antelope' - 0.42326104640960693 | 'DNA' - 0.35819104313850403 | 'Liver' - 0.4357336163520813 |
| 'species' - 0.41763290762901306 | 'Liver' - 0.35819104313850403 | 'Sound barrier' - 0.35819104313850403 |
| 'DNA' - 0.35819104313850403 | 'Sound barrier' - 0.35819104313850403 | 'species' - 0.35819104313850403 |
| 'Liver' - 0.35819104313850403 | 'species' - 0.35819104313850403 | 'DNA' - 0.35819104313850403 |
| 'Sound barrier' - 0.35819104313850403 | 'Antelope' - 0.31266409158706665 | 'Antelope' - 0.31266409158706665 |
| 'Elephant' - 0.1350332498550415 | 'Elephant' - 0.1350332498550415 | 'Elephant' - 0.1350332498550415 |
| 'DNA' - 0.10673391073942184 | 'Antelope' - 0.11059693247079849 | 'Antelope' - 0.11059693247079849 |
| 'the nose or snout' - 0.09976458549499512 | 'DNA' - 0.10673391073942184 | 'DNA' - 0.10673391073942184 |
| 'Liver' - 0.07754258811473846 | 'the nose or snout' - 0.09976458549499512 | 'the nose or snout' - 0.09976458549499512 |
| | 'Liver' - 0.07754258811473846 | 'species' - 0.05944186821579933 |
| | 'species' - 0.05944186821579933 | |
| 12 results (2 duplicates) | 14 results (4 duplicates) | 13 results (3 duplicates) |
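
To see which entries actually differ between the three imports listed above, the result lists can be compared as multisets. This is a sketch under stated assumptions: `inconsistent_entries` is a hypothetical helper, the scores are rounded to four decimals for readability, and entries are identified by answer text because the listing above contains no UUIDs.

```python
from collections import Counter


def inconsistent_entries(runs):
    """Return (answer, score) pairs whose occurrence count differs between runs."""
    counters = [Counter(run) for run in runs]
    all_pairs = set().union(*counters)
    return sorted(p for p in all_pairs if len({c[p] for c in counters}) > 1)


# Entries that appear identically in all three imports (scores rounded).
COMMON = [
    ("wire", 0.6709), ("the diamondback rattler", 0.5261),
    ("the atmosphere", 0.4688), ("Sound barrier", 0.3582),
    ("DNA", 0.3582), ("Elephant", 0.1350),
    ("DNA", 0.1067), ("the nose or snout", 0.0998),
]

# Remaining entries of each import, taken from the listing above (scores rounded).
import_1 = COMMON + [("Antelope", 0.4233), ("species", 0.4176),
                     ("Liver", 0.3582), ("Liver", 0.0775)]
import_2 = COMMON + [("Liver", 0.3582), ("species", 0.3582),
                     ("Antelope", 0.3127), ("Antelope", 0.1106),
                     ("Liver", 0.0775), ("species", 0.0594)]
import_3 = COMMON + [("Liver", 0.4357), ("species", 0.3582),
                     ("Antelope", 0.3127), ("Antelope", 0.1106),
                     ("species", 0.0594)]

differing = inconsistent_entries([import_1, import_2, import_3])
for answer, score in differing:
    print(answer, score)
```

On this data, only entries for Antelope, Liver and species differ between imports; all other entries, including both DNA entries, are identical in the three runs.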

Supporting information

When you dig deeper, you find that when you query the properties separately (i.e. 'category', 'question', 'answer') via the query_properties parameter of bm25(), the scores are consistent across imports and no duplicates occur. So the BM25 search itself works, but the algorithm that fuses the per-property scores into a score for the whole object is broken.
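
The per-property check described above could look roughly like the sketch below. It is not the exact code we ran: `bm25_per_property` and `has_duplicates` are hypothetical helper names, and the caller passes in the collection and the metadata request so the helpers themselves need no Weaviate import. The only client call used is `bm25()` with its `query`, `query_properties` and `return_metadata` parameters.

```python
def bm25_per_property(collection, query, properties, return_metadata):
    """Run one BM25 query per property and collect (uuid, score) pairs."""
    per_property = {}
    for prop in properties:
        response = collection.query.bm25(
            query=query,
            query_properties=[prop],  # restrict the search to a single property
            return_metadata=return_metadata,
        )
        per_property[prop] = [
            (str(obj.uuid), obj.metadata.score) for obj in response.objects
        ]
    return per_property


def has_duplicates(pairs):
    """True if any uuid occurs more than once in a list of (uuid, score) pairs."""
    ids = [uuid for uuid, _ in pairs]
    return len(ids) != len(set(ids))


# Usage with the collection from the reproduction script (assumes an open client):
#
#   questions = client.collections.get("Question")
#   scores = bm25_per_property(
#       questions,
#       "the science is",
#       ["category", "question", "answer"],
#       wvc.query.MetadataQuery(score=True),
#   )
#   for prop, pairs in scores.items():
#       print(prop, has_duplicates(pairs), pairs)
```

Running this after each fresh import and comparing the printed pairs is one way to confirm that the per-property scores stay stable while the fused scores do not.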

Also, we don't think the problem is with the Stopwords Preset itself. More likely the problem lies in the keyword search for high-occurrence words, which the Stopwords Preset EN merely hides.

The same problem occurs with larger datasets; the small dataset was chosen for brevity.

It might seem like a small bug, but it prevents any production use for non-English texts and for domain-specific texts in which the same words occur frequently without being defined as stopwords (e.g. legal texts).

  • Weaviate Server Version: 1.24.8
  • Deployment Method: WCS
  • Client Language and Version: Python weaviate-client 4.5.5

Forum thread: https://forum.weaviate.io/t/bug-duplicate-and-inconsistent-results-of-bm25-search/2065

Server Version

1.24.8

@Jero1970 Jero1970 added the bug label Apr 18, 2024