Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new text embedding for sparse vector #466

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

cutecutecat
Copy link
Member

@cutecutecat cutecutecat commented Apr 15, 2024

Part of #459

  • Changed svector text representation from "[0, 1, 0, 3, 4]" to pgvector-style "{2:1, 4:3, 5:4}/5"
  • Upgraded toolchain for new feature proc_macro_byte_character from upstream

Reminder

The index is from 1 instead of 0 at pgvector

Old New
[0, 1, 2, 0, 0] {1:1, 2:2}/5
[0, 1, 2, 3, 4, 5, 6, 7] {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7}/8
[5, 6, 7] {0:5, 1:6, 2:7}/3
[0, 0, 0, 0] {}/4

@cutecutecat
Copy link
Member Author

cutecutecat commented Apr 15, 2024

The failed CI is due to an upstream uncompatiablity:

  • proc-macro2 v1.0.80 introduced proc_macro_byte_character feature, which is recently stablized but not released until 2024-04-12

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 2089054 to 0e0c58b Compare April 16, 2024 13:52
@cutecutecat cutecutecat marked this pull request as ready for review April 17, 2024 01:38
@usamoi
Copy link
Collaborator

usamoi commented Apr 18, 2024

The reason why CI fails is that sqllogictest-bin released on crates.io but not on GitHub a few days before (so cargo-binstall fallbacks to build natively). We do not have to update the toolchain.

src/datatype/text_svecf32.rs Outdated Show resolved Hide resolved
if *x != F32::zero() {
match need_splitter {
true => {
buffer.push_str(format!("{}:{}", i + 1, x).as_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel not good about indexing from 1. It's not consistent with subscripting.

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 4 times, most recently from 69e527e to 2823e00 Compare April 19, 2024 01:31
@VoVAllen
Copy link
Member

Let's hold this PR for now, due to the conflict between 1-based array and 0-based array

@usamoi usamoi reopened this May 28, 2024
@usamoi usamoi force-pushed the svec-new-text-rep branch 2 times, most recently from f86433b to 2d6c196 Compare May 28, 2024 12:38
VoVAllen
VoVAllen previously approved these changes May 29, 2024
@VoVAllen
Copy link
Member

This is used to support bm25 extension. It can produce string instead of depending on pgvecto.rs/pgvector. cc @cutecutecat

src/utils/parse.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@usamoi usamoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing must accept all valid inputs and reject all invalid inputs. It should be better if it can be written trivially.

src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 4 times, most recently from e2d685d to 81b3a58 Compare June 4, 2024 09:19
@cutecutecat cutecutecat requested a review from usamoi June 5, 2024 01:26
Copy link
Collaborator

@usamoi usamoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about {}/1/2?

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 2 times, most recently from ed1f8ad to 9a17411 Compare June 5, 2024 07:45
@cutecutecat
Copy link
Member Author

What about {}/1/2?

Fixed.

@cutecutecat cutecutecat requested a review from usamoi June 5, 2024 08:00
@usamoi
Copy link
Collaborator

usamoi commented Jun 5, 2024

{1,2,3,4}/5 gives incorrect message:

    OutOfBound {
        dims: 5,
        index: 4294967295,
    },

@usamoi
Copy link
Collaborator

usamoi commented Jun 5, 2024

{1:1,2:2,}/5 tailing commas are not accepted, but it should be

Copy link
Collaborator

@usamoi usamoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have defined ParseState, you can write the code in automata style.

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 2 times, most recently from 7bbb0e8 to 746bb52 Compare June 5, 2024 10:05
@cutecutecat cutecutecat requested a review from usamoi June 5, 2024 10:23
cutecutecat and others added 6 commits June 7, 2024 10:32
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Signed-off-by: usamoi <usamoi@outlook.com>
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants