
[BUG] Categorify can't process vocabs correctly when num_buckets>1 #1857

Open

fedaeho opened this issue Aug 1, 2023 · 1 comment

Labels
bug Something isn't working
fedaeho commented Aug 1, 2023

Describe the bug
nvt.ops.Categorify doesn't process vocabs correctly when num_buckets>1 is given at the same time.

Steps/Code to reproduce bug

I tried to use the Categorify transform with pre-defined vocabs.
I also have to handle multiple OOV buckets, so I pass num_buckets>1 as well.

from merlin.core import dispatch
import pandas as pd
import nvtabular as nvt

df = dispatch.make_df(
    {
        "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], []],
        "Post": [1, 2, 3, 4],
    }
)

cat_names = ["Authors"]
label_name = ["Post"]

vocabs = {"Authors": pd.Series([f"User_{x}" for x in "ACBE"])}
cat_features = cat_names >> nvt.ops.Categorify(
    num_buckets=2, vocabs=vocabs, max_size={"Authors": 8},
)

workflow = nvt.Workflow(cat_features + label_name)
df_out = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()

For the above code, the expected indices for each value are as below.

  • pad: [0]
  • null: [1]
  • oov: [2, 3]
  • unique: [4, 5, 6, 7]
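
As a minimal sketch (my own rendering, not NVTabular output), the metadata I would expect for num_buckets=2 looks like this, with offset being the cumulative start index of each block:

import pandas as pd

# Expected meta.Authors.parquet layout for num_buckets=2: pad and null each
# take one index, oov takes num_buckets indices, then the four vocab entries.
expected_meta = pd.DataFrame(
    {
        "kind": ["pad", "null", "oov", "unique"],
        "offset": [0, 1, 2, 4],       # oov spans indices 2..3
        "num_indices": [1, 1, 2, 4],
    }
)
print(expected_meta)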

But I get the following result, with a wrong category dictionary.

  • df_out

   Authors  Post
0      [7]     1
1   [7 10]     2
2    [9 8]     3
3       []     4
  • pd.read_parquet("./categories/meta.Authors.parquet")

     kind  offset  num_indices
0     pad       0            1
1    null       1            1
2     oov       2            1
3  unique       3            4
  • pd.read_parquet("./categories/unique.Authors.parquet")

  Authors
3  User_A
4  User_C
5  User_B
6  User_E

I checked inside the Categorify.process_vocabs function, and oov_count picks up num_buckets correctly.
But when process_vocabs calls Categorify._save_encodings(), the vocabulary dictionary is not built correctly.
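
To make the mismatch concrete, here is a hedged sketch in plain pandas (not the actual NVTabular code path) of the offset arithmetic; NULL_OFFSET = 2 is my assumption that pad and null occupy indices 0 and 1:

import pandas as pd

NULL_OFFSET = 2  # assumption: pad and null occupy indices 0 and 1
oov_count = 2    # num_buckets

vals = pd.Series([f"User_{x}" for x in "ACBE"], name="Authors")
col_df = vals.dropna().to_frame()
col_df.index += NULL_OFFSET + oov_count  # uniques correctly start at 4 here

print(col_df.index.min())  # 4 -> the unique block should be recorded at offset 4
# If _save_encodings then writes the metadata assuming a single OOV slot,
# it records offset=3 / num_indices=1 for oov instead, which is exactly
# what the meta.Authors.parquet above shows.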

Expected behavior
From the current code in process_vocabs:

if num_buckets:
    oov_count = (
        num_buckets if isinstance(num_buckets, int) else num_buckets[col_name]
    ) or 1
col_df = dispatch.make_df(vals).dropna()
col_df.index += NULL_OFFSET + oov_count
save_path = _save_encodings(col_df, base_path, col_name)

I fixed the code so that process_vocabs calls Categorify._save_encodings with oov_count.

    def process_vocabs(self, vocabs):
        ...
                oov_count = 1
                if num_buckets:
                    oov_count = (
                        num_buckets if isinstance(num_buckets, int) else num_buckets[col_name]
                    ) or 1
                col_df = dispatch.make_df(vals).dropna()
                col_df.index += NULL_OFFSET + oov_count
                # before
                # save_path = _save_encodings(col_df, base_path, col_name)
                # after
                save_path = _save_encodings(col_df, base_path, col_name, oov_count=oov_count)

and I got the following df_out, as expected.

   Authors  Post
0      [4]     1
1    [4 7]     2
2    [6 5]     3
3       []     4
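
As a quick sanity check (a sketch, assuming the vocab order "ACBE" is preserved in the dictionary), the mapping reproduces the fixed output:

# pad=0, null=1, oov=[2, 3], then vocab entries in the order "ACBE".
expected_ids = {"User_A": 4, "User_C": 5, "User_B": 6, "User_E": 7}

rows = [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], []]
encoded = [[expected_ids[u] for u in row] for row in rows]
print(encoded)  # [[4], [4, 7], [6, 5], []] -- matches the fixed df_out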

Environment details (please complete the following information):

  • Environment location: Bare-metal (CentOS 7)
  • Method of NVTabular install: pip

Additional context
None

fedaeho added the bug label Aug 1, 2023
@EvenOldridge (Member) commented
In all of the applications I've built, OOV has been a single embedding, used to represent the fact that the item is new or rare. Can you help me understand the use case? Why would you want multiple OOV values? They're so rare that they'll effectively end up as random embeddings. Grouping them gives you some information.
