[R-package] fix integer categorical features check in lgb.cv (fixes #6412) #6442
base: master
Conversation
Thanks! The fix looks great. I really appreciate the thorough description... makes perfect sense to me.
I just left some comments about testing.
Could you please also add one more test in test_dataset.R checking that the expected error message is actually raised under one of the conditions in the if statement being modified here?
Similar to this one:
LightGBM/R-package/tests/testthat/test_dataset.R
Lines 620 to 622 in 6e78e69
expect_error({
    lgb.Dataset(raw_mat, categorical_feature = 2L)$construct()
}, regexp = "supplied a too large value in categorical_feature: 2 but only 1 features")
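A rough sketch of what such an additional test could look like, assuming the bounds check keeps the same error message format as in the snippet above (the matrix and its dimensions here are made up for illustration):

test_that("lgb.Dataset errors on out-of-bounds integer categorical_feature", {
    # hypothetical data: 2 features, but categorical_feature points at a 3rd
    raw_mat <- matrix(rnorm(100L), ncol = 2L)
    expect_error({
        lgb.Dataset(raw_mat, categorical_feature = 3L)$construct()
    }, regexp = "supplied a too large value in categorical_feature: 3 but only 2 features")
})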
@@ -3652,6 +3652,28 @@ test_that("lgb.cv() only prints eval metrics when expected to", {
    )
})

test_that("lgb.cv() works with an already constructed dataset with integer categoricals", {
Since this is testing code in lgb.Dataset, would you consider moving it into test_dataset.R? I think it's similar to this one:
test_that("lgb.Dataset: should be able to run lgb.cv() immediately after using lgb.Dataset() on a file", {
    , verbose = .LGB_VERBOSITY
    , num_threads = .LGB_MAX_THREADS
)
lgb.cv(params = params, data = dtrain, nrounds = 1L)
Could you add an assertion here that something happened? Just to give us a bit more confidence that this behavior didn't result in something like an early return with no training.
For example, maybe some of these checks that the returned object contains metrics computed on the validation sets created by lgb.cv():
LightGBM/R-package/tests/testthat/test_basic.R
Lines 469 to 492 in 6e78e69
cv_bst <- lgb.cv(
    data = dtrain
    , nfold = 5L
    , nrounds = nrounds
    , params = list(
        objective = "binary"
        , metric = "auc,binary_error"
        , learning_rate = 1.5
        , num_leaves = 5L
        , verbose = .LGB_VERBOSITY
        , num_threads = .LGB_MAX_THREADS
    )
)
expect_true(methods::is(cv_bst, "lgb.CVBooster"))
expect_named(
    cv_bst$record_evals
    , c("start_iter", "valid")
    , ignore.order = FALSE
    , ignore.case = FALSE
)
auc_scores <- unlist(cv_bst$record_evals[["valid"]][["auc"]][["eval"]])
expect_length(auc_scores, nrounds)
expect_identical(cv_bst$best_iter, which.max(auc_scores))
expect_identical(cv_bst$best_score, auc_scores[which.max(auc_scores)])
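Adapted to the new test in this PR (which runs a single round), the assertions might look roughly like the sketch below; the "auc" metric and binary objective are assumptions about the test's params and would need to match whatever the test actually configures:

cv_bst <- lgb.cv(params = params, data = dtrain, nrounds = 1L)
expect_true(methods::is(cv_bst, "lgb.CVBooster"))
expect_named(
    cv_bst$record_evals
    , c("start_iter", "valid")
    , ignore.order = FALSE
    , ignore.case = FALSE
)
# one recorded score per boosting round confirms training actually happened
auc_scores <- unlist(cv_bst$record_evals[["valid"]][["auc"]][["eval"]])
expect_length(auc_scores, 1L)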
Fixes #6412
When creating the folds for cv we slice the dataset, which just calls the constructor again, passing the same arguments plus the used_indices. If free_raw_data = TRUE, then by the time this constructor is called there isn't any raw data anymore. However, we also don't need to check whether the categorical feature indices are out of bounds, because that was already checked when constructing the original dataset (and the categorical features can't be changed once the dataset has been constructed). This PR adds a check to see if we're building a subset and, if so, skips the validation. We can rely on that condition because it's checked afterwards:
LightGBM/R-package/R/lgb.Dataset.R
Lines 202 to 212 in 6e78e69
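For illustration only, the logic amounts to something like the sketch below. validate_categorical_feature() is a hypothetical standalone helper, not the actual code in lgb.Dataset.R; in the PR the condition lives inside the Dataset's construct() method, and the error message format is assumed to match the one in the test quoted earlier:

# hypothetical helper illustrating the guard added in this PR
validate_categorical_feature <- function(categorical_feature, num_features, used_indices) {
    # if used_indices is set we're constructing a subset (e.g. a CV fold);
    # the parent Dataset already validated categorical_feature, and its raw
    # data may have been freed, so skip the bounds check entirely
    if (!is.null(used_indices)) {
        return(invisible(NULL))
    }
    if (max(categorical_feature) > num_features) {
        stop(sprintf(
            "supplied a too large value in categorical_feature: %d but only %d features"
            , max(categorical_feature)
            , num_features
        ))
    }
    invisible(NULL)
}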