[R-package] fix integer categorical features check in lgb.cv (fixes #6412) #6442

Open

jmoralez wants to merge 2 commits into master

Conversation

@jmoralez (Collaborator) commented on May 4, 2024

Fixes #6412

When creating the folds for CV we slice the dataset, which just calls the constructor again with the same arguments plus the used_indices. If free_raw_data = TRUE, there is no raw data left by the time this constructor is called. However, we also don't need to check whether the categorical feature indices are out of bounds, because that was already checked when the original dataset was constructed (and the categorical features can't be changed once the dataset has been constructed).

This PR adds a check for whether we're building a subset and, if so, skips that validation. We can rely on that condition because it's checked afterwards:

# not subsetting, constructing from raw data
if (is.null(private$used_indices)) {
  if (is.null(private$raw_data)) {
    stop(paste0(
      "Attempting to create a Dataset without any raw data. "
      , "This can happen if you have called Dataset$finalize() or if this Dataset was saved with saveRDS(). "
      , "To avoid this error in the future, use lgb.Dataset.save() or "
      , "Dataset$save_binary() to save lightgbm Datasets."
    ))
  }
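For context, a minimal sketch of the kind of usage reported in #6412 that this change addresses; the toy data and parameter values below are illustrative, not taken from the issue:

library(lightgbm)

# toy data with one integer-coded categorical column (column 1)
X <- matrix(rnorm(300L), ncol = 3L)
X[, 1L] <- sample(0L:4L, size = 100L, replace = TRUE)
y <- sample(c(0L, 1L), size = 100L, replace = TRUE)

# construct the Dataset up front; with the default free_raw_data = TRUE,
# the raw matrix is dropped once construction succeeds
dtrain <- lgb.Dataset(
  data = X
  , label = y
  , categorical_feature = 1L
)
dtrain$construct()

# lgb.cv() slices dtrain into folds, which calls the Dataset constructor again
# with used_indices set; before this fix, that re-ran the integer
# categorical_feature bounds check against raw data that no longer exists
cv_bst <- lgb.cv(
  data = dtrain
  , nrounds = 2L
  , nfold = 3L
  , params = list(objective = "binary", num_leaves = 7L)
)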

@jmoralez changed the title from "fix categorical features for dataset in cv" to "[R-package] fix categorical features for dataset in cv" on May 4, 2024
@jmoralez changed the title to "[R-package] fix integer categorical features check for dataset in lgb.cv" on May 4, 2024
@jmoralez added the fix label on May 4, 2024
@jmoralez changed the title to "[R-package] fix integer categorical features check in lgb.cv" on May 4, 2024
@jmoralez changed the title to "[R-package] fix integer categorical features check in lgb.cv (fixes #6412)" on May 4, 2024
@jmoralez marked this pull request as ready for review on May 4, 2024 at 17:49
@jameslamb (Collaborator) left a comment

Thanks! The fix looks great. I really appreciate the thorough description... makes perfect sense to me.

I just left some comments about testing.

Could you please also add one more test in test_dataset.R checking that the expected error message is actually raised under one of the conditions in that if statement being modified here?

Similar to this one:

expect_error({
  lgb.Dataset(raw_mat, categorical_feature = 2L)$construct()
}, regexp = "supplied a too large value in categorical_feature: 2 but only 1 features")
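
A hedged sketch of what such a test might look like; the trigger used here (constructing with the default free_raw_data = TRUE and then calling Dataset$finalize(), so that a later construct() has neither a handle nor raw data) and the test name are assumptions, not taken from this PR:

test_that("Dataset$construct() errors informatively when the raw data has been freed", {
  data(agaricus.train, package = "lightgbm")
  dtrain <- lgb.Dataset(
    data = agaricus.train$data
    , label = agaricus.train$label
  )
  # with the default free_raw_data = TRUE, construction drops the raw data
  dtrain$construct()
  # free the underlying handle so the next construct() has to rebuild from raw data
  dtrain$finalize()
  expect_error({
    dtrain$construct()
  }, regexp = "Attempting to create a Dataset without any raw data")
})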

@@ -3652,6 +3652,28 @@ test_that("lgb.cv() only prints eval metrics when expected to", {
)
})

test_that("lgb.cv() works with an already constructed dataset with integer categoricals", {

Since this is testing code in lgb.Dataset, would you consider moving it into test_dataset.R? I think it's similar to this one:

test_that("lgb.Dataset: should be able to run lgb.cv() immediately after using lgb.Dataset() on a file", {

    , verbose = .LGB_VERBOSITY
    , num_threads = .LGB_MAX_THREADS
  )
  lgb.cv(params = params, data = dtrain, nrounds = 1L)

Could you add an assertion here that something happened? Just to give us a bit more confidence that this behavior didn't result in something like an early return with no training.

For example, maybe some of these checks that the returned object contains metrics computed on the validation sets created by lgb.cv():

cv_bst <- lgb.cv(
  data = dtrain
  , nfold = 5L
  , nrounds = nrounds
  , params = list(
    objective = "binary"
    , metric = "auc,binary_error"
    , learning_rate = 1.5
    , num_leaves = 5L
    , verbose = .LGB_VERBOSITY
    , num_threads = .LGB_MAX_THREADS
  )
)
expect_true(methods::is(cv_bst, "lgb.CVBooster"))
expect_named(
  cv_bst$record_evals
  , c("start_iter", "valid")
  , ignore.order = FALSE
  , ignore.case = FALSE
)
auc_scores <- unlist(cv_bst$record_evals[["valid"]][["auc"]][["eval"]])
expect_length(auc_scores, nrounds)
expect_identical(cv_bst$best_iter, which.max(auc_scores))
expect_identical(cv_bst$best_score, auc_scores[which.max(auc_scores)])

Development

Successfully merging this pull request may close these issues.

[R-package] lgb.cv() fails with categorical features