Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String at com.salesforce.op.features.types.FeatureTypeSparkConverter$$anonfun$2.apply(FeatureTypeSparkConverter.scala:146) #520

Open
hjfrank1991 opened this issue Oct 19, 2020 · 9 comments

@hjfrank1991

hjfrank1991 commented Oct 19, 2020

I am using the iris.csv data:

1,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa

So I created a StructType like this:

    val schema = StructType(
      Array(
        StructField("id", IntegerType, nullable = false),
        StructField("sepalLength", DoubleType, nullable = false).withComment("feature"),
        StructField("sepalWidth", DoubleType, nullable = false).withComment("feature"),
        StructField("petalLength", DoubleType, nullable = false).withComment("feature"),
        StructField("petalWidth", DoubleType, nullable = false).withComment("feature"),
        StructField("irisClass", StringType, nullable = false).withComment("label")
      )
    )

Next I get the label column and the feature columns:

val dataFrame = ...
val name = "irisClass"
val (irisClass, predictors)  = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)

The id column is neither a label nor a feature, but with the call above it is treated as a feature column as well, which I don't want.
So I keep only the columns whose comment is label or feature and drop the others:

val frame = dataFrame.drop("id")
val (irisClass, predictors)  = FeatureBuilder.fromDataFrame[Text](frame, response = name)
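
Selecting columns by their comment, rather than dropping "id" by name, could be sketched as below. This is only an illustration: `CommentedColumns` and `taggedColumns` are hypothetical names, and it is applied to the hand-written `schema` value on the assumption that a DataFrame read from CSV may not carry the comments through.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of TransmogrifAI or Spark.
object CommentedColumns {
  // Names of the fields whose comment is one of the wanted tags.
  // Fields without a comment (such as "id") are skipped.
  def taggedColumns(schema: StructType, tags: Set[String]): Seq[String] =
    schema.fields.toSeq
      .filter(_.getComment().exists(tags.contains))
      .map(_.name)

  def main(args: Array[String]): Unit = {
    val schema = StructType(Array(
      StructField("id", IntegerType, nullable = false),
      StructField("sepalLength", DoubleType, nullable = false).withComment("feature"),
      StructField("irisClass", StringType, nullable = false).withComment("label")
    ))
    // Lists the columns to keep; "id" is not among them.
    println(taggedColumns(schema, Set("label", "feature")))
  }
}
```

The resulting names could then be passed to `dataFrame.select(names.head, names.tail: _*)` in place of `dataFrame.drop("id")`.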

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val index = survived.indexed("__unknown", StringIndexerHandleInvalid.Keep)

val checkedFeatures = index.sanityCheck(featureVector, removeBadFeatures = true)

val pred = MultiClassificationModelSelector
  //.withCrossValidation()
  .withTrainValidationSplit()
  .setInput(index, checkedFeatures)
  .setOutputFeatureName("pred")
  .getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model: OpWorkflowModel = new OpWorkflow()
  .setInputDataset(frame)
  .setResultFeatures(pred)
  .train()

// save
model.save(path = "/model/automl", overwrite = true)

// load
val loadmodel = OpWorkflowModel.load("/model/automl")

// getAllFeatures
val features = loadmodel.getRawFeatures().map(_.name)

// use model to predict new data 
// Changing the order of columns
val frame3 = frame.select(features.head, features.tail: _*)
val dataFrame1 = loadmodel.setInputDataset(frame3)
  .score()
dataFrame1.show(false)

but I get this error:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
	at com.salesforce.op.features.types.FeatureTypeSparkConverter$$anonfun$2.apply(FeatureTypeSparkConverter.scala:146)

hjfrank1991 changed the title from "Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String" to "Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String at com.salesforce.op.features.types.FeatureTypeSparkConverter$$anonfun$2.apply(FeatureTypeSparkConverter.scala:146)" on Oct 19, 2020

@hjfrank1991
Author

hjfrank1991 commented Oct 19, 2020

If I change it to this:

val dataFrame1 = loadmodel.setInputDataset(frame)
  .score()
dataFrame1.show(false)

it works. So when I use the model to score data, I can't change the order of the columns?

@tovbinm
Collaborator

tovbinm commented Oct 19, 2020

In your example you do not seem to be using the frame you created. Try this:

// Drop id column
val frame = dataFrame.drop("id")

// Extract response and predictor Features
val (irisClass, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = "irisClass")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val index = irisClass.indexed("__unknown", StringIndexerHandleInvalid.Keep)
val checkedFeatures = index.sanityCheck(featureVector, removeBadFeatures = true)

val pred = MultiClassificationModelSelector
  .withTrainValidationSplit()
  .setInput(index, checkedFeatures)
  .setOutputFeatureName("pred")
  .getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model: OpWorkflowModel = new OpWorkflow()
  .setInputDataset(frame)
  .setResultFeatures(pred)
  .train()

val scored = model.setInputDataset(frame).score()

scored.show(false)

@hjfrank1991
Author

hjfrank1991 commented Oct 19, 2020

Sorry, that was a typo in my post. The code in my IDE is correct:

// Extract response and predictor Features 
val (survived, predictors) = FeatureBuilder.fromDataFrame[Text](frame, response = name)

Your example works, but when I reorder the columns of this frame (calling the result frame_new) and then score with the model, I get the error:

val scored = model.setInputDataset(frame_new).score()

So the data we score must keep the same column order?

@hjfrank1991
Author

hjfrank1991 commented Oct 19, 2020

Also, can we use this like a Spark ML pipeline? For example:

val (irisClass, predictors1) = FeatureBuilder.fromDataFrame[Text](dataFrame, response = name)
val strindex = new OpStringIndexer()
  .setInput(irisClass)
  .setOutputFeatureName("index")

val strModel = strindex.fit(dataFrame)
val mm = strModel.getSparkMlStage() match {
  case Some ( x ) => x
}

val opdt = new OpDecisionTreeClassifier()
  .setInput(strindex.getOutput(), featureVector1)
  .setOutputFeatureName("dtPred")

val labels = mm.labels

val inde = new OpIndexToString()
  .setInput(strindex.getOutput())
  .setLabels(labels)
  .setOutputFeatureName("pred")

val pipelineModel = new Pipeline("getAlgorithmType")
  .setStages(Array(strindex, opdt, inde))
  .fit(dataFrame)

Do you have an example like that?

@tovbinm
Collaborator

tovbinm commented Oct 19, 2020

We never tried re-sorting the columns. In general, this should not be an issue since we refer to the columns by their names. Why would you need to do it?

TransmogrifAI stages can be used in Spark ML pipelines as long as you maintain the naming conventions on the columns.

@hjfrank1991
Author

After training the model, we reuse it to score a new batch of data whose column names are the same but whose column order is different. If the model cannot read the columns in a different order, that limits its generality.

@tovbinm
Collaborator

tovbinm commented Oct 20, 2020

OK, I just went through the code. Each Feature that was constructed from a DataFrame Row has an index property, which is used to locate the feature column in each row.

One option I see to overcome this is to recreate the features from the new dataset prior to scoring, then use them as input for the model.
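
To make the index behaviour concrete, here is a deliberately simplified model of it in plain Scala. It is not TransmogrifAI's actual code: `IndexedFeature`, `bind` and `readText` are hypothetical names, and a row is modelled as a plain `Seq[Any]`. It only shows how a position captured from the training columns can point at a Double once the columns are reordered, and how re-binding against the new column list avoids that.

```scala
// Simplified illustration only; NOT TransmogrifAI's implementation.
object PositionalLookupSketch {
  // Hypothetical stand-in for a raw feature bound to a column position.
  final case class IndexedFeature(name: String, index: Int)

  // Binds each column name to its position in the given column list.
  def bind(columns: Seq[String]): Seq[IndexedFeature] =
    columns.zipWithIndex.map { case (name, i) => IndexedFeature(name, i) }

  // Reads a text value by position, trusting the stored index.
  def readText(row: Seq[Any], feature: IndexedFeature): String =
    row(feature.index).asInstanceOf[String]

  def main(args: Array[String]): Unit = {
    val trainColumns =
      Seq("sepalLength", "sepalWidth", "petalLength", "petalWidth", "irisClass")
    val label = bind(trainColumns).find(_.name == "irisClass").get

    // Same columns, different order: irisClass now comes first.
    val newColumns =
      Seq("irisClass", "sepalLength", "sepalWidth", "petalLength", "petalWidth")
    val newRow: Seq[Any] = Seq("Iris-setosa", 5.1, 3.5, 1.4, 0.2)

    // The stale index 4 now points at petalWidth, a Double, so the cast fails.
    println(scala.util.Try(readText(newRow, label)).isFailure)

    // Re-binding against the new column list restores the lookup.
    val rebound = bind(newColumns).find(_.name == "irisClass").get
    println(readText(newRow, rebound))
  }
}
```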

@hjfrank1991
Author

I don't quite understand. Do you mean creating the features from the new dataset and then using the original model to score?

@hjfrank1991
Author

When I use:
val features = loadmodel.getRawFeatures().map(_.name)
the order also changes.
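
Since scoring works when the frame keeps its training-time column order, one possible workaround is to record that order at training time and realign each new batch to it before scoring. The sketch below assumes restoring the order is sufficient, which this thread suggests but does not confirm; `ColumnAlignment`, `orderedColumns` and `alignTo` are hypothetical names, and only standard Spark calls (`DataFrame.columns`, `select`, `col`) are used.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of TransmogrifAI.
object ColumnAlignment {
  // Returns the expected order, failing early if the new data lacks a column.
  def orderedColumns(expected: Seq[String], actual: Seq[String]): Seq[String] = {
    val missing = expected.filterNot(actual.contains)
    require(missing.isEmpty, s"Missing columns: ${missing.mkString(", ")}")
    expected
  }

  // Reorders df so its columns follow the training-time order.
  // Extra columns in df are dropped by the select.
  def alignTo(expected: Seq[String], df: DataFrame): DataFrame =
    df.select(orderedColumns(expected, df.columns.toSeq).map(col): _*)
}
```

With `val trainColumns = frame.columns.toSeq` saved alongside the model, a reordered batch could be scored as `loadmodel.setInputDataset(ColumnAlignment.alignTo(trainColumns, frame_new)).score()`. The names returned by `getRawFeatures()` are not used as the reference here, because their order was observed above to differ.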
