
initial commit: modified scala etls to accept Fannie Mae data #191

Merged

merged 31 commits into NVIDIA:branch-22.08 on Jul 25, 2022

Conversation

SurajAralihalli
Collaborator

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea requested a review from nvliyuan June 24, 2022 16:29
@SurajAralihalli SurajAralihalli marked this pull request as draft June 24, 2022 19:21
@nvliyuan
Collaborator

Please ignore the markdown link checker warnings; I'll file an issue for this.

@viadea
Collaborator

viadea commented Jun 25, 2022

@SurajAralihalli @nvliyuan Could you also check why the markdown link check failed:

[✖] /docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md → Status: 400 [Error: ENOENT: no such file or directory, access '/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'access',
  path: '/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md'
}

@nvliyuan
Collaborator

nvliyuan commented Jun 27, 2022

Could you also check why the markdown link check failed:

The link is not dead; I believe it is a bug in the link checker.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@SurajAralihalli
Collaborator Author

@viadea the link checker validates only http/https links and marks any references (links) to files within the repo as failed (e.g. /docs/get-started/xgboost-examples/building-sample-apps/scala.md).

@nvliyuan
Collaborator

@viadea the link checker validates only http/https links and marks any references (links) to files within the repo as failed (e.g. /docs/get-started/xgboost-examples/building-sample-apps/scala.md).

Let me give it another push to see who can help with this issue.
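The error format in the CI output above matches the markdown-link-check tool. Assuming that is the checker in use (an assumption; the workflow file is not shown here), its documented replacementPatterns option can rewrite repo-absolute paths like /docs/... against the checkout root so they resolve as files. A hypothetical config sketch, not the repo's actual setup:

{
  "replacementPatterns": [
    { "pattern": "^/", "replacement": "{{BASEURL}}/" }
  ]
}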

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@@ -667,7 +780,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
Collaborator

Since spark.rapids.sql.incompatibleOps.enabled is true by default now, maybe we can remove the config.
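For context, the setting under discussion would have appeared in the notebook's session setup along these lines (a hypothetical sketch; the actual cell contents are not shown in this diff, and the builder options around it are illustrative):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("mortgage-etl")
  // redundant now that spark-rapids enables incompatibleOps by default,
  // which is why the review suggests removing it
  .config("spark.rapids.sql.incompatibleOps.enabled", "true")
  .getOrCreate()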

Collaborator Author

Sure, I'll update it in the next commit.

Collaborator

Yes, the key is to keep only the non-default settings for spark-rapids based on the current GA version, say 22.06.

Collaborator Author

I have updated the latest commit to resolve the config issues in both the Scala and Python ETL notebooks.

@nvliyuan
Collaborator

Since the output of mortgage-ETL.ipynb can be the input of mortgage-gpu.ipynb, it would be nice to update the data-loading part of mortgage-gpu.ipynb. It would look something like:

df = spark.read.parquet("your etl notebook output path")
splits = df.randomSplit([0.8, 0.2])
train_data = splits[0]
trans_data = splits[1]

@nvliyuan
Collaborator

@viadea shall we keep a sample mortgage dataset and add a declaration for convenience, so that customers do not need to get the dataset from the Fannie Mae website?

@viadea
Collaborator

viadea commented Jul 13, 2022

I think we should not, and that is the same reason we have this PR: to use the raw data.

@SurajAralihalli
Collaborator Author

Since the output of mortgage-ETL.ipynb can be the input of mortgage-gpu.ipynb, it would be nice to update the data-loading part of mortgage-gpu.ipynb. It would look something like:

df = spark.read.parquet("your etl notebook output path")
splits = df.randomSplit([0.8, 0.2])
train_data = splits[0]
trans_data = splits[1]

Thanks, I have updated both the Scala and Python ETL notebooks to save the train and test datasets based on a boolean saveTrainTestDataset parameter.
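A minimal sketch of what such a switch could look like on the ETL side (saveTrainTestDataset is the parameter named above; etlDF and outPath are illustrative names, not necessarily those used in the notebooks):

val saveTrainTestDataset = true  // when false, skip writing the split datasets

if (saveTrainTestDataset) {
  // 80/20 split of the ETL output, written as separate parquet datasets
  val Array(train, test) = etlDF.randomSplit(Array(0.8, 0.2))
  train.write.mode("overwrite").parquet(s"$outPath/train")
  test.write.mode("overwrite").parquet(s"$outPath/test")
}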

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@nvliyuan nvliyuan self-requested a review July 15, 2022 06:11
nvliyuan
nvliyuan previously approved these changes Jul 15, 2022
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea
Collaborator

viadea commented Jul 18, 2022

Some issues:

1. In the ETL part, when reading the raw CSV, there is no header in those CSV files, but in the notebook scala/mortgage-ETL.ipynb I found:

val reader = sparkSession.read.option("header", true).schema(rawSchema)

We need to make sure the raw CSV is read with header=false.

2. In the same ETL notebook, when it writes the Parquet file, it has the line:

val optionsMap = Map("header" -> "true")

This optionsMap is never used, so we should remove it.

3. Assuming the outputs of the ETL notebook are just parquet files, scala/mortgage-gpu.ipynb nevertheless reads from CSV. I hope we can change it to read from parquet instead, like the one in scala/mortgage_gpu_crossvalidation.ipynb. (A sketch of both fixes follows this list.)
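A sketch of the corrected reads under the assumptions above (rawPath, rawSchema, and etlOutPath are illustrative names; only the header=false and parquet choices come from this review):

// ETL notebook: the raw Fannie Mae CSVs carry no header row
val rawDF = sparkSession.read
  .option("header", "false")
  .schema(rawSchema)
  .csv(rawPath)

// mortgage-gpu.ipynb: load the ETL output as parquet, as mortgage_gpu_crossvalidation.ipynb does
val etlDF = sparkSession.read.parquet(etlOutPath)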

@viadea viadea self-requested a review July 18, 2022 22:42
Collaborator

@viadea viadea left a comment

Make sure the ETL read treats the CSV as headerless and the ETL write outputs parquet files, in all notebooks (Scala/Python).

@nvliyuan nvliyuan self-requested a review July 19, 2022 10:13
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea self-requested a review July 20, 2022 22:15
" val sets = rawDF.randomSplit(Array[Double](0.8, 0.2))\n",
" val train = sets(0)\n",
" val eval = sets(1)\n",
" train.write.mode(\"overwrite\").parquet(new Path(outPath, \"train\").toString)\n",
Collaborator

Here we save the rawDF to parquet files, and then recompute rawDF to do the 80/20 split.
Is this the best logic for performance?
Should we instead save the rawDF to parquet files and then read it back to do the 80/20 split, so the computation isn't repeated?
Could you help test this new approach's performance to see whether it is the better logic? (See the sketch below.)
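A sketch of the suggested write-then-read-back pattern (rawDF, outPath, and the "raw" subdirectory either follow the diff excerpt above or are illustrative; whether it wins depends on how expensive recomputing rawDF's lineage is versus the extra parquet round trip, which is what this comment asks to measure):

import org.apache.hadoop.fs.Path

// materialize the ETL result once...
rawDF.write.mode("overwrite").parquet(new Path(outPath, "raw").toString)

// ...then split from the persisted files instead of recomputing the lineage
val persisted = sparkSession.read.parquet(new Path(outPath, "raw").toString)
val Array(train, eval) = persisted.randomSplit(Array(0.8, 0.2))
train.write.mode("overwrite").parquet(new Path(outPath, "train").toString)
eval.write.mode("overwrite").parquet(new Path(outPath, "eval").toString)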

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea self-requested a review July 21, 2022 17:28
@nvliyuan nvliyuan self-requested a review July 25, 2022 05:44
@nvliyuan nvliyuan merged commit ac355c0 into NVIDIA:branch-22.08 Jul 25, 2022