
initial commit: modified scala etls to accept Fannie Mae data #191

Merged

merged 31 commits into NVIDIA:branch-22.08 on Jul 25, 2022

Conversation

SurajAralihalli
Collaborator

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea requested a review from nvliyuan June 24, 2022 16:29
@SurajAralihalli SurajAralihalli marked this pull request as draft June 24, 2022 19:21
@nvliyuan
Collaborator

Please ignore the markdown link checker warnings; I'll file an issue for this.

@viadea
Collaborator

viadea commented Jun 25, 2022

@SurajAralihalli @nvliyuan Could you also check why the markdown link check failed:

[✖] /docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md → Status: 400 [Error: ENOENT: no such file or directory, access '/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'access',
  path: '/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md'
}

@nvliyuan
Collaborator

nvliyuan commented Jun 27, 2022

Could you also check why the markdown link check failed:

The link is not dead; I believe it is a bug in the link checker.

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@SurajAralihalli
Collaborator Author

@viadea the link checker validates only http/https links and marks any references (links) to files within the repo as failed (e.g. /docs/get-started/xgboost-examples/building-sample-apps/scala.md).

@nvliyuan
Collaborator

@viadea the link checker validates only http/https links and marks any references (links) to files within the repo as failed (e.g. /docs/get-started/xgboost-examples/building-sample-apps/scala.md).

Let me give it another push to see who can help with this issue.
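The error format in the CI output above matches the markdown-link-check tool. Assuming that is the checker in use (an assumption; the workflow file is not shown here), its documented replacementPatterns option can rewrite repo-absolute paths like /docs/... against the checkout root so they resolve as files. A hypothetical config sketch, not the repo's actual setup:

{
  "replacementPatterns": [
    { "pattern": "^/", "replacement": "{{BASEURL}}/" }
  ]
}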

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@@ -667,7 +780,7 @@
},
{
"cell_type": "code",
"execution_count": 60,
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
Collaborator

Since spark.rapids.sql.incompatibleOps.enabled is true by default now, maybe we can remove the config.
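For context, the setting under discussion would have appeared in the notebook's session setup along these lines (a hypothetical sketch; the actual cell contents are not shown in this diff, and the builder options around it are illustrative):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("mortgage-etl")
  // redundant now that spark-rapids enables incompatibleOps by default,
  // which is why the review suggests removing it
  .config("spark.rapids.sql.incompatibleOps.enabled", "true")
  .getOrCreate()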

Collaborator Author

Sure, I'll update it in the next commit.

Collaborator

Yes, the key is to keep only the non-default settings for spark-rapids based on the current GA version, say 22.06.

Collaborator Author

I have updated the latest commit to resolve the config issues in both the Scala and Python ETL notebooks.

@nvliyuan
Collaborator

Since the output of mortgage-ETL.ipynb can be the input of mortgage-gpu.ipynb, it would be nice to update the data-loading part of mortgage-gpu.ipynb. It would look something like:

df = spark.read.parquet("your etl notebook output path")
splits = df.randomSplit([0.8, 0.2])
train_data = splits[0]
trans_data = splits[1]

@nvliyuan
Collaborator

@viadea shall we keep a sample mortgage dataset and add a declaration for convenience, so that customers do not need to get the dataset from the Fannie Mae website?

@viadea
Collaborator

viadea commented Jul 13, 2022

I think we should not, and that is the same reason we have this PR: to use the raw data.

@SurajAralihalli
Collaborator Author

Since the output of mortgage-ETL.ipynb can be the input of mortgage-gpu.ipynb, it would be nice to update the data-loading part of mortgage-gpu.ipynb. It would look something like:

df = spark.read.parquet("your etl notebook output path")
splits = df.randomSplit([0.8, 0.2])
train_data = splits[0]
trans_data = splits[1]

Thanks, I have updated both the Scala and Python ETL notebooks to save the train and test datasets based on a boolean saveTrainTestDataset parameter.
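A minimal sketch of what such a switch could look like on the ETL side (saveTrainTestDataset is the parameter named above; etlDF and outPath are illustrative names, not necessarily those used in the notebooks):

val saveTrainTestDataset = true  // when false, skip writing the split datasets

if (saveTrainTestDataset) {
  // 80/20 split of the ETL output, written as separate parquet datasets
  val Array(train, test) = etlDF.randomSplit(Array(0.8, 0.2))
  train.write.mode("overwrite").parquet(s"$outPath/train")
  test.write.mode("overwrite").parquet(s"$outPath/test")
}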

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@nvliyuan nvliyuan self-requested a review July 15, 2022 06:11
nvliyuan
nvliyuan previously approved these changes Jul 15, 2022
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea
Collaborator

viadea commented Jul 18, 2022

Some issues:

1. In the ETL part, when reading the raw CSV, there is no header in those CSV files, but in the notebook scala/mortgage-ETL.ipynb I found:

val reader = sparkSession.read.option("header", true).schema(rawSchema)

We need to make sure the raw CSV is read with header=false.

2. In the same ETL notebook, when it writes the Parquet file, it has the line:

val optionsMap = Map("header" -> "true")

This optionsMap is never used, so we should remove it.

3. Assuming the outputs of the ETL notebook are just parquet files, scala/mortgage-gpu.ipynb nevertheless reads from CSV. I hope we can change it to read from parquet instead, like the one in scala/mortgage_gpu_crossvalidation.ipynb. (A sketch of both fixes follows this list.)
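A sketch of the corrected reads under the assumptions above (rawPath, rawSchema, and etlOutPath are illustrative names; only the header=false and parquet choices come from this review):

// ETL notebook: the raw Fannie Mae CSVs carry no header row
val rawDF = sparkSession.read
  .option("header", "false")
  .schema(rawSchema)
  .csv(rawPath)

// mortgage-gpu.ipynb: load the ETL output as parquet, as mortgage_gpu_crossvalidation.ipynb does
val etlDF = sparkSession.read.parquet(etlOutPath)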

@viadea viadea self-requested a review July 18, 2022 22:42
Collaborator

@viadea viadea left a comment

Make sure the ETL read treats the CSV as headerless and the ETL write outputs parquet files, in all notebooks (Scala/Python).

@nvliyuan nvliyuan self-requested a review July 19, 2022 10:13
Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea self-requested a review July 20, 2022 22:15
" val sets = rawDF.randomSplit(Array[Double](0.8, 0.2))\n",
" val train = sets(0)\n",
" val eval = sets(1)\n",
" train.write.mode(\"overwrite\").parquet(new Path(outPath, \"train\").toString)\n",
Collaborator

Here we save the rawDF to parquet files, and then recompute rawDF to do the 80/20 split.
Is this the best logic for performance?
Should we instead save the rawDF to parquet files and then read it back to do the 80/20 split, so the computation isn't repeated?
Could you help test this new approach's performance to see whether it is the better logic? (See the sketch below.)
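A sketch of the suggested write-then-read-back pattern (rawDF, outPath, and the "raw" subdirectory either follow the diff excerpt above or are illustrative; whether it wins depends on how expensive recomputing rawDF's lineage is versus the extra parquet round trip, which is what this comment asks to measure):

import org.apache.hadoop.fs.Path

// materialize the ETL result once...
rawDF.write.mode("overwrite").parquet(new Path(outPath, "raw").toString)

// ...then split from the persisted files instead of recomputing the lineage
val persisted = sparkSession.read.parquet(new Path(outPath, "raw").toString)
val Array(train, eval) = persisted.randomSplit(Array(0.8, 0.2))
train.write.mode("overwrite").parquet(new Path(outPath, "train").toString)
eval.write.mode("overwrite").parquet(new Path(outPath, "eval").toString)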

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@viadea viadea self-requested a review July 21, 2022 17:28
@nvliyuan nvliyuan self-requested a review July 25, 2022 05:44
@nvliyuan nvliyuan merged commit ac355c0 into NVIDIA:branch-22.08 Jul 25, 2022