[Question] Got 'Provider "gs" not installed' on Dataproc #1348

Open
allan-silva opened this issue Feb 5, 2024 · 1 comment
Labels
api: storage Issues related to the googleapis/java-storage-nio API. priority: p3 Desirable enhancement or fix. May not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. type: question Request for information or clarification. Not an issue.

Comments

@allan-silva

Hi, I'm trying to use this library to access data in a GCS bucket from a Spark job running on Dataproc.

So far I have tried:

  • Adding this library as a dependency of my Scala 2.12 project:
    libraryDependencies ++= Seq(
      "com.google.cloud" % "google-cloud-nio" % "0.123.10",
      "org.apache.spark" %% "spark-core" % "3.5.0" % "provided",
      "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided",
      "br.dev.contrib.gov.sus.opendata" % "libdatasus-parquet-dbf" % "1.0.5" % "provided"
    ),
  • Passing com.google.cloud:google-cloud-nio:0.123.10 (and also the newest version) via the --packages parameter of the Spark job.
  • Sending the google-cloud-nio jar via the --jars Spark parameter.
  • Even loading the service provider manually:
ServiceLoader.load(classOf[CloudStorageFileSystemProvider])

According to the README, it looks like I only need to add this library as a dependency. Am I supposed to take any other step?

I always get 'Provider "gs" not installed' from the Dataproc job.
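For context, `Paths.get(URI)` only consults the providers returned by `FileSystemProvider.installedProviders()`, a list the JVM builds once from the system class loader. Calling `ServiceLoader.load(...)` by hand does not add anything to that list. The following is a minimal diagnostic sketch using only the JDK; `ProviderDiagnostics` and its method names are hypothetical and not part of google-cloud-nio. It compares the schemes the JVM installed at startup with the schemes a given class loader can discover.

```scala
import java.nio.file.spi.FileSystemProvider
import java.util.ServiceLoader
import scala.collection.JavaConverters._ // Scala 2.12; deprecated but still available in 2.13

// Hypothetical helper, for illustration only.
object ProviderDiagnostics {
  // Schemes of the providers the JVM registered at startup.
  // This is the only list that Paths.get(URI) looks at.
  def installedSchemes(): Set[String] =
    FileSystemProvider.installedProviders().asScala.map(_.getScheme).toSet

  // Schemes of the providers that ServiceLoader can find through `loader`,
  // whether or not they ended up in the installed list.
  def discoverableSchemes(loader: ClassLoader): Set[String] =
    ServiceLoader.load(classOf[FileSystemProvider], loader).asScala.map(_.getScheme).toSet
}
```

Logging both sets from inside a task, for example with `Thread.currentThread().getContextClassLoader` as the loader, would show whether `gs` is visible to the task's class loader but absent from the installed list. That would point at class loading rather than at a missing jar; this is one possible cause, not something confirmed in this thread.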

      val sourceFileURI = URI.create(row.getAs[String]("file_uri"))
     ...
      val outputFileURI = URI.create(s"$outputBucket/${sourceFileHadoopPath.getName}.parquet")

      val converter = DbfParquet.builder().build()

      converter.convert(
        Paths.get(sourceFileURI),
        Paths.get(outputFileURI)
      )

Results in:

24/02/05 22:25:29 INFO BigQueryDataSourceReaderContext: Got read session for GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4 for application id: application_1707171685672_0002
+-------------------------------------------------+------+
|file_uri                                         |source|
+-------------------------------------------------+------+
|gs://informacoes-ambulatoriais-raw/CIHASE1310.dbc|SIA   |
|gs://informacoes-ambulatoriais-raw/CIHADF1206.dbc|SIA   |
+-------------------------------------------------+------+

> ^^^ Files to be processed
24/02/05 22:25:39 INFO ReadSessionCreator: Reusing read session: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4, for table: GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}
24/02/05 22:25:39 INFO BigQueryDataSourceReaderContext: Got read session for GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4 for application id: application_1707171685672_0002
24/02/05 22:25:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (informacoes-ambulatoriais-r4u3ielni3jbu-w-0.us-central1-c.c.puc-tcc-412315.internal executor 1): java.nio.file.FileSystemNotFoundException: Provider "gs" not installed
	at java.base/java.nio.file.Path.of(Path.java:212)
	at java.base/java.nio.file.Paths.get(Paths.java:97)
	at br.dev.contrib.gov.sus.opendata.jobs.FileConversionJob$.$anonfun$convertFiles$1(FileConversionJob.scala:68)
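
The stack trace shows the failure inside `Paths.get`, before any conversion code runs. One possible workaround, sketched below under the assumption that the google-cloud-nio provider is visible through the task's context class loader (not verified here), is to fall back to a provider found via `ServiceLoader` and ask it for the path directly. `PathResolver` is a hypothetical helper; only JDK calls are used.

```scala
import java.net.URI
import java.nio.file.{FileSystemNotFoundException, Path, Paths}
import java.nio.file.spi.FileSystemProvider
import java.util.ServiceLoader
import scala.collection.JavaConverters._ // Scala 2.12; deprecated but still available in 2.13

// Hypothetical helper, for illustration only.
object PathResolver {
  // Try the JVM's installed providers first; if the scheme is not installed,
  // look for a matching provider through the supplied class loader.
  def resolve(uri: URI, loader: ClassLoader): Path =
    try Paths.get(uri)
    catch {
      case _: FileSystemNotFoundException =>
        ServiceLoader.load(classOf[FileSystemProvider], loader).asScala
          .find(p => uri.getScheme.equalsIgnoreCase(p.getScheme))
          .map(_.getPath(uri))
          .getOrElse(throw new FileSystemNotFoundException(
            s"""No provider for scheme "${uri.getScheme}" is visible through the given class loader"""))
    }
}
```

In the job above, this would replace `Paths.get(sourceFileURI)` with `PathResolver.resolve(sourceFileURI, Thread.currentThread().getContextClassLoader)`. If the fallback also fails, the provider jar is likely not reaching the executors at all.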
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/java-storage-nio API. label Feb 5, 2024
@cojenco cojenco added type: question Request for information or clarification. Not an issue. priority: p3 Desirable enhancement or fix. May not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Feb 6, 2024
@cojenco
Contributor

cojenco commented Feb 13, 2024

Hi @allan-silva, based on the error message, this seems to be caused by a dependency that is missing or not packaged in the required way. Please check how similar issues were resolved; hopefully those previous discussions will help.
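
As one illustration of what "packaged in the required way" can mean: NIO providers are registered through files under `META-INF/services`, so a fat jar has to keep those files. The fragment below is a sketch that assumes the job is built with the sbt-assembly plugin (1.x syntax); the thread does not say how the jar is built, so treat that as an assumption.

```scala
// build.sbt fragment; assumes the sbt-assembly plugin is enabled (an assumption).
assembly / assemblyMergeStrategy := {
  // Merge service registration files instead of discarding them,
  // so ServiceLoader can still find FileSystemProvider implementations.
  case PathList("META-INF", "services", _*) => MergeStrategy.filterDistinctLines
  // Defer to the plugin's default strategy for everything else.
  case other =>
    val defaultStrategy = (assembly / assemblyMergeStrategy).value
    defaultStrategy(other)
}
```

This matters mainly when a custom merge strategy discards everything under `META-INF`; the plugin's default strategy should already preserve the service files.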
