Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names #1245

dannnnthemannnn · 2023-05-31T06:45:07Z

Environment details

OS type and version: Mac 12.0
Java version: 20.0.1
version(s):

Steps to reproduce

Hook this connector up in a Spark job with the following code:

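import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Note: jsonKeyFilePath is assumed to be defined in the enclosing scope.
def querySpanner[T](sqlQuery: String)(implicit spark: SparkSession, enc: Encoder[T]): Dataset[T] = {
  // The s-interpolator is required for $jsonKeyFilePath to be substituted into the URL
  val url = s"jdbc:cloudspanner:/projects/your-project-id/instances/your-instance-id/databases/your-database-id?credentials=$jsonKeyFilePath"

  // Read data using Spark (sqlQuery is not used here; the table is read via dbtable)
  val df = spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "myTable")
    .option("driver", "com.google.cloud.spanner.jdbc.JdbcDriver")
    .load()

  // Convert DataFrame to Dataset
  df.as[T]
}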

Observe that every returned row contains the column names instead of the actual values:
+----+---+
|name| id|
+----+---+
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
|name| id|
+----+---+
only showing top 20 rows

My query is: "SELECT name, id FROM OrgInfoV2"

Any additional information below

It seems similar to this issue:
https://stackoverflow.com/questions/66983401/spark-mariadb-jdbc-sql-query-returns-column-names-instead-of-column-values
or this one:
https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name

where the problem appears to be in the driver.

Thanks!

olavloite · 2023-07-26T11:16:53Z

@dannnnthemannnn

I'm pretty sure that this is the same problem as, for example, sparklyr/sparklyr#3196

The problem is that Spark seems to generate a query that looks like this:

select "name", "id"
from OrgInfoV2

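Double quotes are used for string literals in Cloud Spanner (and BigQuery), so this query returns the literal strings "name" and "id" for every row instead of the column values. Spark seems to think that double quotes are a valid way to quote column names, which it does in case any of the column names contain spaces or are equal to reserved keywords.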
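To make the difference concrete, here is a minimal sketch of the two quoting styles. The helper names (`quoteAnsi`, `quoteBacktick`, `buildSelect`) are hypothetical and only mimic the shape of the generated statement; they are not Spark's actual internals.

```scala
// Hypothetical helpers for illustration only; not Spark's real code.
object QuotingSketch {
  // Double-quote style, which Cloud Spanner reads as a string literal.
  def quoteAnsi(col: String): String = "\"" + col + "\""

  // Backtick style, which Cloud Spanner (and MySQL) read as an identifier.
  def quoteBacktick(col: String): String = "`" + col + "`"

  // Builds a simple SELECT using the supplied quoting function.
  def buildSelect(cols: Seq[String], table: String, quote: String => String): String = {
    val columnList = cols.map(quote).mkString(",")
    s"SELECT $columnList FROM $table"
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("name", "id")
    // Selects two string literals per row on Cloud Spanner.
    println(buildSelect(cols, "OrgInfoV2", quoteAnsi))
    // Selects the actual columns.
    println(buildSelect(cols, "OrgInfoV2", quoteBacktick))
  }
}
```

Running the first statement against Cloud Spanner would produce one row of literals per table row, which matches the output in the report.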

A possible workaround is to explicitly use the MySQL dialect for your connection. MySQL uses the same type of quoting (backticks) as Cloud Spanner. See https://github.com/apache/spark/blob/071feabbd4325504332679dfa620bc5ee4359370/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala#L108
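One way to approximate this, since Spark picks its dialect from the JDBC URL, might be to register a small custom dialect that quotes identifiers with backticks. The following is an untested sketch, not a confirmed fix: `SpannerBacktickDialect` and `SpannerDialectSetup` are hypothetical names, and the example assumes Spark's public `JdbcDialect` and `JdbcDialects.registerDialect` API. It does not escape backticks inside column names.

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect; Spark does not ship one for Cloud Spanner.
object SpannerBacktickDialect extends JdbcDialect {
  // Apply this dialect only to Cloud Spanner JDBC URLs.
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:cloudspanner:")

  // Quote identifiers with backticks instead of double quotes.
  override def quoteIdentifier(colName: String): String =
    "`" + colName + "`"
}

object SpannerDialectSetup {
  // Call once before spark.read.format("jdbc") ... .load().
  def register(): Unit =
    JdbcDialects.registerDialect(SpannerBacktickDialect)
}
```

With the dialect registered, the read code from the report should stay unchanged; only the identifier quoting in the generated statement would differ.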

blunderbuss-gcf bot assigned rajatbhatta May 31, 2023

product-auto-label bot added the api: spanner Issues related to the googleapis/java-spanner-jdbc API. label May 31, 2023

dannnnthemannnn changed the title ~~Using this connector in Spark is resulting in all values in the spark dataframe being the column names~~ Bigquery 2.9.16: Using this connector in Spark is resulting in all values in the spark dataframe being the column names May 31, 2023

rajatbhatta assigned olavloite and unassigned rajatbhatta Aug 1, 2023
