Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode whitespace is removed from unquoted elements of text[] columns #1257

Open
kevinkyyro opened this issue Jul 18, 2018 · 7 comments
Open
Assignees

Comments

@kevinkyyro
Copy link

kevinkyyro commented Jul 18, 2018

} else if (!insideString && Character.isWhitespace(chars[i])) {
// white space
continue;

Versions

PgJDBC driver: 42.2.2
PostgreSQL server: 10.4 (Debian 10.4-2.pgdg90+1)
Java: 1.8.0_161-b12

Context

I have a table with a text[] column and wrote some property tests to verify selecting after inserting resulted in the same value, and I came across a strange issue.

For some reason, the server returns text[] columns without quoting elements that only contain non-ASCII UTF-8 whitespace (ex: \u2001). As a result, those elements have all their whitespace (UTF-8 or otherwise) stripped out at the linked line, which I identified by stepping through with the debugger.

I am reporting this as a driver bug because it seems fixable there, but it may also be undesirable behavior for the server.

SQL Example

# insert into test_strings (strings)  values (U&'{a\2001b, a b}'::varchar[]);
INSERT 0 1
# select strings from test_strings;
         strings         
-------------------------
 {a b,"a b"}

Code example

Here's an example based on my setup, which is using jooq (3.11.0).

It should be a pretty straight forward conversion to java 8, but in either case running this would require either wrapping a main method or throwing it into a script and getting jooq on the classpath. I'm happy to help with that, but I assume you will have a preferred way of doing that.

import org.jooq.impl.DSL
import org.jooq.DSLContext
import org.jooq.impl.SQLDataType
import org.jooq.SQLDialect
import java.sql.DriverManager

val username: String = "insight"
val password: String = "local"
val table: String = "example" // replace me

Class.forName("org.postgresql.Driver") // force class-loading
val db: DSLContext = DSL.using(
  DriverManager.getConnection("jdbc:postgresql://localhost:5432/insight", username, password),
  SQLDialect.POSTGRES)

db.dropTableIfExists(table).execute()

db.createTable(table)
  .as(db.select(
    DSL.value(
      Array("a\u2001b", "a b"),
      DSL.field("strings", SQLDataType.VARCHAR.getArrayDataType()))))
  .execute()

println(db.select().from(table).fetch())

The output should be as follows, demonstrating that the whitespace on the first array element has been removed (the printing format here varies from psql, so neither are quoted, but that's not a significant detail):

+---------+
|varchar  |
+---------+
|[ab, a b]|
+---------+
@kevinkyyro kevinkyyro changed the title unquoted elements in text[] column have whitespace removed unicode whitespace is removed from unquoted elements of text[] columns Jul 18, 2018
@vlsi
Copy link
Member

vlsi commented Jul 19, 2018

@kevinkyyro , nice catch.

it may also be undesirable behavior for the server.

+1

I'm happy to help with that, but I assume you will have a preferred way of doing that.

Could you share the tests?
We are fine to have tests written in Scala, etc.

For instance: https://github.com/pgjdbc/pgjdbc/tree/master/test-anorm-sbt

@vlsi
Copy link
Member

vlsi commented Jul 19, 2018

@bokken
Copy link
Member

bokken commented Jul 20, 2018

The comments there indicate there used to be different behavior, possibly similar to what java is doing now.
Should java driver now be updated to just consider the 6 specific characters as white space?

@bokken
Copy link
Member

bokken commented Jul 20, 2018

#1194 Includes unit tests for arrays. Adding some cases including non ascii whitespace should be fairly easy.

@davecramer
Copy link
Member

Here's the comments from the commit on the backend.

array_in discards unquoted leading and trailing whitespace in array values,
while array_out is careful to quote array elements that contain whitespace.
This is problematic when the definition of "whitespace" varies between
locales: array_in could drop characters that were meant to be part of the
value.  To avoid that, lock down "whitespace" to mean only the traditional
six ASCII space characters.

I think we either stick with the backend behaviour. Unless of course we have a convincing argument to change that as well ?

@bokken
Copy link
Member

bokken commented Jul 21, 2018

@vlsi I would propose that we change the java driver to only look for the 6 specific chars (PR #1266).

Would it also make sense to have the back end more liberally use quotes?
While we certainly do not want locale dependent behavior, the unicode whitespace definition is not locale dependent. Does the backend have easy access to the unicode definitions, specifically:

  • Spaces - General category "Zs" in the Unicode specification.
  • New lines - General category "Zl" in the Unicode specification.
  • Paragraphs - General category "Zp" in the Unicode specification.

Is there any downside to erring on side of more liberal quoting when returning values?

@kevinkyyro
Copy link
Author

Just want to point out that regardless of the set of characters considered whitespace, it says that it only trims leading and trailing whitespace on input. The driver should certainly not be doing anything to non-leading-or-trailing, non-escape characters, correct? It seems likely that no whitespace should be dropped at all on output, since it was already dropped on input. Looking at my example above, the delimiter seems to be a comma without any added space (hard to check for sure on my phone).

@davecramer davecramer self-assigned this Nov 28, 2019
davecramer pushed a commit that referenced this issue Aug 11, 2020
* fix: don't ignore unquoted non-ascii whitespace

The postgresql backend only quotes ascii whitespace.
#1257

add comment to existing isSpace method
change from switch to chained ORs

* test: fix comment in ArrayTest.testDirectFieldString

* fix: revert spaces in array literal

* add changelog entry

add test showing inner white spaces not dropped on unquoted strings by
backend.

Co-authored-by: Vladimir Sitnikov <sitnikov.vladimir@gmail.com>
Co-authored-by: BO8979 <BO8979@W1971362.northamerica.cerner.net>
davecramer pushed a commit to davecramer/pgjdbc that referenced this issue Aug 20, 2020
)

* fix: don't ignore unquoted non-ascii whitespace

The postgresql backend only quotes ascii whitespace.
pgjdbc#1257

add comment to existing isSpace method
change from switch to chained ORs

* test: fix comment in ArrayTest.testDirectFieldString

* fix: revert spaces in array literal

* add changelog entry

add test showing inner white spaces not dropped on unquoted strings by
backend.

Co-authored-by: Vladimir Sitnikov <sitnikov.vladimir@gmail.com>
Co-authored-by: BO8979 <BO8979@W1971362.northamerica.cerner.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants