-
I have a branch with what the changes could look like: https://github.com/bokken/pgjdbc/tree/java_utf8
-
In the `Encoding` class, the default implementation of `decode(byte[], int, int)` calls the `String` constructor with the `Charset`. This substitutes a replacement character for any byte values that are invalid for the specified `Charset`.

We have long had custom UTF-8 decoding (now in `OptimizedUTF8Encoder`). It was originally used because it was significantly faster than Java's decoding. In at least JDK 11+, our custom implementation is significantly slower for all-ASCII strings and only comparable when non-ASCII characters are present. I included a benchmark in #2025 with some meaningful improvement in our customized decoding. Here are the results for just calling `new String(bytes, off, len, StandardCharsets.UTF_8)`:

It seems like it should be pretty obvious to just start using the Java decoding, but the change in behavior to not fail on invalidly encoded strings causes some unit test failures:
https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/test/java/org/postgresql/test/jdbc2/DatabaseEncodingTest.java
Is this error handling, unique to UTF-8, really necessary?
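
For anyone who wants to see the behavior difference concretely, here is a minimal sketch using only the JDK. The class and method names (`Utf8DecodeSketch`, `decodeLenient`, `decodeStrict`) are made up for illustration and are not part of pgjdbc. The strict variant uses a `CharsetDecoder` with `CodingErrorAction.REPORT`, which is one standard way to make invalid input fail; it is not how `OptimizedUTF8Encoder` is implemented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not pgjdbc code.
class Utf8DecodeSketch {

    // Lenient: the String constructor replaces malformed input with U+FFFD.
    static String decodeLenient(byte[] bytes, int off, int len) {
        return new String(bytes, off, len, StandardCharsets.UTF_8);
    }

    // Strict: a decoder configured with REPORT throws on malformed input.
    static String decodeStrict(byte[] bytes, int off, int len)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, off, len))
                .toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {'a', (byte) 0xFF, 'b'};

        String lenient = decodeLenient(invalid, 0, invalid.length);
        System.out.println("lenient contains U+FFFD: " + lenient.contains("\uFFFD"));

        try {
            decodeStrict(invalid, 0, invalid.length);
            System.out.println("strict: decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected invalid input: " + e);
        }
    }
}
```

The lenient path silently turns the bad byte into a replacement character, while the strict path surfaces it as an exception. Whether the strict path keeps enough of the JDK's speed advantage would need to be benchmarked.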