UTF-8 decoding validation #2026

bokken · 2021-01-18T18:03:15Z

bokken
Jan 18, 2021
Collaborator

In the Encoding class, the default implementation of decode(byte[], int, int) calls the String constructor that takes a Charset. That constructor does not fail on bad input: it substitutes a replacement character for any byte sequence that is invalid in the specified Charset.
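As a minimal sketch of that replacement behavior, using only the JDK: the class name `ReplacementDemo`, the helper `decodeLenient`, and the sample bytes are hypothetical and exist only for illustration; none of them are part of pgjdbc.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Hypothetical helper: decodes with the String constructor, which
    // silently replaces malformed input with U+FFFD instead of throwing.
    static String decodeLenient(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {'a', (byte) 0xFF, 'b'};
        String decoded = decodeLenient(invalid, 0, invalid.length);
        // No exception is raised; the bad byte shows up as U+FFFD.
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
    }
}
```

The caller gets a String back either way, so invalid data can only be detected afterwards by scanning for U+FFFD, which cannot be told apart from a U+FFFD that was legitimately present in the input.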
We have long had custom UTF-8 decoding (now in OptimizedUTF8Encoder). It was originally introduced because it was significantly faster than Java's decoding. On at least JDK 11+, our implementation is significantly slower for all-ASCII strings and only comparable when non-ASCII characters are present. I included a benchmark in #2025, along with some meaningful improvements to our custom decoding. Here are the results for just calling new String(bytes, off, len, StandardCharsets.UTF_8):

UTF8DecodeTest.newString          none      10  avgt    4    21.909 ±   0.999  ns/op
UTF8DecodeTest.newString          none      21  avgt    4    22.580 ±   0.526  ns/op
UTF8DecodeTest.newString          none      33  avgt    4    21.891 ±   0.395  ns/op
UTF8DecodeTest.newString          none      42  avgt    4    21.661 ±   0.293  ns/op
UTF8DecodeTest.newString          none      67  avgt    4    22.863 ±   0.395  ns/op
UTF8DecodeTest.newString          none      85  avgt    4    23.155 ±   0.734  ns/op
UTF8DecodeTest.newString          none     201  avgt    4    27.571 ±  13.151  ns/op
UTF8DecodeTest.newString          none     511  avgt    4    47.957 ±   2.816  ns/op
UTF8DecodeTest.newString          none    1027  avgt    4    87.463 ±   3.016  ns/op
UTF8DecodeTest.newString          none    4093  avgt    4   369.989 ±   8.802  ns/op
UTF8DecodeTest.newString         first      10  avgt    4    39.579 ±   2.625  ns/op
UTF8DecodeTest.newString         first      21  avgt    4    49.774 ±   1.974  ns/op
UTF8DecodeTest.newString         first      33  avgt    4    61.000 ±   1.266  ns/op
UTF8DecodeTest.newString         first      42  avgt    4    68.820 ±   0.910  ns/op
UTF8DecodeTest.newString         first      67  avgt    4    91.511 ±   3.317  ns/op
UTF8DecodeTest.newString         first      85  avgt    4   108.515 ±   3.216  ns/op
UTF8DecodeTest.newString         first     201  avgt    4   215.029 ±   6.801  ns/op
UTF8DecodeTest.newString         first     511  avgt    4   478.402 ±   7.537  ns/op
UTF8DecodeTest.newString         first    1027  avgt    4   986.645 ±  22.291  ns/op
UTF8DecodeTest.newString         first    4093  avgt    4  3838.688 ±  59.857  ns/op
UTF8DecodeTest.newString          last      10  avgt    4    42.017 ±   0.839  ns/op
UTF8DecodeTest.newString          last      21  avgt    4    43.879 ±   0.670  ns/op
UTF8DecodeTest.newString          last      33  avgt    4    50.578 ±   1.671  ns/op
UTF8DecodeTest.newString          last      42  avgt    4    57.402 ±   0.683  ns/op
UTF8DecodeTest.newString          last      67  avgt    4    75.590 ±   2.560  ns/op
UTF8DecodeTest.newString          last      85  avgt    4    88.322 ±   2.649  ns/op
UTF8DecodeTest.newString          last     201  avgt    4   166.938 ±   4.605  ns/op
UTF8DecodeTest.newString          last     511  avgt    4   430.017 ±  10.645  ns/op
UTF8DecodeTest.newString          last    1027  avgt    4   753.622 ±  26.744  ns/op
UTF8DecodeTest.newString          last    4093  avgt    4  2902.706 ±  21.854  ns/op

It seems like the obvious move is to just start using the Java decoding, but the resulting change in behavior (no longer failing on invalidly encoded strings) causes some unit test failures:
https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/test/java/org/postgresql/test/jdbc2/DatabaseEncodingTest.java

Is this error handling, which is unique to UTF-8, really necessary?
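For reference, the JDK can also decode strictly through a CharsetDecoder configured with CodingErrorAction.REPORT, which throws on invalid input. The sketch below is only an illustration of that option: the class name `StrictUtf8Demo` and the helper `decodeStrict` are hypothetical, and this path has not been benchmarked here, so it may not match the new String numbers above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Demo {
    // Hypothetical helper: decodes UTF-8 and throws CharacterCodingException
    // on malformed input instead of inserting a replacement character.
    static String decodeStrict(byte[] bytes, int offset, int length)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes, offset, length))
            .toString();
    }

    public static void main(String[] args) {
        byte[] invalid = {'a', (byte) 0xFF, 'b'};
        try {
            decodeStrict(invalid, 0, invalid.length);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            // Reached for this input, because 0xFF is never valid in UTF-8.
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```

A new decoder is created per call because CharsetDecoder instances are not safe for concurrent use; whether that allocation matters for short strings would need measuring.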

bokken · 2021-01-18T18:13:34Z

bokken
Jan 18, 2021
Collaborator Author

I have a branch with what the changes could look like: https://github.com/bokken/pgjdbc/tree/java_utf8
The test failures will show here: https://github.com/bokken/pgjdbc/actions/runs/494258146

1 reply

bokken Jan 22, 2021
Collaborator Author

@vlsi do you have any history on this?
