Behavior of HEX function is different between dialects #3460

viplazylmht · 2024-05-11T07:56:31Z

Problem:

HEX function encodes a bytes or string into a hexadecimal encoding, but the result of the function is vary between dialects, due to the case of the result.

For example,
in bigquery (lowercase):

select to_hex(MD5('Hello'))
|  Row  |              f0_            |  
| ----- | --------------------------- |
|   1   | 8b1a9953c4611296a827abf8c47 |

in presto/trino (uppercase):

presto> select to_hex(to_utf8('some string'));
         _col0          
------------------------
 736F6D6520737472696E67 
(1 row)

in clickhouse return uppercase.
in spark (uppercase), but hex(md5_digest(x)) produce a lowercase result:

>>> spark.sql("select hex('Hello')").collect()
[Row(hex(Hello)='48656C6C6F')]
>>> spark.sql("select md5('Hello')").collect()
[Row(md5(Hello)='8b1a9953c4611296a827abf8c47804d7')]

In snowflake, the default behavior is uppercase, but we can pass an additional argument to control the case: https://docs.snowflake.com/en/sql-reference/functions/hex_encode#argumentshttps://docs.snowflake.com/en/sql-reference/functions/hex_encode#arguments

And so on...

=> So, as a transpiler tool, should we ensure the behavior of this function while transpiling?

The solution if needed:

We can add a new exp.UpperHex (or exp.LowerHex, depends on the case of default exp.Hex).
Also, we add a flag to Dialect class, which mark the behavior of HEX function of the target dialect is uppercase or lowercase, then can safety and can simplify the SQL while parsing or generating the sql.

For example, if read dialect is bigquery (lowercase), write dialect is trino (uppercase), the following step will happen when transpiling the sql select lower(to_hex(col_x)):

Parse sql to bigquery: becasue bq are already lowercased, the tree will look like: exp.SELECT (exp.Hex( col_x ))
Generate trino sql: because trino is uppercased, generator will be generating SELECT LOWER(TO_HEX(col_x))

If we transpile select upper(to_hex(col_x)) from bigquery to trino, the code will work like this:

parse sql to exp.Select( exp.UpperHex( col_x ))
Generate trinno sql: SELECT TO_HEX(col_x)

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

georgesittas · 2024-05-11T13:23:51Z

Thanks for the reports / PRs @viplazylmht, these make sense. Perhaps we could only add a single UpperHex node and treat the current one as LowerHex? Would you like to work on this?

viplazylmht · 2024-05-11T18:14:31Z

@georgesittas Thanks, let me finish this

viplazylmht · 2024-05-12T03:58:31Z

I found hive dialect is uppercase too. Spark inherits spark2, which inherits hive, all are the same. So, maybe most of dialects are in uppercase.

0: jdbc:hive2://localhost:10000/> select hex('Hello');
INFO  : Compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.151 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Completed executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.001 seconds
+-------------+
|     _c0     |
+-------------+
| 48656C6C6F  |
+-------------+
1 row selected (0.176 seconds)
0: jdbc:hive2://localhost:10000/>

viplazylmht mentioned this issue May 12, 2024

feat(transpiler): handle different hex behavior for dialects #3463

Merged

georgesittas closed this as completed in #3463 May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Behavior of HEX function is different between dialects #3460

Behavior of HEX function is different between dialects #3460

viplazylmht commented May 11, 2024

georgesittas commented May 11, 2024

viplazylmht commented May 11, 2024

viplazylmht commented May 12, 2024

Behavior of HEX function is different between dialects #3460

Behavior of HEX function is different between dialects #3460

Comments

viplazylmht commented May 11, 2024

Problem:

The solution if needed:

georgesittas commented May 11, 2024

viplazylmht commented May 11, 2024

viplazylmht commented May 12, 2024