Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behavior of HEX function is different between dialects #3460

Closed
viplazylmht opened this issue May 11, 2024 · 3 comments · Fixed by #3463
Closed

Behavior of HEX function is different between dialects #3460

viplazylmht opened this issue May 11, 2024 · 3 comments · Fixed by #3463

Comments

@viplazylmht
Copy link
Contributor

Problem:

HEX function encodes a bytes or string into a hexadecimal encoding, but the result of the function is vary between dialects, due to the case of the result.

For example,
in bigquery (lowercase):

select to_hex(MD5('Hello'))
|  Row  |              f0_            |  
| ----- | --------------------------- |
|   1   | 8b1a9953c4611296a827abf8c47 |

in presto/trino (uppercase):

presto> select to_hex(to_utf8('some string'));
         _col0          
------------------------
 736F6D6520737472696E67 
(1 row)

in clickhouse return uppercase.
in spark (uppercase), but hex(md5_digest(x)) produce a lowercase result:

>>> spark.sql("select hex('Hello')").collect()
[Row(hex(Hello)='48656C6C6F')]
>>> spark.sql("select md5('Hello')").collect()
[Row(md5(Hello)='8b1a9953c4611296a827abf8c47804d7')]

In snowflake, the default behavior is uppercase, but we can pass an additional argument to control the case: https://docs.snowflake.com/en/sql-reference/functions/hex_encode#argumentshttps://docs.snowflake.com/en/sql-reference/functions/hex_encode#arguments

And so on...

=> So, as a transpiler tool, should we ensure the behavior of this function while transpiling?

The solution if needed:

We can add a new exp.UpperHex (or exp.LowerHex, depends on the case of default exp.Hex).
Also, we add a flag to Dialect class, which mark the behavior of HEX function of the target dialect is uppercase or lowercase, then can safety and can simplify the SQL while parsing or generating the sql.

For example, if read dialect is bigquery (lowercase), write dialect is trino (uppercase), the following step will happen when transpiling the sql select lower(to_hex(col_x)):

  1. Parse sql to bigquery: becasue bq are already lowercased, the tree will look like: exp.SELECT (exp.Hex( col_x ))
  2. Generate trino sql: because trino is uppercased, generator will be generating SELECT LOWER(TO_HEX(col_x))

If we transpile select upper(to_hex(col_x)) from bigquery to trino, the code will work like this:

  1. parse sql to exp.Select( exp.UpperHex( col_x ))
  2. Generate trinno sql: SELECT TO_HEX(col_x)

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@georgesittas
Copy link
Collaborator

Thanks for the reports / PRs @viplazylmht, these make sense. Perhaps we could only add a single UpperHex node and treat the current one as LowerHex? Would you like to work on this?

@viplazylmht
Copy link
Contributor Author

@georgesittas Thanks, let me finish this

@viplazylmht
Copy link
Contributor Author

I found hive dialect is uppercase too. Spark inherits spark2, which inherits hive, all are the same. So, maybe most of dialects are in uppercase.

0: jdbc:hive2://localhost:10000/> select hex('Hello');
INFO  : Compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.151 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Completed executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.001 seconds
+-------------+
|     _c0     |
+-------------+
| 48656C6C6F  |
+-------------+
1 row selected (0.176 seconds)
0: jdbc:hive2://localhost:10000/>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants