
[Feature][Connector Hive] support hive savemode #6842

Open · wants to merge 3 commits into base: dev
Conversation

@liunaijie (Contributor) commented May 11, 2024

Purpose of this pull request

subtask of #5390

  1. Implement the Hive savemode feature.
  2. Read Hive metadata via the Hive2 JDBC connection instead of the metastore (the Hive thrift URL option is removed).

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@liunaijie liunaijie marked this pull request as ready for review May 16, 2024 10:49
@liunaijie (Contributor Author)

@EricJoy2048 @dailai @ruanwenjun Hi guys, PTAL when you have time.

@@ -66,7 +74,8 @@ public OptionRule optionRule() {

ReadonlyConfig finalReadonlyConfig =
generateCurrentReadonlyConfig(readonlyConfig, catalogTable);
return () -> new HiveSink(finalReadonlyConfig, catalogTable);
CatalogTable finalCatalog = renameCatalogTable(finalReadonlyConfig, catalogTable);
@liunaijie (Contributor Author)

Replace with the target Hive sink table name here. If it is not replaced, the source table name is passed to Hive (e.g. in a Fake-source-to-Hive-sink job), which causes issues when this catalog is used; so it is replaced here.
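The rename described above can be sketched with simplified stand-in types (`TableId` and `SimpleCatalogTable` below are illustrations, not the real SeaTunnel classes):

```java
import java.util.List;

// Minimal sketch: replace the source table's identifier with the configured
// sink table name before handing the catalog table to the sink.
record TableId(String database, String table) {}
record SimpleCatalogTable(TableId id, List<String> columns) {}

class CatalogRename {
    // sinkTableName comes from the sink config, e.g. "warehouse.orders"
    static SimpleCatalogTable rename(SimpleCatalogTable source, String sinkTableName) {
        String[] parts = sinkTableName.split("\\.", 2);
        TableId newId = parts.length == 2
                ? new TableId(parts[0], parts[1])
                : new TableId(source.id().database(), parts[0]); // keep source database
        // The schema is kept as-is; only the identifier changes.
        return new SimpleCatalogTable(newId, source.columns());
    }
}
```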

String describeFormattedTableQuery = "describe formatted " + tablePath.getFullName();
try (PreparedStatement ps = connection.prepareStatement(describeFormattedTableQuery)) {
ResultSet rs = ps.executeQuery();
return processResult(rs, tablePath, builder, partitionKeys);
@liunaijie (Contributor Author)

Now the Hive table information is parsed from the query result. That's not very elegant, but it works.
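A sketch of that parsing approach, assuming the `describe formatted` result has been read into rows of `(col_name, data_type, comment)` strings (the real connector classes are not shown; the blank-row handling is for Hive versions that emit an extra empty line after the `# col_name` header):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: extract column name -> type from "describe formatted <table>" rows.
class DescribeFormattedParser {
    static Map<String, String> parseColumns(List<String[]> rows) {
        Map<String, String> columns = new LinkedHashMap<>();
        boolean inColumnSection = false;
        for (String[] row : rows) {
            String first = row[0] == null ? "" : row[0].trim();
            if (first.startsWith("# col_name")) { // start of the column section
                inColumnSection = true;
                continue;
            }
            if (!inColumnSection) continue;
            if (first.isEmpty()) continue;        // tolerate version-specific blank rows
            if (first.startsWith("#")) break;     // next section, e.g. "# Partition Information"
            columns.put(first, row[1] == null ? "" : row[1].trim());
        }
        return columns;
    }
}
```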

@EricJoy2048 EricJoy2048 changed the title [Feature] support hive savemode [Feature][Connector Hive] support hive savemode May 16, 2024
.withValue(
FIELD_DELIMITER.key(),
ConfigValueFactory.fromAnyRef(
parameters.get("field.delim")))
@liunaijie (Contributor Author)

This line causes an issue when field.delim is \t: ConfigValueFactory.fromAnyRef escapes it to \\t, and the written data is then corrupted.
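One way to guard against that escaping is to normalize the delimiter after reading it back from the config. The helper below is a hypothetical workaround, not connector code:

```java
// Hypothetical workaround for the delimiter-escaping issue described above:
// if a tab delimiter comes back from the config as the two-character string
// "\\t", map it back to the real control character before writing data.
class DelimiterNormalizer {
    static String normalize(String delim) {
        switch (delim) {
            case "\\t": return "\t";
            case "\\n": return "\n";
            case "\\r": return "\r";
            default:    return delim;
        }
    }
}
```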

@NoPr commented May 22, 2024

Hi, there seems to be a problem with the created statement:
org.apache.hadoop.hive.ql.parse.ParseException:line 1:0 cannot recognize input near '' '' ''

@liunaijie (Contributor Author)

> Hi, there seems to be a problem with the created statement: org.apache.hadoop.hive.ql.parse.ParseException:line 1:0 cannot recognize input near '' '' ''

Is it the statement that commits the partition information, or some other statement?

@@ -33,7 +33,7 @@ By default, we use 2PC commit to ensure `exactly-once`
| name | type | required | default value |
|-------------------------------|---------|----------|----------------|
| table_name | string | yes | - |
| metastore_uri | string | yes | - |
| hive_jdbc_url | string | yes | - |
(Contributor)

How to be compatible with older versions?

@liunaijie (Contributor Author)

It is not compatible with the old version: I didn't want to use both the Hive2 JDBC connection and the Hive metastore, so I removed the metastore and use only JDBC.
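Under that change, a sink block would presumably switch from `metastore_uri` to `hive_jdbc_url`. A sketch (table name, host, and port are made up):

```hocon
# Hypothetical sink config after this PR: hive_jdbc_url replaces metastore_uri.
sink {
  Hive {
    table_name    = "default.seatunnel_orders"
    hive_jdbc_url = "jdbc:hive2://hive-server:10000/default"
  }
}
```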

@NoPr commented May 22, 2024

[screenshot: partition statement]

> Is it the statement that commits the partition information, or some other statement?

Yes. When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required? When it is set to "", I get the error described above.

@liunaijie (Contributor Author)

> When schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST", is save_mode_create_template required?

Yes, it is required. It is the CREATE TABLE statement that will be executed when the table does not exist.
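For readers hitting the same ParseException: the template is a full CREATE TABLE statement, so an empty string produces an unparseable query. A sketch of what such a template could look like (the placeholder names such as `${table_name}` and `${rowtype_fields}` are assumptions for illustration; check the connector docs for the real ones):

```hocon
# Hypothetical example of save_mode_create_template; placeholder names are assumed.
sink {
  Hive {
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    save_mode_create_template = """
      CREATE TABLE IF NOT EXISTS ${table_name} (
        ${rowtype_fields}
      )
      STORED AS ORC
    """
  }
}
```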

@NoPr commented May 22, 2024

> Yes, it is required. It is the CREATE TABLE statement that will be executed when the table does not exist.

So if I don't know the source table's schema, this source-to-sink config can't be set up? Isn't that different from what MySQL's schema_save_mode configuration achieves?

@liunaijie (Contributor Author)

> So if I don't know the source table's schema, this source-to-sink config can't be set up? Isn't that different from what MySQL's schema_save_mode configuration achieves?

It is slightly different. We can get the source table's schema and create the table from it, but other Hive settings, such as managed vs. external table, the external table location, the storage format, and so on, cannot be obtained. That is why this parameter was added: to let users define the DDL statement themselves.

@NoPr commented May 22, 2024

> It is slightly different. We can get the source table's schema and create the table from it, but other Hive settings, such as managed vs. external table, the external table location, the storage format, and so on, cannot be obtained. That is why this parameter was added: to let users define the DDL statement themselves.

So custom Hive table creation requires:
1. knowing the source table's schema;
2. writing the CREATE TABLE statement yourself and passing it as a sink parameter.

@NoPr commented May 23, 2024

[screenshots omitted]
Nice, I tried it and it works.

@liunaijie (Contributor Author) commented May 23, 2024

> Nice, I tried it and it works.

The current code still has a few issues:

  1. The statement above does not specify a field delimiter. If a delimiter such as \t is specified, it becomes \\t after being written into the Config, which breaks the written files.
  2. The Hive table structure is obtained by running desc formatted <table_name> and parsing the result. The output differs slightly across versions: in 3.1.3 there is no blank line between # col_name and the actual column names, while 2.1.1 adds an extra blank line. The parsing logic needs to handle both versions and could be cleaned up.
  3. I'm not sure whether it works with Kerberos authentication enabled. On my side I use username/password, which can be configured in the JDBC URL.
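For point 3, a sketch of carrying username/password in the HiveServer2 JDBC URL as session variables (host, port, and credentials are placeholders; whether this coexists with Kerberos is exactly the open question above):

```java
// Sketch: build a HiveServer2 JDBC URL with user/password session variables,
// the non-Kerberos auth form mentioned in the comment above.
class HiveJdbcUrl {
    static String build(String host, int port, String db, String user, String password) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db
                + ";user=" + user + ";password=" + password;
    }
}
```

The actual connection would then be opened with the Hive JDBC driver on the classpath, e.g. via `DriverManager.getConnection(url)`; credentials can alternatively be passed as separate `getConnection` arguments instead of URL session variables.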

@zhilinli123 (Contributor) commented May 24, 2024

- fix storage parameter encode issue
- update add/drop partition sql
4 participants