[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039

Open
48 of 93 tasks
PHILO-HE opened this issue Dec 14, 2023 · 57 comments
Labels
enhancement New feature or request

Comments

@PHILO-HE
Contributor

PHILO-HE commented Dec 14, 2023

Description

Listed here are the Spark functions not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function.
You can find the support status of all functions in this Gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted to the Velox community, or whether the function is already implemented in Velox, which keeps most SQL functions in its sparksql and prestosql folders.

Reference:


  • percentile_approx/approx_percentile (WIP, guangxin)
  • concat_ws (PR ready, Add concat_ws Spark function facebookincubator/velox#8854)
  • unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
  • locate
  • parse_url (PR drafted, not merged)
  • urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
  • normalizenanandzero
  • arrayintersects
  • default.json_split (udf, no need to impl.): "external UDF"
  • parsejsonarray: "external UDF"
  • struct
  • percentile (@Yohahaha)
  • first/first_value (@JkSelf)
  • last/last_value (@JkSelf)
  • posexplode (WIP, @marin-ma)
  • trunc (WIP, HannanKan)
  • months_between (PR ready)
  • date_trunc (WIP, HannanKan)
  • stack
  • grouping_id
  • printf (@Surbhi-Vijay)
  • space (WIP, rhh777)
  • inline (WIP, @marin-ma)
  • to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
  • from_csv
  • from_json
  • json_object_keys
  • json_tuple
  • schema_of_csv
  • schema_of_json
  • to_csv
  • to_json (Suppose workable with folly function used)
  • make_ym_interval (WIP, @marin-ma)
  • make_timestamp (WIP, @marin-ma)
  • make_interval
  • make_dt_interval
  • from_utc_timestamp (@acvictor)
  • extract
  • exists (@lyy-pineapple)
  • date_part
  • zip_with
  • transform (@Yohahaha)
  • transform_keys
  • transform_values
  • map_from_entries (WIP, MaYan)
  • map_filter (WIP, MaYan)
  • map_entries (Done, by MaYan)
  • map_concat
  • forall (@lyy-pineapple)
  • flatten (@ivoson)
  • filter
  • filter (array) (@ivoson)
  • width_bucket
  • array_sort (may just map to a velox function)
  • xpath
  • xpath_boolean
  • xpath_double
  • xpath_float
  • xpath_int
  • xpath_long
  • xpath_number
  • xpath_short
  • xpath_string
  • unbase64 (WIP, @fyp711)
  • decode (partially supported if translated to caseWhen. WIP Cody)
  • initcap (WIP, velox PR: 8676)
  • unix_date (velox PR 8725, completed)
  • count_min_sketch
  • bool_and/every (@mskapilks)
  • bool_or/any/some (@mskapilks)
  • shuffle (completed)
  • bround (@xumingming)
  • format_string (@gaoyangxiaozhu)
  • format_number (@gaoyangxiaozhu)
  • soundex (@zhli1142015)
  • levenshtein (@zhli1142015)
  • cot (@honeyhexin)
  • expm1 (@Donvi)
  • stack (generator function, @xumingming)
  • Since Spark-3.3 (related to ML, low priority)
  • regr_count
  • regr_avgx
  • regr_avgy
  • regr_r2
  • regr_sxx
  • regr_sxy
  • regr_syy
  • regr_slope
  • regr_intercept
  • Since Spark-3.3
  • Since Spark-3.4
PHILO-HE added the enhancement (New feature or request) label Dec 14, 2023
PHILO-HE pinned this issue Dec 14, 2023
PHILO-HE changed the title from "[VL] Spark function support list [please leave comment/mark if you plan to implement]" to "[VL] Unsupported spark function list [please leave comment/mark if you plan to implement]" Dec 15, 2023
PHILO-HE changed the title from "[VL] Unsupported spark function list [please leave comment/mark if you plan to implement]" to "[VL] Unsupported spark function list [please leave a comment if you plan to pick some]" Dec 15, 2023
@Yohahaha
Contributor

Yohahaha commented Dec 29, 2023

I'd like to support hex and unhex.

Update: hex and unhex are already supported in Gluten.

@zwangsheng
Contributor

Hi, I'd like to give the hour function a try.

@konjac
Contributor

konjac commented Jan 4, 2024

Hi, I'd like to have a look into map_keys

@fyp711
Contributor

fyp711 commented Jan 11, 2024

Hi I'd like to support find_in_set in velox

@HannanKan
Contributor

Hi, I'd like to support date_trunc/trunc.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

Hi, I'd like to support dense_rank.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

dense_rank is already supported in Velox: facebookincubator/velox#6289.

@zhztheplayer
Member

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

@PHILO-HE
Contributor Author

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!
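For whoever picks this up, the accuracy mismatch quoted above could be bridged roughly as in the sketch below. It assumes that Spark's integer accuracy corresponds to a relative error of 1.0/accuracy (as the Spark docs describe it) and that the double argument on the Velox side is an error bound between 0 and 1. The helper name is hypothetical and is not taken from Gluten or Velox.

```python
def spark_accuracy_to_error_bound(accuracy: int) -> float:
    """Hypothetical helper: map Spark's integer accuracy to a double error bound.

    Assumption: Spark treats 1.0 / accuracy as the relative error, and the
    native side expects that error as a double.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# Spark's default accuracy of 10000 would map to an error bound of 0.0001.
default_bound = spark_accuracy_to_error_bound(10000)
```

Whether the conversion belongs in Gluten's plan translation or in a Spark-specific Velox function is a separate design choice that this sketch does not settle.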

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

I will take a look at the ntile window function.

@zhouyuan
Contributor

unbase64:
#4482

@zjuwangg
Contributor

Is there any plan to support the from_json function?

@yma11
Contributor

yma11 commented Jan 29, 2024

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox, so I will need to check consistency.

@acvictor
Contributor

I'd like to give date_from_unix_date a shot

@PHILO-HE
Contributor Author

PHILO-HE commented Feb 21, 2024

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

@acvictor
Contributor

acvictor commented Feb 21, 2024

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

@Surbhi-Vijay
Contributor

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.
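To illustrate the rewrite described here, below is a minimal plain-Python model of nullif expressed through an If expression. It is only a sketch: the helper names (sql_eq, sql_if) are hypothetical, None stands in for SQL NULL, and it assumes the rewritten form is "if a = b then NULL else a" with a NULL condition treated as not true.

```python
def sql_eq(a, b):
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_if(cond, then_value, else_value):
    """SQL-style If: only a true condition selects the 'then' branch."""
    return then_value if cond is True else else_value


def nullif(a, b):
    """nullif(a, b) modelled as If(a = b, NULL, a)."""
    return sql_if(sql_eq(a, b), None, a)
```

With this model, nullif(1, 1) gives None, nullif(1, 2) gives 1, and nullif(1, None) gives 1 because the NULL comparison does not select the "then" branch.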

@PHILO-HE
Contributor Author

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

@acvictor
Contributor

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

@rui-mo
Contributor

rui-mo commented Feb 26, 2024

I'd like to work on locate and arrayintersect.

@mskapilks
Contributor

I would like to work on bool_and, bool_or

@zhztheplayer
Member

zhztheplayer commented Feb 29, 2024

  • collect_list (velox supported, needs Gluten to enable array for project plan node)
  • collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fall back (in 3.3).

@Surbhi-Vijay
Contributor

I would like to give printf a try.

@Yohahaha
Contributor

I'd like to take the get function, also known as GetArrayItem.

@Yohahaha
Contributor

I'd like to take the transform function.

@lyy-pineapple
Contributor

@PHILO-HE Hello, I'd like to take the forall function.

@ivoson
Contributor

ivoson commented Apr 17, 2024

I'd like to take the flatten function.

@acvictor
Contributor

I'd like to try array_size.

@lyy-pineapple
Contributor

@PHILO-HE Hello, I'd like to take the forall function.

I'd also like to support exists (array).

@gaoyangxiaozhu
Contributor

Hey @zhouyuan, could you help add format_string and format_number to the list? I will take both of them later.

@PHILO-HE
Contributor Author

Hey @zhouyuan, could you help add format_string and format_number to the list? I will take both of them later.

@gaoyangxiaozhu, just added them into the list. Thanks!

@zhli1142015
Contributor

I'd like to take soundex and levenshtein, thanks.

@honeyhexin

I'd like to take cot, thanks.

@Donvi
Contributor

Donvi commented Apr 30, 2024

I'd like to take the math function expm1 and am already working on it.

@gaoyangxiaozhu
Contributor

gaoyangxiaozhu commented May 7, 2024

PR for width_bucket support: #5634. It looks like a Velox-side change is still needed to support the case bucket_number <= 0. I will send a PR to the Velox repository to fix that.
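For context on the bucket_number <= 0 case, here is a rough plain-Python model of the width_bucket semantics. It rests on my own assumption that Spark returns NULL (None here) for a non-positive bucket count, for equal bounds, and for NaN or infinite bounds, rather than raising an error; please verify this against Spark's WidthBucket expression before relying on it. Parameter names are illustrative.

```python
import math
from typing import Optional


def width_bucket(value: float, lo: float, hi: float, num_buckets: int) -> Optional[int]:
    """Sketch of width_bucket; invalid inputs are assumed to yield NULL (None)."""
    bounds_invalid = any(math.isnan(x) or math.isinf(x) for x in (lo, hi))
    if num_buckets <= 0 or lo == hi or bounds_invalid or math.isnan(value):
        return None
    lower, upper = min(lo, hi), max(lo, hi)
    if lo < hi:
        # Ascending range: bucket 0 is below the range, num_buckets + 1 is above it.
        if value < lower:
            return 0
        if value >= upper:
            return num_buckets + 1
        return int(num_buckets * (value - lower) / (upper - lower)) + 1
    # Descending range (lo > hi): buckets are counted from the upper bound down.
    if value > upper:
        return 0
    if value <= lower:
        return num_buckets + 1
    return int(num_buckets * (upper - value) / (upper - lower)) + 1
```

Under these assumptions, width_bucket(5.3, 0.2, 10.6, 5) falls into bucket 3, while width_bucket(5.3, 0.2, 10.6, 0) gives None.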

@ivoson
Contributor

ivoson commented May 9, 2024

I'd like to implement array_append and array_insert for spark 3.4+

@xumingming
Contributor

I'd like to take a look at the stack function. It seems to be a Generator, meaning one row of input might return multiple rows of output. Does Velox have this generator ability?

@marin-ma
Contributor

marin-ma commented May 14, 2024

I'd like to take a look at the stack function. It seems to be a Generator, meaning one row of input might return multiple rows of output. Does Velox have this generator ability?

@xumingming Currently, 4 generator functions are supported: explode, posexplode, inline and json_tuple. The approach is to create a ProjectNode + UnnestNode + ProjectNode pattern in the Velox pipeline. But it seems the stack function cannot use this pattern. Perhaps we can build another pipeline by leveraging the ExpandNode in Velox (not sure whether this approach really works).
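To make the difference concrete, here is a small plain-Python model of the row-level semantics only. It says nothing about how Velox or Gluten implement these functions. posexplode unnests one array value into one row per element, whereas stack reshapes a fixed list of arguments into n rows, padding with NULL (None here). The function bodies are my own sketch of the Spark semantics.

```python
def posexplode(arr):
    """One input array becomes one (position, element) row per element."""
    if arr is None:
        return []  # assumed: a NULL array produces no rows (non-outer variant)
    return list(enumerate(arr))


def stack(n, *values):
    """Split the values into n rows of equal width, padding the last row with None."""
    if n <= 0:
        raise ValueError("n must be positive")
    width = -(-len(values) // n)  # ceiling division
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]
```

For example, stack(2, 1, 2, 3) yields the rows (1, 2) and (3, None). The output row count is fixed by the first argument rather than by the length of an array column, which may be why an unnest-based pattern does not fit directly.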

@xumingming
Contributor

@marin-ma Thanks for the advice, I will take a look.

@NEUpanning
Contributor

I'd like to take unix_date, thanks.

@PHILO-HE
Contributor Author

PHILO-HE commented May 17, 2024

I'd like to take unix_date, thanks.

@NEUpanning, we have supported it in both Gluten & Velox. Just changed its state in the list. Thanks!
#5287
facebookincubator/velox#8725

@NEUpanning
Contributor

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show up in the list. I would also like to pick it.

@PHILO-HE
Contributor Author

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show up in the list. I would also like to pick it.

@NEUpanning, this list only tracks work-in-progress functions. I think to_date is already supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may also be supported. I noticed the test below in Gluten. You can confirm whether all date patterns are supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +
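As a reading aid for the test query above, the sketch below models in plain Python what date_part and extract are expected to return for a few fields. It assumes that yearofweek means the ISO 8601 week-numbering year and that week means the ISO week number. The field table is illustrative and far from the full list of patterns Spark accepts.

```python
from datetime import date


def date_part(field: str, d: date) -> int:
    """Return one field of a date; only a few illustrative fields are modelled."""
    iso_year, iso_week, _ = d.isocalendar()
    parts = {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "week": iso_week,        # assumed: ISO 8601 week number
        "yearofweek": iso_year,  # assumed: ISO 8601 week-numbering year
    }
    return parts[field.lower()]


def extract(field: str, d: date) -> int:
    """extract(field FROM d) is modelled as the same lookup as date_part."""
    return date_part(field, d)
```

A date such as 2021-01-01 shows why the distinction matters: its calendar year is 2021, but it belongs to ISO week 53 of 2020, so yearofweek is 2020.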

@NEUpanning
Contributor

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show up in the list. I would also like to pick it.

@NEUpanning, this list only tracks work-in-progress functions. I think to_date is already supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may also be supported. I noticed the test below in Gluten. You can confirm whether all date patterns are supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +

I can't find any implementation of the date_part and to_date functions in Velox. Could you help me find them? Thanks.

@xumingming
Contributor

shuffle, array_sort are already supported, can be marked as complete.

@xumingming
Contributor

xumingming commented May 22, 2024

I will take a look at bround.

@PHILO-HE
Contributor Author

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show up in the list. I would also like to pick it.

@NEUpanning, this list only tracks work-in-progress functions. I think to_date is already supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.
date_part may also be supported. I noticed the test below in Gluten. You can confirm whether all date patterns are supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +

I can't find any implementation of the date_part and to_date functions in Velox. Could you help me find them? Thanks.

@NEUpanning, not a direct replacement. date_part is covered here. to_date is converted to Cast + GetTimestamp by Spark.
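The Cast + GetTimestamp decomposition of to_date can be pictured with the plain-Python sketch below. It is a model of the idea only: the function names mirror the expression names but are my own, the format strings use Python's strptime syntax rather than Spark's datetime patterns, and returning None for unparsable input is an assumption about non-ANSI behaviour.

```python
from datetime import date, datetime
from typing import Optional


def get_timestamp(text: str, fmt: str) -> Optional[datetime]:
    """Parse a string into a timestamp; unparsable input is assumed to give NULL."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


def cast_to_date(ts: Optional[datetime]) -> Optional[date]:
    """Cast a timestamp to a date by dropping the time-of-day part."""
    return None if ts is None else ts.date()


def to_date(text: str, fmt: str) -> Optional[date]:
    """to_date(text, fmt) modelled as Cast(GetTimestamp(text, fmt) AS DATE)."""
    return cast_to_date(get_timestamp(text, fmt))
```

Because the backend only ever sees the two inner expressions, there is no standalone to_date function to look for on the Velox side.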

@PHILO-HE
Contributor Author

PHILO-HE commented May 24, 2024

shuffle, array_sort are already supported, can be marked as complete.

@xumingming, it seems sort_array is supported, but array_sort is not. Please take some time to confirm. Thanks!
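One reason the two are easy to confuse but not interchangeable is NULL placement. The plain-Python sketch below models the behaviour as I understand it from the Spark docs: sort_array puts NULLs first in ascending order and last in descending order, while array_sort (without a comparator) sorts ascending and puts NULLs last. None stands in for NULL, and the comparator form of array_sort is not modelled.

```python
def sort_array(arr, asc=True):
    """sort_array: NULLs first when ascending, last when descending."""
    nulls = [x for x in arr if x is None]
    values = sorted((x for x in arr if x is not None), reverse=not asc)
    return nulls + values if asc else values + nulls


def array_sort(arr):
    """array_sort without a comparator: ascending order, NULLs last."""
    values = sorted(x for x in arr if x is not None)
    return values + [None] * (len(arr) - len(values))
```

For the input [3, None, 1], sort_array gives [None, 1, 3] while array_sort gives [1, 3, None], so mapping one onto the other would change results whenever NULL elements are present.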
