Releases: apache/incubator-gluten

v1.1.1

02 Mar 05:29
7999b61

Release Notes - Gluten - Version 1.1.1

We are pleased to announce that Gluten has been accepted as an Apache Incubating project. Additionally, we are excited to unveil the release of Gluten-1.1.1. This version marks the final release before our transition to Apache.

Highlights (Velox backend only)

  • Support Spark 3.2, 3.3, and 3.4 (API only)
  • Support 30 common Spark Operators
  • Support 220 common Spark Functions
  • Velox codebase updated to 2024/02/29
  • Refactor Data Lake API to support Delta Lake scan and Iceberg COW table reads
  • Better S3, GCS support
  • More stability in Spill support
  • Enhance metrics for spill, shuffle, and more
  • Enhance fallback case support by expanding coverage for missing cases and updating messages accordingly
  • Enhance shuffle, including merge before compression, push-based shuffle, and more
  • More bug fixes

What's Changed

  • [GLUTEN-3855][VL] Fix ORC related failed UT by @chenxu14 in #3805
  • [VL] Support IsNull filter pushdown by @rui-mo in #3791
  • [VL] Update velox-backend-limitations.md by @FelixYBW in #3639
  • [GLUTEN-2169][VL] Enable GlutenEnsureRequirementsSuite in unit tests by @JkSelf in #3860
  • [CH] Fix exception of pb MessageToJsonString by @exmy in #3823
  • [GLUTEN-3851][VL] Add remaining filter time metric by @zhli1142015 in #3852
  • [VL] Support ignoreNulls for NthValue window function by @PHILO-HE in #3857
  • [VL] Enable using static link for QAT by @marin-ma in #3863
  • [VL] Fix assertion failures when mixing use of partial aggregation spilling and flushing by @zhztheplayer in #3872
  • [GLUTEN-3796][VL][FOLLOW_UP] Correct test name match and move black list to exclude in VeloxTestSettings by @zwangsheng in #3874
  • [GLUTEN-3528][VL] Construct unique & non-overlapping partition/sort keys for window operator by @PHILO-HE in #3883
  • [GLUTEN-3879][CH] salt 1% of TPCH-1 data to NULL instead of 10% by @binmahone in #3880
  • [VL] Doc refresh by @zhouyuan in #3882
  • [GLUTEN-3865][CH] Refactor aggregating without keys by @lgbo-ustc in #3866
  • [GLUTEN-3722][CH] Improve shuffle writer by @taiyang-li in #3728
  • [VL] Map date_format to a Velox function name by @PHILO-HE in #3878
  • [VL]Daily Update Velox Version (20231129) by @yma11 in #3877
  • [CORE] Add InputIteratorTransformer to decouple ReadRel and iterator index by @ulysses-you in #3854
  • [GLUTEN-3732][VL] Use arrow result-returning variants FileWriter::Open API by @yangzhg in #3733
  • [CORE] Move validate methods from TransformerApi to ValidatorApi by @exmy in #3881
  • [GLUTEN-3824][CH]Bug fix hdfs path contains space by @KevinyhZou in #3825
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231201) by @lwz9103 in #3898
  • [VL] Break up spilling operation to two phases: shrink phase and spill phase by @zhztheplayer in #3895
  • [GLUTEN-1699][VL] Support loadLibFromJar on RedHat 7/8 by @ychris78 in #3893
  • [GLUTEN-3906] [VL] fix: fix package.sh failed for x86 by @lzjqsdd in #3907
  • [GLUTEN-3750][CH]Bug fix json parse error by @KevinyhZou in #3751
  • [GLUTEN-3902][VL] Add documentation to configure the Velox+GCS connector by @tigrux in #3902
  • [DOC] Revise Gluten document by @PHILO-HE in #3892
  • [VL]Daily Update Velox Version (20231203) by @yma11 in #3913
  • [VL] Minor improvements for CI stale bot by @zhztheplayer in #3888
  • [VL] Avoid reapplying code patches for external projects when ENABLE_EP_CACHE=ON by @zhztheplayer in #3916
  • [VL] minor change for fallback log by @zhli1142015 in #3919
  • [VL] Add sort merge join metrics by @ulysses-you in #3920
  • [GLUTEN-3378][CORE] Datasource V2 data lake read support by @liujiayi771 in #3843
  • [VL] ENABLE_EP_CACHE=ON still uses cached Velox build although the build arguments were changed by @zhztheplayer in #3926
  • [VL] Make bloom_filter_agg fall back when might_contain is not transformable by @zhli1142015 in #3917
  • [VL][CI] update docker build script by @zhouyuan in #3904
  • [GLUTEN-3917][FOLLOWUP] Add back SparkShimLoader import by @ulysses-you in #3940
  • [VL] Fix VeloxTPCHV1BhjSuite and VeloxTPCHV2Suite useV1SourceList by @liujiayi771 in #3930
  • [VL] Fix syntax error in stale.yml by @zhztheplayer in #3945
  • [GLUTEN-3854][CORE][FOLLOWUP] Add ColumnarInputAdapter back to recover UI graph by @ulysses-you in #3933
  • [GLUTEN-1632][CH]Daily Update Clickhouse Version (20231206) by @lwz9103 in #3938
  • [VL] Add output row metric for InputIteratorTransformer by @Yohahaha in #3939
  • [GLUTEN-3927][CH] Improve the performance of element_at by @taiyang-li in #3928
  • [GLUTEN-3908][CH] Improve shuffle split for clickhouse backend by remove ColumnNullable's memcmp by @KevinyhZou in #3909
  • [GLUTEN-3924][CORE] Match hive UDF name in case-insensitive mode during expression transformation by @taiyang-li in #3925
  • [GLUTEN-3958] Use getDeclaredConstructor().newInstance() in ScanTransformerFactory by @liujiayi771 in #3961
  • [GLUTEN-3944][CH]Fix gluten.jar with delta20 when use spark 3.3 by @lwz9103 in #3947
  • [VL] gluten-te: In dockerfiles, use symbolic link for /opt/velox by @zhztheplayer in #3946
  • [VL]Daily Update Velox Version (20231206) by @yma11 in #3954
  • Revert "[GLUTEN-3908][CH] Improve shuffle split for clickhouse backend by remove ColumnNullable's memcmp " by @baibaichen in #3965
  • [GLUTEN-3890][CH] Respect spill_threshold for all buffers in shuffle writer by @taiyang-li in #3891
  • [CORE] Fix wrong fallback cost by @ulysses-you in #3967
  • [GLUTEN-3922][CH] Fix incorrect shuffle hash id value when executing modulo by @zzcclp in #3923
  • [VL] quick fix for static build git conflict by @zhouyuan in #3971
  • [GLUTEN-3486][CH] Fix AQE cannot coalesce shuffle partitions by @exmy in #3941
  • [GLUTEN-3949][CH] Merge small blocks from upstream phase into a large one by @lgbo-ustc in #3952
  • [GLUTEN-3948][CH] Fix exception and diff of trunc function by @exmy in #3968
  • [GLUTEN-3979][CORE] Use exists() instead of map().exists() to improve code readability by @dcoliversun in #3980
  • [VL]Daily Update Velox Version (20231208) by @yma11 in #3973
  • Revert "[VL] Make bloom_filter_agg fall back when might_contain is not transformable (#3917)" by @loneylee in #3977
  • [GLUTEN-3580][VL] support read data from abfs with account key by @gaoyangxiaozhu in #3897
  • [GLUTEN-3991][CH] Fix the incorrect display name for the mergetree file format by @zzcclp in #3992
  • [VL] gluten-te: Enable BuildKit to support --cache-from by @zhztheplayer in #3964
  • [GLUTEN-3841][CH] Support spill in 2nd aggregate stage by @lgbo-ustc in #3772
  • [VL] Daily Update Velox Version (20231211) by @zhztheplayer in #3999
  • [VL] Fix StringToMap test failure by @PHILO-HE in #3995
  • [VL] Make bloom_filter_agg fall back when might_contain is not transformable by @zhli1142015 in #3994
  • [VL] Following #3996, fix CI error "Runtime factory already registered" by @zhztheplayer in #4001
  • [VL] Fix linking simdjson error when building benchmark by @PHILO-HE in #3960
  • [GLUTEN-4002][CH] Update InputIteratorTransformer metrics by @zzcclp in https://github.com/...

Gluten v1.1.0

30 Nov 10:12

Release Notes - Gluten - Version 1.1.0

We are excited to announce the release of Gluten-1.1.0.
This version is the culmination of work from 45 contributors, who have landed over 800 commits of features and bug fixes since 1.0.0.

Highlights (Velox backend only)

  • 20% performance improvement in Decision Support Benchmarks compared to v1.0.0
  • Support Spark 3.2 and Spark 3.3
  • Support Spark 3.4 (experimental)
  • Pass all Velox UTs and Spark 3.2/3.3 SQL-related UTs
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support File System: localfs, HDFS, S3, OSS(via s3a), GCS
  • Support File Format: Parquet, ORC
  • Support Data Lake: deltalake (experimental)
  • Support Data Types: Primitive Type, Decimal, Date, Timestamp, Array (partial), Map (partial), Struct (partial)
  • Support 28 common Spark Operators, detail here
  • Support 199 common Spark Functions, detail here
  • Support Dynamic Memory Pool and Spill
  • Support Velox UDF
  • Support Gluten UI to print fallback event in History Server
  • Support Hadoop HA and Kerberos
  • Velox code updated to 20231123 (commit-id: aff0cde)
  • Documentation improvements for supported features and configuration

Known Issues

  • Only static partition write is supported in Spark 3.2 and 3.3

New Features

#3722 [CH] improve mutex usage in shuffle writer
#2063 [CH] Spark sql config load dynamic by task
#3257 [VL] We may need more metrics collected by Velox
#3528 [VL] Construct unique partition/sort keys and removing overlapping sort key for window plan
#3381 [CH]Reuse last WholeStageTransformer instead of creating new one in FileFormatWriter
#2118 [CH] Support hive udtf
#2128 [CH]Support tablesample clause
#2163 [CH] support approx_percentile aggregate function
#2193 [CH] Support some array functions
#2207 [CH] Support function to_utc_timestamp/from_utc_timestamp
#2136 [CH] HiveTransform add metrics readBytes
#2439 [VL] array_aggregate support with lambda function
#2451 [CH] Support StaticInvoke function
#2460 Avoid force check Java thread in native side
#2465 Remove operator level fallback policy
#2472 [CH] Remove BasicScanExecTransformer#getInputFilePaths when CH support more general partition location parsing
#3187 [CH] Implement runtime native bloom filter
#2267 [CH] Support urldecoder which is used in reflect("java.net.URLDecoder", "decode", event.event_info['currenturl'], "UTF-8")
#2309 Implement Streaming Window in Velox backend to reduce the memory usage.
#2323 [CH] Build optimization
#2343 [VL] ShuffleWrite: Larger shuffle size than vanilla spark and long compression time
#2365 [CH] gluten should support setting max bytes for a partition for orc/parquet
#2390 [CH] Aligning the NULL and NaN compare semantics of Spark and CH
#2600 [CH] enhance S3 client caching
#2617 [VL][Spark 3.3+] support pushdown aggregate to native scan instead of fallback
#2619 [VL][Spark 3.3+] support match columns use fieldIds in native instead of fallback
#2667 [VL] Stacktrace-categorized memory allocation dumping for debugging
#2730 Request for documentation on how to write a backend for 3rd party engines
#2761 [DOC] A doc named index.md share same content with README.md
#2772 [VL] When performance degrades, what factors may affect it?
#2783 [VL]Run CI with DEBUG build mode to enhance stability
#2791 [VL] Support spark function: concat_ws
#2793 Code refactor: move some common code to a root module named common
#2807 Code cleanup: FunctionConfig may be useless
#2515 when we will support spark -gpu ,now we need spark -gpu feature to train big model
#2535 UnsupportedOperationException is abused
#2593 List parquet write semantic differences between Spark and Gluten
#2804 Handle timeZoneId for TimezoneAwareExpression
#2815 [VL] complex data type support in parquet scan
#2825 [VL] In Java, consolidate GlutenColumnarBatchSerializer and CelebornColumnarBatchSerializer
#2826 [VL] Use a dedicate class to maintain gluten native config
#2845 [VL] Separate each jni wrapper to different files
#2874 [VL] support spark.sql.decimalOperations.allowPrecisionLoss
#2877 [VL] Support read iceberg
#2905 [VL] Support percentile function
#2919 [VL] Support ORC format in HiveTableScanExecTransformer
#2956 [VL] Support NullType in Project
#2975 [VL] Track MemoryManager feature
#3015 [CH] ReusedExchange: Gluten does not touch it or does not support it
#3017 [VL] Allow users to set spill partitions/levels
#3033 [CH] Support aggregation spill for the second stage
#3049 [CORE] Statement level controls whether to use gluten
#3817 [CH] Optimize mergetree prewhere
#3704 [CH] support tuple subcolumn pruning for orc/parquet
#3784 DNM
#3144 [CH] Aggregation supports complicate type
#3715 [VL] Add support for GCS
#2106 [VL] CI: allow to benchmark TPCH performance on comment
#3702 [VL] Add sort based window support in velox backend
#2404 [VL] Enable Velox memory reclaimer for auto disk-spilling
#3082 [CORE] Support columnar CollectLimit
#3739 [VL] Add config to disable velox file handle cache
#3055 [VL] Use mixed memory (off-heap and on-heap) for native
#3077 [VL] EP: Centralized lifecycle management for C++ / JNI contextual objects
#3142 [VL] Tight Java-C++ object binding
#3075 [VL] Support static partition write in VL backend
#2533 Degrade Arrow version to 8.0 in VL backend.
#2629 Use Project + Unnest to implement Expand operator
#3132 Add streamingwindow support in velox backend
#3361 Support Spark 3.4 in Gluten.
#3425 [VL] Create Hdfs folder in Gluten side when writing hdfs file
#3541 [VL] Add minimal GHA CI job for debug build
[#3705](https://...

Gluten v1.0.0

14 Jul 03:07
bfe394b

Release Notes - Gluten - Version 1.0.0

Highlights (Velox backend only)

  • Support Spark 3.2 and Spark 3.3
  • Pass all Velox and Spark 3.2 UTs, and part of the Spark 3.3 UTs
  • Support Ubuntu 20.04/22.04, CentOS 7/8, alinux 3, Anolis 7/8
  • Support FileSystem: localfs, HDFS, S3, OSS (via s3a)
  • Support data types: Primitive type, Decimal, Date, Timestamp
  • Support 20 operators, detail here
  • Support 164 functions, detail here
  • Support native Parquet write
  • Support native ORC read
  • Support Intel® In-memory Analytics Accelerator (IAA/IAX) hardware accelerator in Shuffle compression
  • Support cap-based spill (static memory allocation) for join/agg/sort operator (experimental feature)
  • Support static build method via vcpkg
  • Support local cache (experimental feature)
  • 2.71x speedup in Decision Support Benchmark1 (TPC-H Like) testing
  • 2.29x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
  • Velox code updated to commit
  • Documentation improvements for supported features and configuration

Known Issues

  • Parquet write only supports the compression.codec, parquet.block.size and parquet.block.rows configurations
  • Velox backend does not support dynamic partition write and bucket write
  • Spill may throw OutOfMemoryException

New Features

Improvements


Gluten 0.5.0

07 Apr 09:32
3c3267a
Gluten 0.5.0 Pre-release

Change log

Generated on 2023-04-07

Gluten 0.5.0

Gluten 0.5.0 is the first preview release from the repository (https://github.com/oap-project/gluten).
In this release, we have merged 971 PRs and fixed 216 issues.

Here are the major highlights in Gluten 0.5.0:

  • Support Spark 3.2 and Spark 3.3
  • Support Ubuntu 20.04 or later
  • Support CentOS 7 and 8
  • Support JDK 8 only
  • Support GCC 9 or later
  • Use Substrait as unified plan
  • Use Velox as default backend engine
  • Use Celeborn as default RSS
  • Support the most popular data types, including Boolean, Byte, Short, Int, Long, Float, Double, Date, Decimal, String, etc.
  • Support Spill for Sort, Agg, and Join operators
  • Pass all Spark 3.2 unit tests
  • 2.5x speedup in Decision Support Benchmark1 (TPC-H Like) testing
  • 2x speedup in Decision Support Benchmark2 (TPC-DS Like) testing
  • Support Intel QAT accelerators in Shuffle compression

Limitations

  • Complex data types such as Array, Map, and Struct are not supported
  • OOM may occur in operators that do not support spill
  • Decimal result may mismatch in some cases

Features

#974 [CH] Support string repeat function
#1008 [CH] Support locate function
#1273 Implement cast decimal to int
#1223 [CH] support reading from S3 and using Clickhouse local cache to speed up
#1131 [Gluten-core] Add an option to only fallback once
#1165 Reduce GC Time when executing BHJ for CH backend.
#1147 [Gluten-core] Make validate failure logLevel configurable
#1100 Making transformer plan log more obvious
#1112 Refactor Gluten metrics and add apis for each backend
#926 gluten timezone not the same as backend
#1039 Remove compute pid metric in shuffle operator.
#882 Selective query execution
#959 Upgrade Arrow version to 11.0.0
#969 Docker for gluten running on centos 8
#986 Align and enrich metrics compare to Spark
#972 Can we separate native dynamic library from build generated jars?
#913 No Spark Shim Provider found for 3.2.0
#853 Support named struct type
#888 Clickhouse backend broadcast relation support r2c
#850 Add cast check in ExpressionTransformer
#825 Setup development environment for macOS
#788 Pass needed hadoop conf from driver to executor

Bugs Fixed

#1284 Scala double data is wrongly compared with null in a UT
#729 Validation failed for GlutenHashAggregateExecTransformer class
#799 This operator doesn't support doExecuteColumnar
#527 archives for Spark patch versions become unavailable on new releases affecting shims versioning
#523 Some basic failed SQL cases
#1028 [VL] SubstraitToVeloxPlan error
#858 Sort result mismatch issue with different input records.
#877 Array/Map DataType result mismatch issue when containing null value
#1227 [CH] Scalar subquery filters execute twice for parquet file
#1265 [CH] Rescale decimal trigger fallback
#1233 [CH] Fix fallback issue when reading csv files
#1235 [CH] Fix missing reading from the broadcasted value when executing DPP
#1234 [CH] Fix error 'Invalid number of columns in chunk pushed to OutputPort' when executing hash agg after union all
#1207 shims-spark32 and shims-spark33 may be depended on at the same time
#1161 Bundle built by buildbundle-veloxbe.sh for Spark3.3 is broken
#1210 [CH] Fix the wrong table path of the orders table for TPCH in UT
#1175 FileNotFoundException while executing spark jobs -.so files
#1179 [VL] CI is failing on boost's checksum
#1162 [CK]fix CoaleseBatches metrics
#1124 Memory management not suitable with Velox split preload feature.
#1149 Run tpc-ds core
#741 Handle remainder for the case that its right input is zero
#1090 [TPCH][VL] tpch has some query execution error logs but queries could finish and the result is correct
#1068 [VL] Managed memory leak in imported Spark UTs
#772 Velox does not install folly in centos8 by default, break compile in centos8.
#789 Jar conflicts on Arrow and Protobuf between Vanilla Spark and Gluten
#700 AARCH64 port of Gluten
#1027 [VL] unsupported method
#1072 [CH] Fix NPE when executing BatchScanExecTransformer.getInputFilePaths with MergeTree DS V2
#489 cannot build gluten (velox backend) in Amazon Linux 2
#1012 Enable local cache throw exception
#995 Fix memory leak for ClickHouse Backend
#914 System variables related to Folly could not be found when compiling gluten.
#990 Failed to build velox
#946 Upgrade arrow version to 10.0.1
#860 CH backend inset result not equals spark result
#601 Can't decide data type of null value in gluten test framework, when transforming InternalRow to DataFrame
#843 Unable to convert BHJ to SHJ by using hint
#826 ch_backend not support inset is empty
#815 Gluten + Velox backend does not support Struct dataset with same element name.
#563 Error compiling within -Pbackends-xx,spark-3.3,spark-ut
#560 An unsupportedOperationException interrupted the query execution
#770 VeloxRuntimeError when reading parquet file with only meta data
#800 [UT]ExpectedAnswer may not match SparkAnswer when is sorted
#676 WholeStageTransformerSuite#logForFailedTest() swallows exceptions
#790 Join RuntimeException when having duplicated equal-join keys
#757 Parquet scan not offloaded
#797 It won't load the libparquet.so.1000 when we use Gluten with Velox backend and run it on the yarn.
#784 No Spark Shim Provider found for 3.3.0
#547 Jar conflict issue
#727 build from local velox repo doesn't work

PRs

#1266 [GLUTEN-1246] [CORE] Fix scale may be negative issue
#1313 [VL] Update doc for centos7 install
#1312 [CH] Ignore ch backend tpcds suite
#1198 [VL] fix: Update Velox setup scripts for centos 7
#1294 [VL] Following #1185, do some clean-ups against Velox + Celeborn CI
[#1196](https://github.com/oa...