
Releases: TimDettmers/bitsandbytes

0.43.1: Improved CUDA setup/diagnostics + 8-bit serialization, CUDA 12.4 support, docs enhancements

11 Apr 18:36

Improvements:

  • Improved the serialization format for 8-bit weights; this change is fully backwards compatible. (#1164, thanks to @younesbelkada for the contributions and @akx for the review).
  • Added CUDA 12.4 support to the Linux x86-64 build workflow, expanding the library's compatibility with the latest CUDA versions. (#1171, kudos to @matthewdouglas for this addition).
  • Docs enhancement: Improved the instructions for installing the library from source. (#1149, special thanks to @stevhliu for the enhancements).

Bug Fixes:

  • Fixed 4-bit quantization with blocksize = 4096, where an illegal memory access was encountered. (#1160, thanks @matthewdouglas for fixing and @YLGH for reporting)


0.43.0: FSDP support, Official documentation, Cross-compilation on Linux and CI, Windows support

08 Mar 01:42

Improvements and New Features:

  • QLoRA + FSDP official support is now live! (#970 by @warner-benjamin and team) With FSDP you can train very large models (70B scale) on multiple 24GB consumer-type GPUs. See https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html for more details.
  • Introduced improvements to the CI process for enhanced performance and efficiency during builds, specifically enabling more effective cross-compilation on Linux platforms. This was accomplished by deprecating Make and migrating to CMake, as well as implementing new corresponding workflows. Huge thanks go to @wkpark, @rickardp, @matthewdouglas and @younesbelkada; #1055, #1050, #1111.
  • Windows should now be officially supported in bitsandbytes; install it with pip install bitsandbytes.
  • Updated installation instructions to provide more comprehensive guidance for users. This includes clearer explanations and additional tips for various setup scenarios, making the library more accessible to a broader audience (@rickardp, #1047).
  • Enhanced the library's compatibility and setup process, including fixes for CPU-only installations and improvements in CUDA setup error messaging. This effort aims to streamline the installation process and improve user experience across different platforms and setups (@wkpark, @akx, #1038, #996, #1012).
  • Set up new documentation at https://huggingface.co/docs/bitsandbytes/main with extensive new sections and content to help users better understand and utilize the library. Especially notable are the new API docs (big thanks to @stevhliu and @mishig25 from HuggingFace, #1012). The API docs were further expanded in #1075.

Bug Fixes:

  • Addressed a race condition in kEstimateQuantiles, enhancing the reliability of quantile estimation in concurrent environments (@pnunna93, #1061).
  • Fixed various minor issues, including typos in code comments and documentation, to improve code clarity and prevent potential confusion (@nairbv, #1063).

Backwards Compatibility

  • After upgrading from v0.42 to v0.43, when using 4bit quantization, models may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the code. For anyone interested in the details, see this comment.

Internal and Build System Enhancements:

  • Implemented several enhancements to the internal and build systems, including adjustments to the CI workflows, portability improvements, and build artifact management. These changes contribute to a more robust and flexible development process, ensuring the library's ongoing quality and maintainability (@rickardp, @akx, @wkpark, @matthewdouglas; #949, #1053, #1045, #1037).

Contributors:

This release is made possible thanks to the many active contributors that submitted PRs and many others who contributed to discussions, reviews, and testing. Your efforts greatly enhance the library's quality and user experience. It's truly inspiring to work with such a dedicated and competent group of volunteers and professionals!

We give a special thanks to @TimDettmers for managing to find a little bit of time for valuable consultations on critical topics, despite preparing for and touring the states applying for professor positions. We wish him the utmost success!

We also extend our gratitude to the broader community for your continued support, feedback, and engagement, which play a crucial role in driving the library's development forward.

4-bit serialization and bug fixes

08 Jan 01:19

This release adds 4-bit serialization, implemented by @poedator, to bitsandbytes. With this, you can call model.save() and model.load() for models that contain 4-bit bitsandbytes layers, meaning you can save and load 4-bit models. All of this is integrated with the Hugging Face transformers stack. The 0.42.0 release also comes with many bug fixes. See below for detailed change logs.
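For example, with the Hugging Face transformers integration the 4-bit weights can be saved and reloaded through the usual save_pretrained / from_pretrained calls. A minimal sketch, assuming a sufficiently recent transformers release (the model id and output directory are placeholders):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a model with 4-bit (NF4) bitsandbytes layers; the model id is a placeholder.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
                                             quantization_config=bnb_config,
                                             device_map="auto")

# Serialize the 4-bit weights and load them back.
model.save_pretrained("opt-350m-4bit")
reloaded = AutoModelForCausalLM.from_pretrained("opt-350m-4bit", device_map="auto")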

0.42.0

Features:

  • 4-bit serialization now supported. This enables 4-bit load/store. Thank you @poedator #753
  • The bitsandbytes library now has a version attribute: bitsandbytes.__version__ @rasbt #710

Bug fixes:

  • Fixed bugs in dynamic exponent data type creation. Thank you @RossM, @KohakuBlueleaf, @ArrowM #659 #227 #262 #152
  • Fixed an issue where 4-bit serialization would fail for layers without double quantization #868. Thank you, @poedator
  • Fixed an issue where calling .to() or .cuda() on a 4-bit layer twice would result in an error #867. Thank you, @jph00
  • Fixed a bug where a missing access permission in a path searched for CUDA would lead to an error @osma #677
  • Fixed a bug where the GOOGLE_VM_CONFIG_LOCK_FILE variable could cause errors in colab environments @akrentsel @xaptronic #715 #883 #622
  • Fixed a bug where kgetColRowStats (LLM.int8()) would fail for certain dimensions @LucQueen #905
  • Fixed a bug where the adjusted regular Embedding layer was not available via bnb.nn.Embedding @neel04 #563
  • Added the missing scipy requirement @dulalbert #525

Bug and CUDA fixes + performance

23 Jul 14:10

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We now trust PyTorch to find the proper CUDA binaries and use those. If you use a CUDA version that differs from PyTorch's, you can control the binary that is loaded for bitsandbytes by setting the BNB_CUDA_VERSION variable. See the custom CUDA guide for more information.
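For example, the override can be set in the environment before bitsandbytes is first imported. A minimal sketch, assuming the value 122 corresponds to a libbitsandbytes binary built for CUDA 12.2 and that the matching CUDA libraries are discoverable by the loader (see the custom CUDA guide):

import os

# Must be set before the first import of bitsandbytes.
# Assumption: "122" selects the binary built for CUDA 12.2; adjust to your setup.
os.environ["BNB_CUDA_VERSION"] = "122"

import bitsandbytes as bnb  # loads the overridden CUDA binary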

Besides that, this release features a wide range of bug fixes, CUDA 11.8 support for Ada and Hopper GPUs, and an update for 4-bit inference performance.

Previous 4-bit inference kernels were optimized for RTX 4090 and Ampere A40 GPUs, but performance was poor on A100 GPUs, which are common. In this release, A100 performance is improved (by roughly 40%) and is now faster than 16-bit inference, while RTX 4090 and A40 performance is slightly lower (about 10% lower).

This leads to approximate speedups compared to 16-bit (BF16) of roughly:

  • RTX 4090: 3.8x
  • RTX 3090 / A40: 3.1x
  • A100: 1.5x
  • RTX 6000: 1.3x
  • RTX 2080 Ti: 1.1x

0.41.0

Features:

  • Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
  • CUDA SETUP now no longer looks for libcuda and libcudart and instead relies on PyTorch's CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk

Bug fixes:

  • Fixed a bug where the default type of absmax was undefined, which led to errors if the default type was different from torch.float32. #553
  • Fixed a missing scipy dependency in requirements.txt. #544
  • Fixed a bug where a view operation could cause an error in 8-bit layers.
  • Fixed a bug where the CPU-only version of bitsandbytes would fail during import. #593 Thank you @bilelomrani
  • Fixed a bug where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
  • Removed outdated get_cuda_lib_handle calls that led to errors. #595 Thank you @ihsanturk
  • Fixed bug where read-permission was assumed for a file. #497
  • Fixed a bug where prefetchAsync led to errors on GPUs that support unified memory but not prefetching (Maxwell, SM52). #470 #451 #453 #477 Thank you @jllllll and @stoperro

Documentation:

  • Improved documentation for GPUs that do not support 8-bit matmul. #529
  • Added description and pointers for the NF4 data type. #543

User experience:

  • Improved handling of default compute_dtype for Linear4bit Layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16).
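A minimal sketch of this behavior, assuming bnb.nn.Linear4bit (the layer sizes and dtypes are placeholders): with a bfloat16 input and no compute_dtype given, the layer computes in bfloat16, while compute_dtype can still be passed explicitly to override the default:

import torch
import bitsandbytes as bnb

# No compute_dtype given: computation follows the (stable) input dtype, here bfloat16.
layer = bnb.nn.Linear4bit(1024, 1024, quant_type="nf4").cuda()
x = torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda")
y = layer(x)

# An explicit compute_dtype still overrides the default.
layer_fp32 = bnb.nn.Linear4bit(1024, 1024, compute_dtype=torch.float32, quant_type="nf4").cuda()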

Performance:

  • Improved 4-bit inference performance for A100 GPUs. This slightly degraded performance for A40/RTX 3090 and RTX 4090 GPUs.

Deprecated:

  • 8-bit quantization and optimizers that do not use blockwise quantization will be removed in 0.42.0. All blockwise methods will remain fully supported.

4-bit Inference

12 Jul 00:25

Efficient 4-bit Inference (NF4, FP4)

This release adds efficient inference routines for batch size 1. The expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are:

  • 2.2x for Turing (T4, RTX 2080, etc.)
  • 3.4x for Ampere (A100, A40, RTX 3090, etc.)
  • 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)

The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by separating a multi-batch 4-bit query into multiple requests with batch size 1.

No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
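A minimal sketch of this pattern, assuming bnb.nn.Linear4bit (the layer size and batch shape are placeholders; this is not the library's benchmark code): the fast path applies when the quantized layer sees a single row at a time, so a queued batch can be split into batch-size-1 calls:

import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(4096, 4096, bias=False, quant_type="nf4").cuda()
batch = torch.randn(8, 4096, dtype=torch.float16, device="cuda")  # 8 queued requests (placeholder)

# Each call has batch size 1, so it hits the fast 4-bit inference kernels.
outputs = [layer(row.unsqueeze(0)) for row in batch]
result = torch.cat(outputs, dim=0)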

Big thanks to @crowsonkb, @Birch-san, and @sekstini for some beta testing and helping to debug some early errors.

Changelog

Features:

  • Added 4-bit inference kernels for batch size=1. Currently supported are the NF4 and FP4 data types.
  • Added support for quantization of bfloat16 input data.

Bug fixes:

  • Added device variable for bitsandbytes layers to be compatible with PyTorch layers.

Deprecated:

  • Binaries for CUDA 11.2, 11.6 no longer ship with pip install bitsandbytes and need to be compiled from source.

4-bit QLoRA, Paged Optimizers, and 8-bit Memory Leak Bugfix

20 Jun 02:50

This release brings 4-bit quantization support for QLoRA fine-tuning and a critical fix for a bug that doubled the memory cost of 8-bit models when they were serialized. Furthermore, paged optimizers are introduced, including 8-bit Lion.

0.39.1

Features:

  • 4-bit matrix multiplication for Float4 and NormalFloat4 data types.
  • Added 4-bit quantization routines
  • Double quantization routines for 4-bit quantization
  • Paged optimizers for Adam and Lion (see the sketch after this list).
  • bfloat16 gradient / weight support for Adam and Lion with 8 or 32-bit states.
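
A minimal sketch of using a paged 8-bit optimizer, assuming bnb.optim.PagedAdamW8bit (the tiny model and learning rate are placeholders). Paged optimizer states use unified memory, so they can be evicted to CPU RAM under GPU memory pressure:

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)

# One training step: optimizer states are paged between CPU and GPU memory as needed.
loss = model(torch.randn(4, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()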

Bug fixes:

  • Fixed a bug where 8-bit models consumed twice the memory as expected after serialization (thank you @mryab)

Deprecated:

  • Kepler binaries (GTX 700s and Tesla K40/K80) are no longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.

8-bit Lion, 8-bit Load/Store from HF Hub

12 Apr 15:13

8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub

This release brings 8-bit Lion to bitsandbytes. Compared to standard 32-bit Adam, it is 8x more memory efficient.

Furthermore, models can now be serialized in 8-bit and pushed to the HuggingFace Hub. This means you can also load them from the Hub in 8-bit, making big models much easier to download and load into CPU memory.

To use this feature, you need the newest transformers release (this will likely be integrated into the HF transformers release tomorrow).
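For example, with the transformers integration an 8-bit model can be loaded, pushed to the Hub, and reloaded in 8-bit. A minimal sketch, assuming a sufficiently recent transformers release (the model id and repository id are placeholders):

from transformers import AutoModelForCausalLM

# Load with 8-bit (LLM.int8()) bitsandbytes layers and push the serialized weights.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m",  # placeholder model id
                                             load_in_8bit=True, device_map="auto")
model.push_to_hub("my-user/bloom-560m-8bit")  # placeholder repository id

# Anyone can then download and load the 8-bit checkpoint directly.
model_8bit = AutoModelForCausalLM.from_pretrained("my-user/bloom-560m-8bit", device_map="auto")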

In this release, CUDA 10.2 and GTX 700/K10 GPUs are deprecated in order to allow for broad support of bfloat16 in release 0.39.0.

Features:

  • Support for 32 and 8-bit Lion has been added. Thank you @lucidrains
  • Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub. Thank you @mryab
  • New bug report feature: python -m bitsandbytes now gives extensive debugging details to help diagnose CUDA setup failures.

Bug fixes:

  • Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins
  • Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.

Improvements:

  • Improved the CUDA Setup procedure by doing a more extensive search for CUDA libraries

Deprecated:

  • Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
  • Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0

Int8 Matmul backward for all GPUs

02 Feb 14:51

This release changed the default bitsandbytes matrix multiplication (bnb.matmul) to support memory-efficient backward by default. Additionally, matrix multiplication with 8-bit weights is supported for all GPUs.

During backprop, the Int8 weights are converted back to a row-major layout through an inverse index. The general matmul for all GPUs with Int8 weights is done by casting the weights from Int8 to the input's data type (FP32/TF32/BF16/FP16) and then doing standard matrix multiplication. As such, the matrix multiplication during backprop and on non-tensor-core devices is memory efficient, but slow.
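A minimal sketch of what this means in practice, assuming bnb.nn.Linear8bitLt (the layer size and batch are placeholders): the 8-bit layer now works on any GPU, and gradients flow to the inputs through the memory-efficient backward, while the frozen Int8 weights themselves are not updated:

import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False, threshold=6.0).cuda()
x = torch.randn(8, 4096, dtype=torch.float16, device="cuda", requires_grad=True)

out = layer(x)                      # on compute capability < 7.5 the Int8 weights are cast up
out.float().mean().backward()       # memory-efficient (but slower) backward through the Int8 weights
print(x.grad.shape)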

These contributions were the work of Alexander Borzunov and Yozh, thank you!

Features:

  • Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
  • Int8 now supported on all GPUs. On devices with compute capability < 7.5, the Int8 weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov

Improvements:

  • Improved logging for the CUDA detection mechanism.

Ada/Hopper+fake k-bit quantization

04 Jan 11:57

The 0.36.0 release brings a lot of bug fixes, improvements, and new features:

  • better automatic CUDA detection & setup
  • better automatic compilation instruction generation in the case of failures
  • CUDA 11.8 and 12.0 support
  • Ada (RTX 40s series) and Hopper (H100) support
  • Added fake k-bit float, int, and quantile quantization (2 <= k <= 8, Int8 storage)

Additional features include fake k-bit quantization and smaller block sizes for block-wise quantization, which are used in our k-bit Inference Scaling Laws work. Fake k-bit quantization is useful for simulating k-bit data types, but it does not provide memory or runtime benefits. Here is how you use these features.

Faster block-wise quantization, which now allows very small block sizes down to 64:

import torch
from bitsandbytes import functional as F

X = torch.randn(1024, 1024, device="cuda")          # placeholder tensor to quantize
q, state = F.quantize_blockwise(X, blocksize=64)     # 8-bit codes plus per-block quantization state
X = F.dequantize_blockwise(q, state, blocksize=64)

k-bit fake quantization via block-wise quantization:

# 4-bit float quantization stored as Int8
from bitsandbytes import functional as F
# 4-bit float with 2 exponent bits
code = F.create_fp8_map(signed=True, exponent_bits=2, precision_bits=1, total_bits=4).cuda()
q, state = F.quantize_blockwise(X, code=code) # q has 4-bit indices which represent values in the codebook
X = F.dequantize_blockwise(q, state)

0.36.0: Improvements, Ada/Hopper support, fake k-bit quantization.

Features:

  • CUDA 11.8 and 12.0 support added
  • support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0)
  • support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added
  • Added CUDA instruction generator to fix some installations.
  • Added additional block sizes for quantization {64, 128, 256, 512, 1024}
  • Added SRAM Quantile algorithm to quickly estimate less than 256 quantiles
  • Added option to suppress the bitsandbytes welcome message (@Cyberes)

Regression:

  • Compute capability 3.0 removed: the GTX 600 and 700 series are no longer supported (except GTX 780 and GTX 780 Ti)

Bug fixes:

  • fixed a bug where too long directory names would crash the CUDA SETUP #35 (@tomaarsen)
  • fixed a bug where CPU installations on Colab would run into an error #34 (@tomaarsen)
  • fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
  • fixed a bug where the CUDA setup failed due to a wrong function call.
  • fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
  • fixed a bug where the CUDA setup failed if the CUDA runtime was found but not the CUDA library.
  • fixed a bug where not finding the cuda runtime led to an incomprehensible error.
  • fixed a bug where, with CUDA missing, the default was an error instead of loading the CPU library
  • fixed a bug where the CC version of the GPU was not detected appropriately (@BlackHC)
  • fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements

Improvements:

  • multiple improvements in formatting, removal of unused imports, and slight performance improvements (@tomaarsen)
  • StableEmbedding layer now has device and dtype parameters to make it 1:1 replaceable with regular Embedding layers (@lostmsu)
  • runtime performance of block-wise quantization slightly improved
  • added an error message for the case where multiple libcudart.so files are installed and bitsandbytes picks the wrong one

CUDA 11.8 Support for Dreambooth finetuning

10 Oct 03:16

0.35.0

CUDA 11.8 support and bug fixes

Features:

  • CUDA 11.8 support added and binaries added to the PyPI release.

Bug fixes:

  • fixed a bug where too long directory names would crash the CUDA SETUP #35 (thank you @tomaarsen)
  • fixed a bug where CPU installations on Colab would run into an error #34 (thank you @tomaarsen)
  • fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52