Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base64 proposal #372

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
273 changes: 273 additions & 0 deletions proposals/stdlib/base64.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,273 @@
# Base64

* **Type**: Standard Library API proposal
* **Author**: Abduqodiri Qurbonzoda
* **Contributors**: Ilya Gorbunov, Roman Elizarov, Vsevolod Tolstopyatov, Dmitry Khalanskiy
* **Status**: Implemented in Kotlin 1.8.20
* **Prototype**: Implemented
* **Target issue**: [KT-57762](https://youtrack.jetbrains.com/issue/KT-55978/)
* **Discussion**: TBD

## Summary

Introduce a convenient API for Base64 encoding and decoding for converting binary data to a printable string and vice versa.

## Motivation

As Kotlin strives to become a multiplatform programming language, there is a growing demand for features that work across all platforms.
Base64 is one such feature. The absence of a multiplatform Base64 complicates the process of sharing code between multiple targets.
We have received multiple requests from users for a Kotlin/Common API for Base64 encoding and decoding. Here are a few examples:
* https://discuss.kotlinlang.org/t/how-to-use-base64-in-multiplatform-ios-android/19755
* https://discuss.kotlinlang.org/t/kotlin-native-base64-en-decoder-code/10043
* https://youtrack.jetbrains.com/issue/KT-9823/Stdlib-should-include-base16-32-64-encoding-and-decoding
* https://youtrack.jetbrains.com/issue/KT-43185/Add-base64-String-ByteArray-en-and-decoding-to-the-standard-library
* https://youtrack.jetbrains.com/issue/KT-6695/add-extension-function-to-turn-ByteArray-into-a-base64-String

Base64 is one of the most widely used binary-to-text encodings for several fundamental reasons:
* Base64 converts binary data into an ASCII string relatively efficiently.
It increases the data size by only about 33%.
* Historically, the ASCII characters used in Base64 encoding have been compatible with various systems and protocols.
This allows encoded data to be easily transmitted and stored in systems that may only support text data.
* The implementation of a Base64 encoder/decoder is relatively simple and fast.
Moreover, the encoding scheme is well-documented in standards such as [RFC 4648](https://www.rfc-editor.org/rfc/rfc4648#section-4).

The benefits listed above, combined with the wide availability of implementations in virtually all programming languages,
make it easy to understand why it is so ubiquitous.
By providing a convenient API for common use cases described below, we aim to make coding in Kotlin easier and more
enjoyable. Additionally, it streamlines code sharing across platforms.

## Use cases

### Storing or transmitting binary data in text-only formats

Sometimes binary data needs to be embedded into text-only formats such as JSON or XML,
making it easier to transmit over HTTP. Similarly, this encoding allows users to embed images directly into HTML or CSS files.
Our research indicates that in this use case, Base64 encoding is among the most frequently, if not the most, used encoding.

### Include binary data in URLs

Base64 encoding is used to embed binary data, such as small images, into URLs. It reduces the number of server requests.
Because the '+' and '/' characters used in the default Base64 encoding scheme are special characters in URLs,
they are replaced with '-' and '_' correspondingly in this use case.

### Email and MIME encoding

Email systems use Base64 to encode attachments. This is necessary because email protocols are primarily text-based and
cannot handle raw binary data directly. In this use case, the encoded result is usually split into multiple lines.


## Similar API review

The following similar APIs were analyzed:
* Java [`Base64`](https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) class.
* Guava [`BaseEncoding`](https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/io/BaseEncoding.html) class.
* Python [`base64`](https://docs.python.org/3/library/base64.html) module.
* Go [`base64`](https://pkg.go.dev/encoding/base64) package.
* Swift `Data` [methods](https://developer.apple.com/documentation/foundation/data#2888198) and [initializers](https://developer.apple.com/documentation/foundation/data#2890507).

## Proposal

Considering the use cases mentioned above, it is proposed to support the following three variants of Base64:
1. The standard Base 64 encoding scheme described in [`RFC 4648 section 4`](https://www.rfc-editor.org/rfc/rfc4648#section-4)
2. URL and filename safe encoding scheme described in [`RFC 4648 section 5`](https://www.rfc-editor.org/rfc/rfc4648#section-5)
3. MIME Base 64 encoding scheme described in [`RFC 2045 section 6.8`](https://www.rfc-editor.org/rfc/rfc2045#section-6.8)

For all encoding variants, it is proposed to support the following operations:
* Encode binary data (`ByteArray`) and return the result in `String` form.
* Decode Base64-encoded `CharSequence` and return the resulting binary data (`ByteArray`).

Quite often, the encode result is directed into a byte stream. To facilitate the use case and avoid `String` to
`ByteArray` conversion, the following operations are proposed:
* Encode binary data (`ByteArray`) and return the result in `ByteArray` form.
* Decode Base64-encoded `ByteArray` and return the resulting binary data (`ByteArray`).

Each Base64 character occupies one byte in `ByteArray`, represented by its [`code`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/code.html).

In scenarios where encoding or decoding is performed in chunks, or when the reuse of the buffer between multiple
operations is required, the following operations are proposed:
* Encode binary data (`ByteArray`) and append the result to a specified `Appendable`, returning the destination appendable.
* Decode Base64-encoded `CharSequence` and write the result to a specified `ByteArray`, returning the number of bytes written.
* Encode binary data (`ByteArray`) and write the result to a specified `ByteArray`, returning the number of bytes written.
* Decode Base64 encoded `ByteArray` and write the result to a specified `ByteArray`, returning the number of bytes written.

### Proposed API

It is proposed to introduce a `Base64` class that declare the operations above as member functions:
```kotlin
public open class Base64 {

public fun encode(source: ByteArray, startIndex: Int = 0, endIndex: Int = source.size): String

public fun <A : Appendable> encodeToAppendable(
source: ByteArray,
destination: A,
startIndex: Int = 0,
endIndex: Int = source.size
): A

public fun encodeToByteArray(source: ByteArray, startIndex: Int = 0, endIndex: Int = source.size): ByteArray

public fun encodeIntoByteArray(
source: ByteArray,
destination: ByteArray,
destinationOffset: Int = 0,
startIndex: Int = 0,
endIndex: Int = source.size
): Int

public fun decode(source: CharSequence, startIndex: Int = 0, endIndex: Int = source.length): ByteArray

public fun decodeIntoByteArray(
source: CharSequence,
destination: ByteArray,
destinationOffset: Int = 0,
startIndex: Int = 0,
endIndex: Int = source.length
): Int

public fun decode(source: ByteArray, startIndex: Int = 0, endIndex: Int = source.size): ByteArray

public fun decodeIntoByteArray(
source: ByteArray,
destination: ByteArray,
destinationOffset: Int = 0,
startIndex: Int = 0,
endIndex: Int = source.size
): Int

}
```

However, the class won't have public constructors. It is not supposed to be instantiated or inherited.
Instead, a predefined instance of this class will be provided for each supported Base64 variant:
```kotlin
public open class Base64 {

public companion object Default : Base64() {
public val UrlSafe: Base64
public val Mime: Base64
}

}
```

### Decoding and encoding of input and output streams in Kotlin/JVM

On JVM, it is beneficial to provide an input stream that decodes Base64 symbols from the underlying input stream and supplies binary data:
```kotlin
public fun InputStream.decodingWith(base64: Base64): InputStream
```
Similarly, an output stream that encodes the supplied binary data and writes Base64 symbols to the underlying output stream:
```kotlin
public fun OutputStream.encodingWith(base64: Base64): OutputStream
```

These functions are particularly handy when the data being encoded or decoded is large, and loading it all into memory
at once in not desirable. For example, they can be used for encoding an attachment from the file system and embedding it
into an email.

## Contracts

For encode operations:
* All `Base64` variants pad the result with `'='` to make it an integral multiple of 4 symbols if the number of bytes to be encoded is not an integral multiple of 3.
* `Base64.Default` and `Base64.UrlSafe` do not separate the result into multiple lines.
* `Base64.Mime` adds CRLF every 76 symbols, but it does not add a CRLF at the end of the encoded output.

For decode operations:
* All `Base64` variants do not require the symbols to be padded. However, if a padding character is present, the correct amount of it is required.
* All `Base64` variants interpret the padding character as the end of the encoded binary data, and subsequent symbols are prohibited.
* All `Base64` variants allow the padded bits to be non-zero.
* `Base64.Default` throws if it encounters a character outside "The Base64 Alphabet" specified in Table 1 of RFC 4648.
* `Base64.UrlSafe` throws if it encounters a character outside "The URL and Filename safe Base 64 Alphabet" specified in Table 2 of RFC 4648.
* `Base64.Mime` ignores all line separators and other characters outside "The Base64 Alphabet" specified in Table 1 of RFC 2045.

As one can see, encoding and decoding adhere to the RFC standards. However, the standards do not specify all corner
cases and allow deviations in some areas. For example, they do not mandate that decoders verify the padded bits are zeros,
and allow for both padded and non-padded encodings. Additionally, there is no consensus among the implementations available in
other programming languages and libraries. Therefore, we have selected a strict approach for encoding and a lenient
approach for decoding to maximize interoperability with Base64 strings generated by other libraries. As a result, while
different binary data will be encoded into different strings, different strings may be decoded into the same binary data.

### Examples

This is an example of how the proposed Base64 API could be used:

```kotlin
val foBytes = "fo".map { it.code.toByte() }.toByteArray()
Base64.Default.encode(foBytes) // "Zm8="
// Alternatively:
// Base64.encode(foBytes)

val foobarBytes = "foobar".map { it.code.toByte() }.toByteArray()
Base64.UrlSafe.encode(foobarBytes) // "Zm9vYmFy"

Base64.Default.decode("Zm8=") // foBytes
// Alternatively:
// Base64.decode("Zm8=")

Base64.UrlSafe.decode("Zm9vYmFy") // foobarBytes
```

## Alternatives

Because Kotlin didn't provide a Common API for Base64 encoding, community has implemented a few libraries targeting this gap:
* https://github.com/05nelsonm/encoding
* https://github.com/saschpe/Kase64

Providing Base64 encoding/decoding directly in the Kotlin standard library avoids additional dependencies,
ensures the API is consistent with other Kotlin stdlib API, and ensures it has a good performance.

## Naming

### Existing functions for encoding a String (UTF-16) to UTF-8 and decoding back

```kotlin
public fun String.encodeToByteArray(...): ByteArray

public fun ByteArray.decodeToString(...): String
```

### Proposed naming

When researching usages of Java `Base64`, we observed two patterns that influenced our design and naming.
First, the vast majority of users obtain an encoder or decoder and use it immediately,
e.g., `Base64.getDecoder().decode(...)`. This led us to avoid separating the encoder and decoder into different entities.
We also didn't find evidence that the benefits of such a separation outweigh the usage inconvenience.
Second, encoding to `String` and decoding from `String` are considerably more popular than encoding to `ByteArray` and
decoding from `ByteArray`. Thus, we decided to make the more frequently used functions shorter.

* Encode and return the result in `String` form: `encode`.
* Encode and return the result in `ByteArray` form: `encodeToByteArray`.
* Encode and append the result to a specified `Appendable`: `encodeToAppendable`.
* Encode and write the result to a specified `ByteArray`: `encodeIntoByteArray`.
* Decode (a `CharSequence` or `ByteArray`) and return the resulting `ByteArray`: `decode`.
* Decode (a `CharSequence` or `ByteArray`) and write the result to a specified `ByteArray`: `decodeIntoByteArray`.

Any naming suggestions are welcome.

## Dependencies

Only a subset of Kotlin Standard Library available on all supported platforms is required.

## Placement

* Standard Library
* `kotlin.io.encoding` package

## Reference implementation

* `Base64` class: https://github.com/JetBrains/kotlin/blob/master/libraries/stdlib/src/kotlin/io/encoding/Base64.kt
* `InputStream` and `OutputStream` extensions: https://github.com/JetBrains/kotlin/blob/master/libraries/stdlib/jvm/src/kotlin/io/encoding/Base64IOStream.kt
* Test cases for `Base64` member functions: https://github.com/JetBrains/kotlin/blob/master/libraries/stdlib/test/io.encoding/Base64Test.kt
* Test cases for `InputStream` and `OutputStream` extensions: https://github.com/JetBrains/kotlin/blob/master/libraries/stdlib/jvm/test/io/Base64IOStreamTest.kt

## Future advancements

* Adding the ability encode without padding
* e.g., `fun Base64.withoutPadding(): Base64` could be introduced
* Related request: [KT-60787](https://youtrack.jetbrains.com/issue/KT-57998/)
* Adding the ability to force presence or absence of padding chars when decoding
* e.g., `fun Base64.strictPaddingRules(): Base64` could be introduced
* Adding ability to force the padded bits are zeros
* e.g., `fun Base64.strictPaddedBits(): Base64` could be introduced
* Related request: [KT-56135](https://youtrack.jetbrains.com/issue/KT-56135/)