src: add node:encoding module #45823

anonrig · 2022-12-11T22:31:38Z

This is a proposal, and depends on my previous pull request

In summary, node:encoding enables unicode validation and transcoding (if needed). We can also go with a node:unicode name if users are confused with naming.

At the moment, I only added the following methods:

const encoding = require('node:encoding')

encoding.validateAscii(input: Buffer | Uint8Array | string): boolean
encoding.validateUtf8(input: Buffer | Uint8Array): boolean 
encoding.countUtf8(input: Buffer | Uint8Array)

TODO:

Add fast API
Add benchmarks
Add countUtf8ByteLength() and replace internal usage of node_buffer.cc byteLengthUtf8
Add normalize() for normalizing encodings

nodejs-github-bot · 2022-12-11T22:31:42Z

Review requested:

@nodejs/gyp
@nodejs/modules
@nodejs/startup

bnoordhuis · 2022-12-11T22:53:22Z

I'm the author of node-iconv and in my experience users care only about two things:

Convert a buffer from one encoding to another.
Like (1) but in a streaming fashion.

A handful of people also care about:

Transliteration (character substitution when conversion is lossy, e.g. utf-8 to ascii)

anonrig · 2022-12-11T23:03:49Z

I'm the author of node-iconv and in my experience users care only about two things:

Nice meeting with you @bnoordhuis, and thanks for your work on node-iconv 🎉

Convert a buffer from one encoding to another.

Would it make sense to have a convert(input, fromEncoding, toEncoding)?

jasnell · 2022-12-12T02:58:43Z

@anonrig that already exists. See buffer.transcode.

I don't think we need a new top level module. If these are added, they should be added to the buffer module. I think of these, counting utf8 bytes is likely the most useful.

mcollina · 2022-12-12T06:24:00Z

Could this also have a faster equivalent of Buffer.byteLength(string)? Or could that be used to speed that up?

benjamingr · 2022-12-12T08:15:28Z

@anonrig fwiw Ben is a project member for 10+ years with over 2000 commits to core. Or roughly 15 times more commits than you and I combined ^^

As for the change, what's the motivation? performance?

doc/api/encoding.md

anonrig · 2022-12-15T20:29:01Z

Could this also have a faster equivalent of Buffer.byteLength(string)? Or could that be used to speed that up?

@mcollina Once we're aligned with the path of this pull request, I'm going to add fast-api calls which will also speed up Byte.byteLength for UTF-8.

As for the change, what's the motivation? performance?

@benjamingr In my previous pull request (which is waiting to be reviewed at the moment), I added simdutf which provides performance boost to certain UTF operations. With this pull request, I wanted to enable more operations for all types of input, not specifically for buffer, and expose them in a new way to make drastic changes in naming conventions and provide direct access to certain operations like isAscii or isUtf8 or countUtf8(which is similar to Buffer.byteLength(input, 'utf8')) which does a lot of validations before reaching the path for utf8. Additionally, isUtf8 is helpful for Undici team.

I don't think we need a new top level module. If these are added, they should be added to the buffer module. I think of these, counting utf8 bytes is likely the most useful.

@jasnell IMHO, I don't believe that buffer is the correct place even if we expose it using require('buffer').isAscii because it does not include the context of encoding and will eventually bloat the buffer module with all different combinations of validations & operations specific to encoding. On the other hand: isAscii(input) also accepts string as an input, and it has nothing to do with buffer name.

cjihrig · 2022-12-15T21:01:56Z

I also have concerns about adding a new top level module here. As James mentioned, it's debatable if this functionality warrants a top level module. The very minimal initial API is also a bit of a concern. Furthermore, even with the node: prefix restriction, there is still a module on npm called encoding with 11M weekly downloads.

anonrig · 2022-12-22T21:23:57Z

I've made the decision to go with adding these methods to buffer module, and opened the first pull request. Appreciate the reviews: #45947

lemire · 2022-12-22T21:30:20Z

The high-speed native-build utf-8-validate by @lpinca has 1.3 M weekly downloads.

anonrig · 2023-01-01T21:07:55Z

I opened a new pull request for adding buffer.isAscii, therefore I'm closing this pull request. Thank you for all your reviews & time.

jimmywarting · 2023-01-02T22:24:33Z

Hmm, not a particular fan of adding even more new features onto node:buffer.
Some are staring away from using node:buffer for better cross compatible reasons by simply just using Uint8Array instead.

nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Dec 11, 2022

targos reviewed Dec 12, 2022

View reviewed changes

doc/api/encoding.md Outdated Show resolved Hide resolved

deps: add simdutf dependency

b2d4c4e

src: add node:encoding module

8df4bbf

anonrig force-pushed the feat/node-encoding branch from 87e3335 to 8df4bbf Compare December 15, 2022 20:30

anonrig mentioned this pull request Jan 1, 2023

buffer: add isAscii method #46046

Merged

anonrig closed this Jan 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src: add node:encoding module #45823

src: add node:encoding module #45823

anonrig commented Dec 11, 2022 •

edited

nodejs-github-bot commented Dec 11, 2022

bnoordhuis commented Dec 11, 2022

anonrig commented Dec 11, 2022

jasnell commented Dec 12, 2022

mcollina commented Dec 12, 2022

benjamingr commented Dec 12, 2022

anonrig commented Dec 15, 2022

cjihrig commented Dec 15, 2022

anonrig commented Dec 22, 2022

lemire commented Dec 22, 2022

anonrig commented Jan 1, 2023

jimmywarting commented Jan 2, 2023

src: add node:encoding module #45823

src: add node:encoding module #45823

Conversation

anonrig commented Dec 11, 2022 • edited

nodejs-github-bot commented Dec 11, 2022

bnoordhuis commented Dec 11, 2022

anonrig commented Dec 11, 2022

jasnell commented Dec 12, 2022

mcollina commented Dec 12, 2022

benjamingr commented Dec 12, 2022

anonrig commented Dec 15, 2022

cjihrig commented Dec 15, 2022

anonrig commented Dec 22, 2022

lemire commented Dec 22, 2022

anonrig commented Jan 1, 2023

jimmywarting commented Jan 2, 2023

anonrig commented Dec 11, 2022 •

edited