Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: add node:encoding module #45823

Closed
wants to merge 2 commits into from
Closed

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Dec 11, 2022

This is a proposal, and depends on my previous pull request

In summary, node:encoding enables unicode validation and transcoding (if needed). We can also go with a node:unicode name if users are confused with naming.

At the moment, I only added the following methods:

const encoding = require('node:encoding')

encoding.validateAscii(input: Buffer | Uint8Array | string): boolean
encoding.validateUtf8(input: Buffer | Uint8Array): boolean 
encoding.countUtf8(input: Buffer | Uint8Array)

TODO:

  • Add fast API
  • Add benchmarks
  • Add countUtf8ByteLength() and replace internal usage of node_buffer.cc byteLengthUtf8
  • Add normalize() for normalizing encodings

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/gyp
  • @nodejs/modules
  • @nodejs/startup

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Dec 11, 2022
@bnoordhuis
Copy link
Member

I'm the author of node-iconv and in my experience users care only about two things:

  1. Convert a buffer from one encoding to another.

  2. Like (1) but in a streaming fashion.

A handful of people also care about:

  1. Transliteration (character substitution when conversion is lossy, e.g. utf-8 to ascii)

@anonrig
Copy link
Member Author

anonrig commented Dec 11, 2022

I'm the author of node-iconv and in my experience users care only about two things:

Nice meeting with you @bnoordhuis, and thanks for your work on node-iconv 🎉

  1. Convert a buffer from one encoding to another.

Would it make sense to have a convert(input, fromEncoding, toEncoding)?

@jasnell
Copy link
Member

jasnell commented Dec 12, 2022

@anonrig that already exists. See buffer.transcode.

I don't think we need a new top level module. If these are added, they should be added to the buffer module. I think of these, counting utf8 bytes is likely the most useful.

@mcollina
Copy link
Member

Could this also have a faster equivalent of Buffer.byteLength(string)? Or could that be used to speed that up?

@benjamingr
Copy link
Member

@anonrig fwiw Ben is a project member for 10+ years with over 2000 commits to core. Or roughly 15 times more commits than you and I combined ^^

As for the change, what's the motivation? performance?

doc/api/encoding.md Outdated Show resolved Hide resolved
@anonrig
Copy link
Member Author

anonrig commented Dec 15, 2022

Could this also have a faster equivalent of Buffer.byteLength(string)? Or could that be used to speed that up?

@mcollina Once we're aligned with the path of this pull request, I'm going to add fast-api calls which will also speed up Byte.byteLength for UTF-8.

As for the change, what's the motivation? performance?

@benjamingr In my previous pull request (which is waiting to be reviewed at the moment), I added simdutf which provides performance boost to certain UTF operations. With this pull request, I wanted to enable more operations for all types of input, not specifically for buffer, and expose them in a new way to make drastic changes in naming conventions and provide direct access to certain operations like isAscii or isUtf8 or countUtf8(which is similar to Buffer.byteLength(input, 'utf8')) which does a lot of validations before reaching the path for utf8. Additionally, isUtf8 is helpful for Undici team.

I don't think we need a new top level module. If these are added, they should be added to the buffer module. I think of these, counting utf8 bytes is likely the most useful.

@jasnell IMHO, I don't believe that buffer is the correct place even if we expose it using require('buffer').isAscii because it does not include the context of encoding and will eventually bloat the buffer module with all different combinations of validations & operations specific to encoding. On the other hand: isAscii(input) also accepts string as an input, and it has nothing to do with buffer name.

@cjihrig
Copy link
Contributor

cjihrig commented Dec 15, 2022

I also have concerns about adding a new top level module here. As James mentioned, it's debatable if this functionality warrants a top level module. The very minimal initial API is also a bit of a concern. Furthermore, even with the node: prefix restriction, there is still a module on npm called encoding with 11M weekly downloads.

@anonrig
Copy link
Member Author

anonrig commented Dec 22, 2022

I've made the decision to go with adding these methods to buffer module, and opened the first pull request. Appreciate the reviews: #45947

@lemire
Copy link
Member

lemire commented Dec 22, 2022

The high-speed native-build utf-8-validate by @lpinca has 1.3 M weekly downloads.

@anonrig
Copy link
Member Author

anonrig commented Jan 1, 2023

I opened a new pull request for adding buffer.isAscii, therefore I'm closing this pull request. Thank you for all your reviews & time.

@anonrig anonrig closed this Jan 1, 2023
@jimmywarting
Copy link

Hmm, not a particular fan of adding even more new features onto node:buffer.
Some are staring away from using node:buffer for better cross compatible reasons by simply just using Uint8Array instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants