Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JavaScript String length 转换为 byteLength #47

Open
xiaoxiaojx opened this issue Nov 8, 2022 · 0 comments
Open

JavaScript String length 转换为 byteLength #47

xiaoxiaojx opened this issue Nov 8, 2022 · 0 comments
Labels
Node.js Node.js® is a JavaScript runtime built on Chrome's V8 JavaScript engine.

Comments

@xiaoxiaojx
Copy link
Owner

xiaoxiaojx commented Nov 8, 2022

看 Node.js 中这段代码较为疑惑, 如下 StorageSize 函数通过 JavaScript 字符串的长度(str->Length)估算出转换为 UTF-8 编码会占用的最大字节数为 3 * str->Length(), 说明长度为 1 的 JavaScript 字符串转换为 UTF-8 编码最多需要 3 个字节来存储, 那么这个结论是如何得出来的?

Maybe<size_t> StringBytes::StorageSize(Isolate* isolate,
                                       Local<Value> val,
                                       enum encoding encoding) {
  HandleScope scope(isolate);
  size_t data_size = 0;

  Local<String> str;
  if (!val->ToString(isolate->GetCurrentContext()).ToLocal(&str))
    return Nothing<size_t>();

  switch (encoding) {
    // 省略 ...
    
    case UTF8:
      data_size = 3 * str->Length();
      break

    default:
      CHECK(0 && "unknown encoding");
      break;
  }

  return Just(data_size);
}

MDN String length: The length read-only property of a string contains the length of the string in UTF-16 code units.

通过查阅 MDN 发现 JavaScript 字符串是 UTF-16 编码, 那么 UTF-16 的编码规则是怎样的了?

image

通过查阅 UTF-16 维基百科 发现 UTF-16 共2种情况的编码, 码点范围 0-65535 的字符在 UTF-16 是 2 个字节, 65536 以上为 4 个字节

最后我们再查阅 统一码 百度百科 发现 UTF-8 共4种情况的编码

image

于是我们可以从码点范围 0-127, 128-2047, 2048-65535 中任意取一个数来验证 JavaScript 字符串的长度与转换为 UTF-8 编码的字节数的关系

// 码点范围 0~127, "1".codePointAt(): 49
"1".length
// length: 1, Buffer.byteLength("1", "utf16le"): 2, Buffer.byteLength("1", "utf8"): 1

// 码点范围 128~2047, "®".codePointAt(): 174
"®".length
// length: 1, Buffer.byteLength("®", "utf16le"): 2, Buffer.byteLength("®", "utf8"): 2

// 码点范围 2048~65535, "多".codePointAt(): 22810
"多".length
// length: 1, Buffer.byteLength("多", "utf16le"): 2, Buffer.byteLength("多", "utf8"): 3

// 码点范围 65536~2097151, "𐀀".codePointAt(): 65536
"𐀀".length
// length: 2, Buffer.byteLength("𐀀", "utf16le"): 4, Buffer.byteLength("𐀀", "utf8"): 4

上面的验证结果来看, 当 JavaScript 字符串的长度为 1 时 UTF-8 编码的字节数可能为1个或者2个或者3个, ✅ 从而验证了 JavaScript 字符串转换为 UTF-8 编码最多需要 3 * str->Length() 个字节来存储。

@xiaoxiaojx xiaoxiaojx added the Node.js Node.js® is a JavaScript runtime built on Chrome's V8 JavaScript engine. label Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Node.js Node.js® is a JavaScript runtime built on Chrome's V8 JavaScript engine.
Projects
None yet
Development

No branches or pull requests

1 participant