Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use TextDecoder for toString('utf8') #286

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mischnic
Copy link
Contributor

@mischnic mischnic commented Jan 7, 2021

Closes #268

This uses TextDecoder for toString('utf8') and toString().

I needed to update some tests so that they are in line with Node's native Buffer (which also makes them pass with TextDecoder), I hope this was correct?


Technically, it also supports latin1, utf-16le, but the conversion is different from Node for strings that aren't representable in these encodings:

latin1:

Buffer content: 'Ö' = <UTF8 Buffer c3 96>
Output:
 TextDecoder latin1: 'Ö' <Buffer c3 13>
 Node Buffer latin1: '�' <Buffer c3 96>

utf16: (TextDecoder adds an "�" at the end. https://www.compart.com/en/unicode/U+FFFD)

Buffer content: 'abc' = <UTF8 Buffer 61 62 63>
Output:
 TextDecoder utf-16le: e6 89 a1 ef bf bd
 Node utf16le:         e6 89 a1

@martinheidegger
Copy link
Contributor

Great initiative @mischnic. I am using Buffer as a drop-in replacement for Node's versions. Changing the tests wouldn't work for me as then Buffer couldn't be used for that. Would it be possible to adjust the output of decoderUTF8.decode(buf.slice(start, end)) to adjust for this case? Maybe remove the "�" with utf16le and replace 0x13 with 0x96 with lading encoding?

@mischnic
Copy link
Contributor Author

mischnic commented Jan 11, 2021

I only adjusted those test which were apparently deviating from Node's Buffer. For example try running this

> new Buffer([0xF4, 0x8F, 0x80]).toString().length
1

so this test was apparently wrong

  t.equal(
    new B([0xF4, 0x8F, 0x80]).toString(),
    '\uFFFD\uFFFD\uFFFD'
  )

I only used TextDecoder for utf8 because it seems to align with Buffer.toString("utf8"). The handling of the other encodings (utf16, latin) is still the same.

@mischnic
Copy link
Contributor Author

mischnic commented Jun 5, 2021

@feross ?

@mischnic
Copy link
Contributor Author

mischnic commented Dec 8, 2021

There is apparently some breakeven point where TextDecoder becomes faster then the existing implementation:

Using node perf/readUtf8.js, testing 256 byte buffers and new Buffer('7c'.repeat(5e7), 'hex') for the "big" variants

master:
	BrowserBuffer#readUtf8 x 414,259 ops/sec ±2.91% (85 runs sampled)
	NodeBuffer#readUtf8 x 486,114 ops/sec ±3.01% (84 runs sampled)
	BrowserBuffer#readUtf8 big x 0.98 ops/sec ±5.58% (7 runs sampled)
	NodeBuffer#readUtf8 big x 34.31 ops/sec ±1.56% (58 runs sampled)

this:
	BrowserBuffer#readUtf8 x 195,525 ops/sec ±2.06% (86 runs sampled)
	NodeBuffer#readUtf8 x 486,587 ops/sec ±2.24% (79 runs sampled)
	BrowserBuffer#readUtf8 big x 18.61 ops/sec ±11.77% (38 runs sampled)
	NodeBuffer#readUtf8 big x 35.19 ops/sec ±1.76% (61 runs sampled)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use TextDecoder for Buffer.toString?
2 participants