Large strings decoded from latin1 and then encoded to utf8 has wrong size #22728

Closed
holsdjakne opened this issue Sep 6, 2018 · 5 comments
Labels
  • buffer: Issues and PRs related to the buffer subsystem.
  • confirmed-bug: Issues with confirmed bugs.
  • encoding: Issues and PRs related to the TextEncoder and TextDecoder APIs.

Comments

@holsdjakne

holsdjakne commented Sep 6, 2018

  • Version: v8.11.4
  • Platform: Linux xyz 4.17.5-200.fc28.x86_64 #1 SMP Tue Jul 10 13:39:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Subsystem:

Decoding a latin1 buffer larger than about 1 MB to a string and then encoding that string as UTF-8 gives a buffer with the same number of bytes as the latin1 input, even though non-ASCII characters need more bytes in UTF-8.

This seems to work properly on v10.x but not v8.x or v9.x.

Code that demonstrates the problem:

const fs = require('fs');
const s = 'Räksmörgås';
let ss = '';

const SIZE = (1024 * 1024);
// works:
//const SIZE = (1024 * 512);

while (ss.length < SIZE) {
    ss = ss + ss.length + ' ' + s + '\n';
}

// create a latin1 buffer we can decode
let l1Buffer = Buffer.from(ss, 'latin1');
let l1String = l1Buffer.toString('latin1');
// forcing a copy of the decoded string also fixes it:
// l1String = ('x' + l1String).substring(1, l1String.length + 1);
// create a utf8 buffer from the decoded latin1 string
let u8Buffer = Buffer.from(l1String, 'utf8');

console.log(l1Buffer.length);
console.log(u8Buffer.length);

if (l1Buffer.length === u8Buffer.length) {
    console.log('failed, should be different size');
} else {
    console.log('works');
}
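
For reference, the size the UTF-8 buffer should have can be predicted from the latin1 bytes alone: every byte at or above 0x80 maps to a code point in U+0080 to U+00FF, which UTF-8 encodes as two bytes. The following is a minimal sketch of that check; `expectedUtf8Length` is a hypothetical helper name used only for illustration, not a Node.js API, and the sample string is small enough that it should not trigger the bug described above.

```javascript
// Hypothetical helper (not part of Node.js): predicts how many bytes the
// text stored in a latin1 buffer needs once it is re-encoded as UTF-8.
function expectedUtf8Length(latin1Buffer) {
  let extra = 0;
  for (const byte of latin1Buffer) {
    // Bytes 0x80-0xFF are code points U+0080-U+00FF: two bytes each in UTF-8.
    if (byte >= 0x80) extra++;
  }
  return latin1Buffer.length + extra;
}

const sample = Buffer.from('R\u00e4ksm\u00f6rg\u00e5s', 'latin1'); // 'Räksmörgås'
const expected = expectedUtf8Length(sample);
const actual = Buffer.from(sample.toString('latin1'), 'utf8').length;

console.log(sample.length, expected, actual); // prints: 10 13 13
```

Comparing `expected` with `u8Buffer.length` in the reproduction above, instead of only checking that the two lengths differ, would give a stricter pass/fail condition.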
@jasnell jasnell added the v8.x label Sep 6, 2018
@jasnell
Member

jasnell commented Sep 6, 2018

Quick test on 8.11.4 confirms the difference in behavior between Node.js 8 and 10.

/cc @nodejs/intl @nodejs/buffer @srl295

@jasnell jasnell added the buffer Issues and PRs related to the buffer subsystem. label Sep 6, 2018
@vsemozhetbyt vsemozhetbyt added the encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. label Sep 6, 2018
@addaleax addaleax added the confirmed-bug Issues with confirmed bugs. label Sep 6, 2018
@addaleax
Member

addaleax commented Sep 6, 2018

This was fixed by #18216, I think.

@jasnell
Member

jasnell commented Sep 6, 2018

Yep, makes sense. That PR is still pending a backport to 8.x. Once the backport happens, this issue should be resolved.

@addaleax
Member

addaleax commented Sep 6, 2018

→ Backport is in #22731

addaleax pushed a commit to addaleax/node that referenced this issue Oct 2, 2018
* Respect `encoding` argument when the string is externalized.

* Copy the string when the write request can outlive the externalized
  string.

This commit removes `StringBytes::GetExternalParts()` because it is
fundamentally broken.

Fixes: nodejs#18146
Fixes: nodejs#22728
PR-URL: nodejs#18216
BethGriggs pushed a commit that referenced this issue Oct 16, 2018
* Respect `encoding` argument when the string is externalized.

* Copy the string when the write request can outlive the externalized
  string.

This commit removes `StringBytes::GetExternalParts()` because it is
fundamentally broken.

Fixes: #18146
Fixes: #22728
Backport-PR-URL: #22731
PR-URL: #18216

Reviewed-By: James M Snell <jasnell@gmail.com>
@starkwang
Contributor

This issue is fixed in #18216 and #22731 so I'd like to close it.
