New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fastfunc] Use the python2/oils string model in Utf8DecodeOne #1967
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for following up on this! (delayed review due to Mother's day :) )
cpp/data_lang.cc
Outdated
// terminator (also setting UTF8_ERR_END_OF_STREAM). | ||
assert(0 <= start && start <= len(s)); | ||
// Bounds check for safety | ||
assert(0 <= start && start < len(s)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be DCHECK(), which is on everywhere except the release build
(it doesn't exist in CPython, only in our C++)
pyext/fastfunc.c
Outdated
// utf8_decode treats zero-bytes as C-style NUL-terminators. But python2/oils | ||
// strings treat these as zero-codepoints. Translate END_OF_STREAM errors | ||
// (resulting from zero-bytes) to valid zero-codepoints. | ||
if (decode_result.error == UTF8_ERR_END_OF_STREAM) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm it seems like this works but isn't it cleaner to not return UTF8_ERR_END_OF_STREAM in the first place?
Basically I imagine
0xce 0x00
-- truncated encoding relies NUL bytes, returns UTF8_ERR_TRUNCATED_BYTES0x00
after some other valid encoding -- this is UTF8_OK
I think it's possible to distinguish these cases in utf8decode()? Or would it also need a length?
Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case Because you don't want to get UTF8_OK in that case But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_TRUNCATED_BYTES In other words, I think we actually don't need UTF8_END_OF_STREAM at all? I think we can just get rid of it, and make the caller is responsible
It is a bit weird and subtle, but I think it makes sense (and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code) |
Yeah, you were right about removing the This is ready for another review. |
Looks very nice now, thanks! |
(thought) One way to think about this is that utf8_decode() does NOT take a NUL terminated string It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne() It does "one" thing which happens to involve a variable number of bytes processed |
Per the discussion in #1965, I've made the following updates:
fastfunc.Utf8DecodeOne
"safe" by exposing the python2/oils string model (over the C one)osh/string_ops.py
UTF-8 functions (fastfunc.Utf8DecodeOne
now handles that)Str => trim*()
are resilient to zero-codepoints