Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fastfunc] Use the python2/oils string model in Utf8DecodeOne #1967

Merged
merged 8 commits into from May 17, 2024

Conversation

PossiblyAShrub
Copy link
Collaborator

Per the discussion in #1965, I've made the following updates:

  • Made the fastfunc.Utf8DecodeOne "safe" by exposing the python2/oils string model (over the C one)
  • Removed the NUL special casing in our osh/string_ops.py UTF-8 functions (fastfunc.Utf8DecodeOne now handles that)
  • Added some tests to validate that string APIs like Str => trim*() are resilient to zero-codepoints

Copy link
Contributor

@andychu andychu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following up on this! (delayed review due to Mother's day :) )

cpp/data_lang.cc Outdated
// terminator (also setting UTF8_ERR_END_OF_STREAM).
assert(0 <= start && start <= len(s));
// Bounds check for safety
assert(0 <= start && start < len(s));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be DCHECK(), which is on everywhere except the release build

(it doesn't exist in CPython, only in our C++)

pyext/fastfunc.c Outdated
// utf8_decode treats zero-bytes as C-style NUL-terminators. But python2/oils
// strings treat these as zero-codepoints. Translate END_OF_STREAM errors
// (resulting from zero-bytes) to valid zero-codepoints.
if (decode_result.error == UTF8_ERR_END_OF_STREAM) {
Copy link
Contributor

@andychu andychu May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm it seems like this works but isn't it cleaner to not return UTF8_ERR_END_OF_STREAM in the first place?

Basically I imagine

  • 0xce 0x00 -- truncated encoding relies NUL bytes, returns UTF8_ERR_TRUNCATED_BYTES
  • 0x00 after some other valid encoding -- this is UTF8_OK

I think it's possible to distinguish these cases in utf8decode()? Or would it also need a length?

@andychu
Copy link
Contributor

andychu commented May 13, 2024

Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case

Because you don't want to get UTF8_OK in that case

But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_TRUNCATED_BYTES


In other words, I think we actually don't need UTF8_END_OF_STREAM at all? I think we can just get rid of it, and make the caller is responsible

  1. NUL terminating every string
  2. not over-running the buffer -- i.e. don't call it when on the NUL past the end of the string, only ones before the end of the string (which it knows)

It is a bit weird and subtle, but I think it makes sense

(and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code)

@PossiblyAShrub
Copy link
Collaborator Author

Yeah, you were right about removing the END_OF_STREAM error state; it simplified the code while preserving correctness. The "nul-terminator required but you must keep track of the buffer end" rule is certainly subtle, so I made sure to note it in the doc-comment.

This is ready for another review.

@andychu andychu changed the base branch from master to soil-staging May 17, 2024 05:41
@andychu andychu merged commit 6b787ad into soil-staging May 17, 2024
18 checks passed
@andychu
Copy link
Contributor

andychu commented May 17, 2024

Looks very nice now, thanks!

@andychu
Copy link
Contributor

andychu commented May 17, 2024

(thought)

One way to think about this is that utf8_decode() does NOT take a NUL terminated string

It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne()

It does "one" thing which happens to involve a variable number of bytes processed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants