[fastfunc] Use the python2/oils string model in Utf8DecodeOne #1967

PossiblyAShrub · 2024-05-12T04:56:34Z

Per the discussion in #1965, I've made the following updates:

- Made the fastfunc.Utf8DecodeOne "safe" by exposing the python2/oils string model (over the C one)
- Removed the NUL special casing in our osh/string_ops.py UTF-8 functions (fastfunc.Utf8DecodeOne now handles that)
- Added some tests to validate that string APIs like Str => trim*() are resilient to zero-codepoints
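
To illustrate the difference between the two string models, here is a minimal sketch in C. It is not code from this PR: `trimmed_len` is a hypothetical helper, and treating only the ASCII space as whitespace is an assumption made to keep the example short. With an explicit length, an embedded zero byte is ordinary data, while the C model stops at it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of s[0..len) after dropping trailing ASCII
 * spaces. It relies on the explicit length, so an embedded zero byte is
 * treated like any other byte (the python2/oils model). */
static size_t trimmed_len(const char* s, size_t len) {
  while (len > 0 && s[len - 1] == ' ') {
    len--;
  }
  return len;
}
```

For the 3-byte string `"a\0 "`, `trimmed_len` keeps the zero byte and returns 2, whereas `strlen` on the same bytes stops at the zero byte and reports 1.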

andychu

Thanks for following up on this! (delayed review due to Mother's day :) )

andychu · 2024-05-13T18:41:43Z

cpp/data_lang.cc

-  // terminator (also setting UTF8_ERR_END_OF_STREAM).
-  assert(0 <= start && start <= len(s));
+  // Bounds check for safety
+  assert(0 <= start && start < len(s));


This can be DCHECK(), which is on everywhere except the release build

(it doesn't exist in CPython, only in our C++)
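
For readers unfamiliar with the idea, a debug-only check can be sketched roughly like this. This is not the definition of `DCHECK()` used in the repo: `MY_DCHECK`, the `RELEASE_BUILD` flag, and `byte_at` are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical flag: define RELEASE_BUILD to compile the checks out. */
#ifdef RELEASE_BUILD
  #define MY_DCHECK(cond) ((void)0)
#else
  #define MY_DCHECK(cond) assert(cond)
#endif

/* Returns the byte at index `start`. The bounds check mirrors the one in
 * the diff above: `start` must point at a real byte, not at the terminator. */
static unsigned char byte_at(const char* s, int len, int start) {
  MY_DCHECK(0 <= start && start < len);
  return (unsigned char)s[start];
}
```

In builds without `RELEASE_BUILD`, an out-of-range `start` aborts at the check; with it defined, the check costs nothing.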

andychu · 2024-05-13T18:47:04Z

pyext/fastfunc.c

+  // utf8_decode treats zero-bytes as C-style NUL-terminators. But python2/oils
+  // strings treat these as zero-codepoints. Translate END_OF_STREAM errors
+  // (resulting from zero-bytes) to valid zero-codepoints.
+  if (decode_result.error == UTF8_ERR_END_OF_STREAM) {


Hm it seems like this works but isn't it cleaner to not return UTF8_ERR_END_OF_STREAM in the first place?

Basically I imagine

0xce 0x00 -- truncated encoding; detecting it relies on the NUL byte, returns UTF8_ERR_TRUNCATED_BYTES

0x00 after some other valid encoding -- this is UTF8_OK

I think it's possible to distinguish these cases in utf8_decode()? Or would it also need a length?

andychu · 2024-05-13T18:52:40Z

Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case

Because you don't want to get UTF8_OK in that case

But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_ERR_TRUNCATED_BYTES

In other words, I think we actually don't need UTF8_ERR_END_OF_STREAM at all? I think we can just get rid of it, and make the caller responsible for:

- NUL terminating every string
- not over-running the buffer -- i.e. don't call it when on the NUL past the end of the string, only on bytes before the end of the string (which it knows)

It is a bit weird and subtle, but I think it makes sense
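
A minimal sketch of that contract, under simplifying assumptions: `decode_one`, `count_codepoints`, and the `DEC_*` statuses are hypothetical names rather than the actual `utf8_decode()` API, and the sketch skips overlong, surrogate, and range checks. The decoder takes only a pointer; the caller tracks the length and relies on the terminating NUL to stop a truncated sequence from reading past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DEC_OK, DEC_BAD, DEC_TRUNCATED } dec_status;

/* Decode one codepoint starting at p. Stops at the first byte that is not a
 * continuation byte, which includes the caller's NUL terminator. */
static dec_status decode_one(const unsigned char* p, uint32_t* cp, size_t* n) {
  unsigned char b = p[0];
  size_t need;
  if (b < 0x80) {  /* ASCII, including a zero byte: a valid codepoint */
    *cp = b;
    *n = 1;
    return DEC_OK;
  } else if ((b & 0xE0) == 0xC0) {
    need = 1; *cp = b & 0x1F;
  } else if ((b & 0xF0) == 0xE0) {
    need = 2; *cp = b & 0x0F;
  } else if ((b & 0xF8) == 0xF0) {
    need = 3; *cp = b & 0x07;
  } else {
    *n = 1;
    return DEC_BAD;  /* stray continuation byte or invalid lead byte */
  }
  for (size_t i = 1; i <= need; i++) {
    if ((p[i] & 0xC0) != 0x80) {
      *n = i;
      return p[i] == 0x00 ? DEC_TRUNCATED : DEC_BAD;
    }
    *cp = (*cp << 6) | (p[i] & 0x3F);
  }
  *n = need + 1;
  return DEC_OK;
}

/* Caller side: s[len] must be NUL, and decode_one is never called at s[len].
 * Returns the number of codepoints, or -1 on any decode error. */
static long count_codepoints(const char* s, size_t len) {
  size_t i = 0;
  long count = 0;
  while (i < len) {
    uint32_t cp;
    size_t n;
    if (decode_one((const unsigned char*)s + i, &cp, &n) != DEC_OK) {
      return -1;
    }
    i += n;
    count++;
  }
  return count;
}
```

With this split, `"a\0b"` with length 3 counts as three codepoints, while a lone `0xce` followed by the terminator is reported as an error instead of running off the end.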

(and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code)

PossiblyAShrub · 2024-05-17T05:19:31Z

Yeah, you were right about removing the END_OF_STREAM error state; it simplified the code while preserving correctness. The "nul-terminator required but you must keep track of the buffer end" rule is certainly subtle, so I made sure to note it in the doc-comment.

This is ready for another review.

andychu · 2024-05-17T05:42:35Z

Looks very nice now, thanks!

andychu · 2024-05-17T06:21:50Z

(thought)

One way to think about this is that utf8_decode() does NOT take a NUL terminated string

It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne()

It does "one" thing which happens to involve a variable number of bytes processed

PossiblyAShrub added 3 commits May 11, 2024 22:42

Make fastfunc.Utf8DecodeOne follow the python2/oils string model

80f5e30

Validate that Str methods handle zero-bytes gracefully

9446a64

reword comment

57abe32

andychu reviewed May 13, 2024

PossiblyAShrub added 5 commits May 16, 2024 22:44

Remove UTF8_ERR_END_OF_STREAM error case because we cannot reliably detect it

1dcfdea

Merge branch 'master' into utf8-fastfunc-cleanup

52e5de0

fix typo

5edfb61

Use DCHECK in cpp/data_lang.cc

d3b502c

Note instead that the NUL more importantly prevents buffer overruns

944295d

andychu changed the base branch from master to soil-staging May 17, 2024 05:41

andychu merged commit 6b787ad into soil-staging May 17, 2024
18 checks passed
