Replace non-utf8 characters in bodies #228

prodion23 · 2024-04-26T15:53:11Z

This handles case where invalid utf8 is introduced because of truncation occuring in the middle of multibyte characters. Truncation now takes into account the character where truncation will occur, if it's length would put us over the truncation limit it is omitted.
(new unit test show example where rune at char 21-24 would put us over limit of 23, so the first 2 bytes are omitted and we only capture up to byte 21)

Attached perf results of this approach vs previous someBodyString[:maxLength] at bottom of this PR

puneet-traceable · 2024-04-26T16:38:06Z

sdk/instrumentation/bodyattribute/bodyattribute.go

-	span.SetAttribute(attrName, string(body))
+	bodyStr := string(body)
+	if !utf8.ValidString(bodyStr) {
+		bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement)


There is a string conversion from bytes and then a utf8 replace. Might become an expensive one.

I'll check perf

Validation should atleast use the right method without the conversion.
unicode/utf8 has Validate([]byte) bool -> https://pkg.go.dev/unicode/utf8#Valid
strings/utf8 has methods for string.
@ryanericson should we drop instead of correcting these?

Okay, updating to use utf8.Valid method shows similar perf. (unfortunately not good in either case)
if !utf8.Valid(body) {
bodyStr = strings.ToValidUTF8(string(body), utf8Replacement)
}

BenchmarkSetBodyWithValidUtf8-12 1947141 618.5 ns/op BenchmarkSetBodyWithInvalidUtf8-12 70920 18097 ns/op

func BenchmarkSetBodyWithValidUtf8(b *testing.B) { validUTF8 := bytes.Repeat([]byte("hello world "), 500) span := mock.NewSpan() b.ResetTimer() for i := 0; i < b.N; i++ { SetBodyAttribute("http.request.body", validUTF8, false, span) } } func BenchmarkSetBodyWithInvalidUtf8(b *testing.B) { baseString := "hello world " invalidBytes := []byte{0xff, 0xfe, 0xfd} largeInvalidUTF8 := make([]byte, 0, 6000) for len(largeInvalidUTF8) < 5970 { largeInvalidUTF8 = append(largeInvalidUTF8, baseString...) largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...) } span := mock.NewSpan() b.ResetTimer() for i := 0; i < b.N; i++ { SetBodyAttribute("http.request.body", largeInvalidUTF8, false, span) } }

I tried this, the non-conversion one is still faster if we just look at valid utf's case. I think due the number of replacements needed in case of invalid one's, we do not see any advantage. This is on arm though.

import ( "bytes" "strings" "testing" "unicode/utf8" ) var length int const utf8Replacement = "�" func SetBodyAttribute1(body []byte) int { if len(body) == 0 { return 0 } bodyStr := string(body) if !utf8.ValidString(bodyStr) { bodyStr = strings.ToValidUTF8(bodyStr, utf8Replacement) } return len(body) } func SetBodyAttribute2(body []byte) int { if len(body) == 0 { return 0 } var bodyStr string if !utf8.Valid(body) { bodyStr = strings.ToValidUTF8(string(body), utf8Replacement) } return len(bodyStr) } func BenchmarkSetBodyAttribute1WithInvalidUtf8(b *testing.B) { baseString := "hello world " invalidBytes := []byte{0xff, 0xfe, 0xfd} largeInvalidUTF8 := make([]byte, 0, 6000) for len(largeInvalidUTF8) < 5970 { largeInvalidUTF8 = append(largeInvalidUTF8, baseString...) largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...) } b.ResetTimer() for i := 0; i < b.N; i++ { length += SetBodyAttribute1(largeInvalidUTF8) } } func BenchmarkSetBodyAttribute2WithInvalidUtf8(b *testing.B) { baseString := "hello world " invalidBytes := []byte{0xff, 0xfe, 0xfd} largeInvalidUTF8 := make([]byte, 0, 6000) for len(largeInvalidUTF8) < 5970 { largeInvalidUTF8 = append(largeInvalidUTF8, baseString...) largeInvalidUTF8 = append(largeInvalidUTF8, invalidBytes...) } b.ResetTimer() for i := 0; i < b.N; i++ { length += SetBodyAttribute2(largeInvalidUTF8) } } func BenchmarkSetBodyAttribute1WithValidUtf8(b *testing.B) { validUTF8 := bytes.Repeat([]byte("hello world "), 500) b.ResetTimer() for i := 0; i < b.N; i++ { length += SetBodyAttribute1(validUTF8) } } func BenchmarkSetBodyAttribute2WithValidUtf8(b *testing.B) { validUTF8 := bytes.Repeat([]byte("hello world "), 500) b.ResetTimer() for i := 0; i < b.N; i++ { length += SetBodyAttribute2(validUTF8) } }

ubuntu@ub22:~/ws_go/samples/goagent$ go test -bench=. goos: linux goarch: arm64 pkg: samples/goagent BenchmarkSetBodyAttribute1WithInvalidUtf8-3 94282 13102 ns/op BenchmarkSetBodyAttribute2WithInvalidUtf8-3 94063 12888 ns/op BenchmarkSetBodyAttribute1WithValidUtf8-3 1000000 1380 ns/op BenchmarkSetBodyAttribute2WithValidUtf8-3 2686004 445.3 ns/op PASS ok samples/goagent 5.767s

Hmm thinking about this now, I don't like the perf implication that we need to do the validation at the cost of few cases where invalid bytes are there. @prodion23 let's just try to fix this at the source in case we somehow corrupted it ourselves. Truncation is one part and seems like otel SDK has a way to do UTF8 aware truncation now: open-telemetry/opentelemetry-go#3156.

Did you also observe that for valid UTF-8 chars (with foreign or multibyte chars) we're also seeing this error like @tim-mwangi says?

puneet-traceable · 2024-04-26T18:07:33Z

sdk/instrumentation/bodyattribute/bodyattribute.go


 	"github.com/hypertrace/goagent/sdk"
 )

+const utf8Replacement = "�"


Also, I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

I am not sure if substituting this sequence of bytes is sufficient to fix these errors.

Hmm, it should resolve any errors from invalid bytes being in the request or response body from the protobuf perspective? (at least I see this sample export successfully now)
Or are you thinking there are other cases where the replace wouldn't actually replace all invalid byte sequence?

If they appear in headers we'd still run into exception
(but not sure how likely that is for us to be concerned of right now?

I am trying to understand what causes this invalid byte sequence.

prodion23 · 2024-05-01T16:20:35Z

func BenchmarkNewTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, true)
	}
}

func BenchmarkOldTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, false)
	}
}

BenchmarkNewTruncateSetBodyWithValidUtf8
BenchmarkNewTruncateSetBodyWithValidUtf8-12    	  740908	      2143 ns/op
BenchmarkOldTruncateSetBodyWithValidUtf8
BenchmarkOldTruncateSetBodyWithValidUtf8-12    	  679801	      1777 ns/op

sdk/instrumentation/bodyattribute/bodyattribute.go

tim-mwangi · 2024-05-01T19:49:04Z

sdk/instrumentation/bodyattribute/bodyattribute.go


 	"github.com/hypertrace/goagent/sdk"
 )

+const utf8Replacement = ""


tim-mwangi

lgtm

tim-mwangi

Need to look at that codecov upload issue

puneet-traceable · 2024-05-02T06:34:49Z

func BenchmarkNewTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, true)
	}
}

func BenchmarkOldTruncateSetBodyWithValidUtf8(b *testing.B) {
	span := mock.NewSpan()
	b.ResetTimer()
	starter := []byte("世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界世界")
	var input []byte
	for len(input) < 5970 {
		input = append(input, starter...)
	}
	for i := 0; i < b.N; i++ {
		SetTruncatedBodyAttribute("http.request.body", input, 5960, span, false)
	}
}

BenchmarkNewTruncateSetBodyWithValidUtf8
BenchmarkNewTruncateSetBodyWithValidUtf8-12    	  740908	      2143 ns/op
BenchmarkOldTruncateSetBodyWithValidUtf8
BenchmarkOldTruncateSetBodyWithValidUtf8-12    	  679801	      1777 ns/op

@prodion23 b.ResetTimer() should have be after the input is ready.

puneet-traceable · 2024-05-02T06:38:15Z

We are relying on truncation flag. in eBPF we are not always able to correctly say that truncation has happened. We may need to look at that to ensure that truncation flag is passed correctly.

puneet-traceable · 2024-05-02T06:49:30Z

sdk/instrumentation/bodyattribute/bodyattribute.go

+	}
+
+	for idx := startIndex; idx < maxBytes; {
+		_, size := utf8.DecodeRune(b[idx:])


what is our assumption here? max-4 can land us in the middle of previous rune and when we try DecodeRune from that index, it may give runeError which we are ignoring.
So, are we just trying to make sure that we correct the last rune if it got corrupted because of truncation.
I am fine with that, just trying to make sure I understand the behaviour well and we add comments around the assumption/behaviour.

tim-mwangi · 2024-05-06T17:41:02Z

Hey @prodion23 we should also look into the codecov issue. I do not know how the other hypertrace repos handle it.

codecov · 2024-05-09T17:28:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 57.80%. Comparing base (87ab698) to head (37c9426).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   57.53%   57.80%   +0.26%     
==========================================
  Files          69       69              
  Lines        2720     2737      +17     
==========================================
+ Hits         1565     1582      +17     
  Misses       1085     1085              
  Partials       70       70

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

prodion23 · 2024-05-09T18:57:56Z

@tim-mwangi codecov issue is resolved, need to pass token now

tim-mwangi · 2024-05-09T19:07:34Z

examples/http-server/main.go

@@ -1,5 +1,3 @@
-//go:build ignore


put this back since it's in the examples folder.

Replace non-utf8 characters in bodies

fd65cf0

prodion23 requested review from puneet-traceable and tim-mwangi April 26, 2024 15:55

imports

b5d0a5b

puneet-traceable reviewed Apr 26, 2024

View reviewed changes

Update truncation method

29a0e55

ryanericson reviewed May 1, 2024

View reviewed changes

sdk/instrumentation/bodyattribute/bodyattribute.go Show resolved Hide resolved

tim-mwangi reviewed May 1, 2024

View reviewed changes

Add comment, remove unused const

104cd9d

tim-mwangi approved these changes May 1, 2024

View reviewed changes

ryanericson approved these changes May 1, 2024

View reviewed changes

puneet-traceable reviewed May 2, 2024

View reviewed changes

add comment about intention

5999ec5

fix codecov

bba5bc1

net upgrade for snyk

a4838c5

tim-mwangi reviewed May 9, 2024

View reviewed changes

skip example build

37c9426

prodion23 merged commit 7a089ab into main May 9, 2024
6 of 7 checks passed

tim-mwangi deleted the prevent-non-utf8-compat-bodies branch May 17, 2024 15:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace non-utf8 characters in bodies #228

Replace non-utf8 characters in bodies #228

prodion23 commented Apr 26, 2024 •

edited

puneet-traceable Apr 26, 2024 •

edited

prodion23 Apr 26, 2024

puneet-traceable Apr 26, 2024 •

edited

prodion23 Apr 26, 2024

puneet-traceable Apr 26, 2024

ryanericson Apr 26, 2024

puneet-traceable Apr 26, 2024

prodion23 Apr 26, 2024

puneet-traceable Apr 26, 2024

prodion23 commented May 1, 2024

tim-mwangi May 1, 2024

tim-mwangi left a comment

tim-mwangi left a comment

puneet-traceable commented May 2, 2024 •

edited

puneet-traceable commented May 2, 2024

puneet-traceable May 2, 2024 •

edited

tim-mwangi commented May 6, 2024

codecov bot commented May 9, 2024 •

edited

prodion23 commented May 9, 2024

tim-mwangi May 9, 2024

Replace non-utf8 characters in bodies #228

Replace non-utf8 characters in bodies #228

Conversation

prodion23 commented Apr 26, 2024 • edited

puneet-traceable Apr 26, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

puneet-traceable Apr 26, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

prodion23 commented May 1, 2024

Choose a reason for hiding this comment

tim-mwangi left a comment

Choose a reason for hiding this comment

tim-mwangi left a comment

Choose a reason for hiding this comment

puneet-traceable commented May 2, 2024 • edited

puneet-traceable commented May 2, 2024

puneet-traceable May 2, 2024 • edited

Choose a reason for hiding this comment

tim-mwangi commented May 6, 2024

codecov bot commented May 9, 2024 • edited

Codecov Report

prodion23 commented May 9, 2024

Choose a reason for hiding this comment

prodion23 commented Apr 26, 2024 •

edited

puneet-traceable Apr 26, 2024 •

edited

puneet-traceable Apr 26, 2024 •

edited

puneet-traceable commented May 2, 2024 •

edited

puneet-traceable May 2, 2024 •

edited

codecov bot commented May 9, 2024 •

edited