Logstash::Json#dump writes invalid JSON when source contains non-UTF8 strings #15833

Closed
yaauie opened this issue Jan 23, 2024 · 1 comment · Fixed by #16072
yaauie commented Jan 23, 2024

Logstash information:

  1. Logstash version (e.g. bin/logstash --version): 7.x, 8.x, 8.12
  2. Logstash installation source: all
  3. How is Logstash being run: any

Plugins installed: Default (bin/logstash-plugin list --verbose)

JVM (e.g. java -version): Bundled

OS version (uname -a if on a Unix-like system): any

Description of the problem including expected versus actual behavior:

When an event contains a field whose string has an encoding other than UTF-8, or carries a UTF-8 encoding flag but contains byte sequences that are not valid UTF-8, the bytes of that string are serialized by the LogStash::Json.dump helper without accounting for their encoding. This differs from what we observe with Ruby's JSON::dump (launch Logstash with -i pry for an interactive console):

[23] pry(#<LogStash::Runner>)> ibm437_string = "\xAB\xAC".force_encoding('IBM437')
=> "\xAB\xAC"
[27] pry(#<LogStash::Runner>)> encode = ->(encoder, string) { encoder.dump(string).then {|json| {bytes: json.bytes, encoding: json.encoding, valid: json.valid_encoding?} } }
=> #<Proc:0x2a673ee8 (pry):27 (lambda)>
[28] pry(#<LogStash::Runner>)> encode.(LogStash::Json, ibm437_string)
=> {:bytes=>[34, 171, 172, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[29] pry(#<LogStash::Runner>)> encode.(::JSON, ibm437_string)
=> {:bytes=>[34, 194, 189, 194, 188, 34], :encoding=>#<Encoding:UTF-8>, :valid=>true}

The behaviour also varies when the string cannot be encoded as UTF-8, such as with a BINARY string that is appropriately flagged as BINARY:

[36] pry(#<LogStash::Runner>)> binary_string = "\xa0\xa1".force_encoding('BINARY')
=> "\xA0\xA1"
[40] pry(#<LogStash::Runner>)> encode.(LogStash::Json, binary_string)
=> {:bytes=>[34, 160, 161, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[41] pry(#<LogStash::Runner>)> encode.(::JSON, binary_string)
Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8
from org/jruby/RubyString.java:6646:in `encode'

And with a UTF-8-flagged string that contains non-UTF-8 data:

[42] pry(#<LogStash::Runner>)> binary_string_claims_utf8 = "\xa0\xa1".force_encoding('UTF-8')
=> "\xA0\xA1"
[43] pry(#<LogStash::Runner>)> encode.(LogStash::Json, binary_string_claims_utf8)
=> {:bytes=>[34, 160, 161, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[44] pry(#<LogStash::Runner>)> encode.(::JSON, binary_string_claims_utf8)
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from /Users/yaauie/src/elastic/ls/vendor/bundle/jruby/3.1.0/gems/json-2.7.1-java/lib/json/common.rb:626:in `dump'

This is especially problematic when plugins supply an [event][original] that contains BINARY data, as plugins like the Elasticsearch Output use Logstash's json helper to serialize events, resulting in malformed strings being sent to and rejected by Elasticsearch.

As of now, we do not have best practices for BINARY data, or for encodings that lack conversions for all code points. As a workaround I wrote a script for the ruby filter that can coerce individual string fields, optionally stashing a base64-encoded copy of the original bytes in another field when the input cannot be reliably transcoded: utf8-coerce.logstash-filter-ruby.rb
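The linked script is not reproduced here, but its approach can be sketched as follows. This is a hedged illustration, not the actual utf8-coerce.logstash-filter-ruby.rb: `coerce_to_utf8` is a hypothetical name, and the real script operates on event fields rather than bare strings.

```ruby
require 'base64'

# Sketch of the workaround's approach (illustrative, not the real script):
# - strings flagged UTF-8 get invalid byte sequences scrubbed to U+FFFD
#   (String#encode is a no-op for same-encoding conversions, so scrub is used)
# - other encodings are transcoded to UTF-8, replacing anything unconvertible
# - when replacement occurred, the original bytes are stashed as base64
def coerce_to_utf8(value)
  utf8 =
    if value.encoding == Encoding::UTF_8
      value.scrub("\uFFFD")
    else
      value.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
    end
  result = { value: utf8 }
  result[:original_base64] = Base64.strict_encode64(value.b) if utf8.include?("\uFFFD")
  result
end
```

Lossless conversions (such as the IBM437 example above) come through cleanly transcoded; lossy ones keep a recoverable base64 copy of the original bytes.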

mashhurs commented Apr 6, 2024

Investigation comments:

High-level comments on the situation

  • We shouldn't compare JSON and LogStash::Json serialization behavior; instead we should focus on what the expected and desired behavior of LogStash::Json is. JSON clearly rejects the payload if it is invalid UTF-8, but that is not suitable in the Logstash ecosystem for a number of reasons: for example, event.original may contain invalid unicode while still claiming UTF-8 (technical details below), since it may come from various upstream sources (file, elastic-agent, UDP, TCP, etc.).
  • The expected behavior, given the current state, is to safely deliver JSON payloads to downstreams on the assumption that they are unicode.
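Concretely, the desired contract is that the serialized JSON text is itself valid UTF-8. A sketch of that expectation, using ::JSON and a plain String#encode for illustration:

```ruby
require 'json'

# The IBM437 example from the issue: once the source encoding is honored,
# the resulting JSON text is valid UTF-8 and matches Ruby's ::JSON output.
ibm437_string = "\xAB\xAC".force_encoding('IBM437')
json = JSON.dump(ibm437_string.encode(Encoding::UTF_8))

raise 'JSON text must be valid UTF-8' unless json.valid_encoding?
json.bytes  # => [34, 194, 189, 194, 188, 34]
```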

Tech details

  • Logstash internally uses the jrjackson library (based on FasterXML Jackson modules) to convert the payload to JSON (JrJackson::Base.generate(o, options)). jrjackson expects the byte stream to be UTF-8, and generates a Ruby String (which has an internal encoding, defaulting to UTF-8) from the given input bytes regardless of the input's actual encoding.

  • Ruby has its own representational layer for encodings, i.e. how the bytes of a string are stored. See this blog to understand the evolution of encoding in the Ruby environment. This means the bytes of a string in encoding A and the bytes of the same text in encoding B differ, in order to support string operations like String#reverse. As an example, "\xEFs" in Windows-1252 is the two bytes [239, 115], while the identical text "\u{EF}s" in UTF-8 is three bytes in Ruby's representation. Blindly copying the bytes without taking the source encoding into account produces an invalid outcome. This is where Ruby's Encoding::Converter comes in: when we use the converter, it accounts for the representation format (and whatever other internal details) and produces a valid UTF-8 string, identical to the three bytes of "\u{EF}s".

    [13] pry(#<LogStash::Runner>)> 
    [14] pry(#<LogStash::Runner>)> "\u{EF}s".bytes
     => [195, 175, 115]
    [15] pry(#<LogStash::Runner>)> "\xEFs".b.force_encoding(Encoding::WINDOWS_1252).bytes
     => [239, 115]
    [16] pry(#<LogStash::Runner>)> ec = Encoding::Converter.new("WINDOWS-1252", "UTF-8")
     => #<Encoding::Converter: Windows-1252 to UTF-8>
    [17] pry(#<LogStash::Runner>)> ec.convert("\xEFs".b.force_encoding(Encoding::WINDOWS_1252)).bytes
      => [195, 175, 115]
    
  • However, since Ruby is a flexible scripting language, it may also keep an invalid unicode byte stream in its representational layer. For example, Ruby stores the invalid UTF-8 string "\x8A" without rejecting it, and its encoding still defaults to UTF-8.

    [20] pry(#<LogStash::Runner>)> "\x8A".encoding
     => #<Encoding:UTF-8>
    [21] pry(#<LogStash::Runner>)> "\x8A".unicode_normalize
     ArgumentError: invalid byte sequence in UTF-8 from org/jruby/RubyString.java:3192:in `gsub'
    
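This invalid-but-flagged state can be detected and repaired with standard core Ruby APIs, String#valid_encoding? and String#scrub:

```ruby
s = "\x8A"                      # stored as-is; Ruby does not reject it

# The encoding flag claims UTF-8 even though the byte is invalid UTF-8:
flag  = s.encoding              # Encoding::UTF_8
valid = s.valid_encoding?       # false

# String#scrub replaces invalid byte sequences with U+FFFD (or a given string):
repaired = s.scrub("\uFFFD")    # "\uFFFD", which is valid UTF-8
```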

To wrap up
With this situation in Logstash, we have two areas to improve (both should happen before JrJackson::Base.generate(o, options)):

  1. Make sure the source input encoding is considered when generating UTF-8 output.
  2. Make sure the unicode data is validated before telling downstreams that it is UTF-8; where it is invalid, follow current internationalization best practice and use the replacement character (U+FFFD) to express the invalid bytes.
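The two improvements above can be sketched as a single pre-serialization pass. This is a hypothetical helper (not the actual fix) that normalizes every string in a payload before it reaches the JSON generator:

```ruby
# Hypothetical pre-serialization pass covering both improvements:
# (1) honor the source encoding when transcoding to UTF-8, and
# (2) validate/repair claimed-UTF-8 data, using U+FFFD for invalid bytes.
def normalize_utf8(obj)
  case obj
  when String
    if obj.encoding == Encoding::UTF_8
      obj.scrub("\uFFFD")   # point 2: repair invalid claimed-UTF-8 bytes
    else
      # point 1: transcode from the source encoding, replacing what cannot convert
      obj.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
    end
  when Hash  then obj.transform_values { |v| normalize_utf8(v) }
  when Array then obj.map { |e| normalize_utf8(e) }
  else obj
  end
end
```

Running the Windows-1252 example through this pass yields the same three UTF-8 bytes as the Encoding::Converter transcript above, while claimed-UTF-8 garbage degrades to U+FFFD instead of leaking invalid bytes downstream.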
