Logstash::Json#dump writes invalid JSON when source contains non-UTF8 strings #15833

Closed
yaauie opened this issue Jan 23, 2024 · 1 comment · Fixed by #16072
yaauie commented Jan 23, 2024

Logstash information:

  1. Logstash version (e.g. bin/logstash --version): 7.x, 8.x, 8.12
  2. Logstash installation source: all
  3. How is Logstash being run: any

Plugins installed: Default (bin/logstash-plugin list --verbose)

JVM (e.g. java -version): Bundled

OS version (uname -a if on a Unix-like system): any

Description of the problem including expected versus actual behavior:

When an event contains a field whose string has an encoding other than UTF-8, or carries a UTF-8 encoding flag but contains byte sequences that are not valid UTF-8, the bytes of that string are serialized by the LogStash::Json.dump helper without accounting for their encoding. This differs from what we observe with Ruby's JSON::dump (launch Logstash with -i pry for an interactive console):

[23] pry(#<LogStash::Runner>)> ibm437_string = "\xAB\xAC".force_encoding('IBM437')
=> "\xAB\xAC"
[27] pry(#<LogStash::Runner>)> encode = ->(encoder, string) { encoder.dump(string).then {|json| {bytes: json.bytes, encoding: json.encoding, valid: json.valid_encoding?} } }
=> #<Proc:0x2a673ee8 (pry):27 (lambda)>
[28] pry(#<LogStash::Runner>)> encode.(LogStash::Json, ibm437_string)
=> {:bytes=>[34, 171, 172, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[29] pry(#<LogStash::Runner>)> encode.(::JSON, ibm437_string)
=> {:bytes=>[34, 194, 189, 194, 188, 34], :encoding=>#<Encoding:UTF-8>, :valid=>true}

The behaviour also varies when the string cannot be encoded as UTF-8, such as with a BINARY string that is appropriately flagged as BINARY:

[36] pry(#<LogStash::Runner>)> binary_string = "\xa0\xa1".force_encoding('BINARY')
=> "\xA0\xA1"
[40] pry(#<LogStash::Runner>)> encode.(LogStash::Json, binary_string)
=> {:bytes=>[34, 160, 161, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[41] pry(#<LogStash::Runner>)> encode.(::JSON, binary_string)
Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8
from org/jruby/RubyString.java:6646:in `encode'

And with a UTF-8-flagged string that contains non-UTF-8 data:

[42] pry(#<LogStash::Runner>)> binary_string_claims_utf8 = "\xa0\xa1".force_encoding('UTF-8')
=> "\xA0\xA1"
[43] pry(#<LogStash::Runner>)> encode.(LogStash::Json, binary_string_claims_utf8)
=> {:bytes=>[34, 160, 161, 34], :encoding=>#<Encoding:UTF-8>, :valid=>false}
[44] pry(#<LogStash::Runner>)> encode.(::JSON, binary_string_claims_utf8)
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from /Users/yaauie/src/elastic/ls/vendor/bundle/jruby/3.1.0/gems/json-2.7.1-java/lib/json/common.rb:626:in `dump'

This is especially problematic when plugins supply an [event][original] that contains BINARY data, as plugins like the Elasticsearch Output use Logstash's json helper to serialize events, resulting in malformed strings being sent to and rejected by Elasticsearch.

As of now, we do not have best practices for BINARY data, or for encodings that lack conversions for all code points. As a workaround I wrote a script for the ruby filter that can coerce individual string fields, optionally stashing a base64-encoded copy of the original bytes in another field when the input cannot be reliably transcoded: utf8-coerce.logstash-filter-ruby.rb
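The linked script is not reproduced here, but its approach can be sketched as follows. This is a hedged illustration, not the actual utf8-coerce.logstash-filter-ruby.rb: `coerce_to_utf8` is a hypothetical name, and the real script operates on event fields rather than bare strings.

```ruby
require 'base64'

# Sketch of the workaround's approach (illustrative, not the real script):
# - strings flagged UTF-8 get invalid byte sequences scrubbed to U+FFFD
#   (String#encode is a no-op for same-encoding conversions, so scrub is used)
# - other encodings are transcoded to UTF-8, replacing anything unconvertible
# - when replacement occurred, the original bytes are stashed as base64
def coerce_to_utf8(value)
  utf8 =
    if value.encoding == Encoding::UTF_8
      value.scrub("\uFFFD")
    else
      value.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
    end
  result = { value: utf8 }
  result[:original_base64] = Base64.strict_encode64(value.b) if utf8.include?("\uFFFD")
  result
end
```

Lossless conversions (such as the IBM437 example above) come through cleanly transcoded; lossy ones keep a recoverable base64 copy of the original bytes.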

mashhurs commented Apr 6, 2024

Investigation comments:

High-level comments on the situation

  • We shouldn't compare JSON and LogStash::Json serialization behavior; instead we should focus on what the expected and desired behavior of LogStash::Json is. JSON clearly rejects the payload if it is invalid UTF-8, but that is not suitable in the Logstash ecosystem for a number of reasons: for example, event.original may contain invalid unicode while still claiming UTF-8 (technical details below), since it may come from various upstream sources (file, elastic-agent, UDP, TCP, etc.).
  • The expected behavior, given the current state, is to safely deliver JSON payloads to downstreams on the assumption that they are unicode.
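Concretely, the desired contract is that the serialized JSON text is itself valid UTF-8. A sketch of that expectation, using ::JSON and a plain String#encode for illustration:

```ruby
require 'json'

# The IBM437 example from the issue: once the source encoding is honored,
# the resulting JSON text is valid UTF-8 and matches Ruby's ::JSON output.
ibm437_string = "\xAB\xAC".force_encoding('IBM437')
json = JSON.dump(ibm437_string.encode(Encoding::UTF_8))

raise 'JSON text must be valid UTF-8' unless json.valid_encoding?
json.bytes  # => [34, 194, 189, 194, 188, 34]
```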

Tech details

  • Logstash internally uses the jrjackson library (based on FasterXML Jackson modules) to convert the payload to JSON (JrJackson::Base.generate(o, options)). jrjackson expects the byte stream to be UTF-8, and generates a Ruby String (which has an internal encoding, defaulting to UTF-8) from the given input bytes regardless of the input's actual encoding.

  • Ruby has its own representational layer for encodings, i.e. how the bytes of a string are stored. See this blog to understand the evolution of encoding in the Ruby environment. This means the bytes of a string in encoding A and the bytes of the same text in encoding B differ, in order to support string operations like String#reverse. As an example, "\xEFs" in Windows-1252 is the two bytes [239, 115], while the identical text "\u{EF}s" in UTF-8 is three bytes in Ruby's representation. Blindly copying the bytes without taking the source encoding into account produces an invalid outcome. This is where Ruby's Encoding::Converter comes in: when we use the converter, it accounts for the representation format (and whatever other internal details) and produces a valid UTF-8 string, identical to the three bytes of "\u{EF}s".

    [13] pry(#<LogStash::Runner>)> 
    [14] pry(#<LogStash::Runner>)> "\u{EF}s".bytes
     => [195, 175, 115]
    [15] pry(#<LogStash::Runner>)> "\xEFs".b.force_encoding(Encoding::WINDOWS_1252).bytes
     => [239, 115]
    [16] pry(#<LogStash::Runner>)> ec = Encoding::Converter.new("WINDOWS-1252", "UTF-8")
     => #<Encoding::Converter: Windows-1252 to UTF-8>
    [17] pry(#<LogStash::Runner>)> ec.convert("\xEFs".b.force_encoding(Encoding::WINDOWS_1252)).bytes
      => [195, 175, 115]
    
  • However, since Ruby is a flexible scripting language, it may also keep an invalid unicode byte stream in its representational layer. For example, Ruby stores the invalid UTF-8 string "\x8A" without rejecting it, and its encoding still defaults to UTF-8.

    [20] pry(#<LogStash::Runner>)> "\x8A".encoding
     => #<Encoding:UTF-8>
    [21] pry(#<LogStash::Runner>)> "\x8A".unicode_normalize
     ArgumentError: invalid byte sequence in UTF-8 from org/jruby/RubyString.java:3192:in `gsub'
    
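This invalid-but-flagged state can be detected and repaired with standard core Ruby APIs, String#valid_encoding? and String#scrub:

```ruby
s = "\x8A"                      # stored as-is; Ruby does not reject it

# The encoding flag claims UTF-8 even though the byte is invalid UTF-8:
flag  = s.encoding              # Encoding::UTF_8
valid = s.valid_encoding?       # false

# String#scrub replaces invalid byte sequences with U+FFFD (or a given string):
repaired = s.scrub("\uFFFD")    # "\uFFFD", which is valid UTF-8
```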

To wrap up
With this situation in Logstash, we have two areas to improve (both should happen before JrJackson::Base.generate(o, options)):

  1. Make sure the source input encoding is considered when generating UTF-8 output.
  2. Make sure the unicode data is validated before telling downstreams that it is UTF-8; where it is invalid, follow current internationalization best practice and use the replacement character (U+FFFD) to express the invalid bytes.
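The two improvements above can be sketched as a single pre-serialization pass. This is a hypothetical helper (not the actual fix) that normalizes every string in a payload before it reaches the JSON generator:

```ruby
# Hypothetical pre-serialization pass covering both improvements:
# (1) honor the source encoding when transcoding to UTF-8, and
# (2) validate/repair claimed-UTF-8 data, using U+FFFD for invalid bytes.
def normalize_utf8(obj)
  case obj
  when String
    if obj.encoding == Encoding::UTF_8
      obj.scrub("\uFFFD")   # point 2: repair invalid claimed-UTF-8 bytes
    else
      # point 1: transcode from the source encoding, replacing what cannot convert
      obj.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
    end
  when Hash  then obj.transform_values { |v| normalize_utf8(v) }
  when Array then obj.map { |e| normalize_utf8(e) }
  else obj
  end
end
```

Running the Windows-1252 example through this pass yields the same three UTF-8 bytes as the Encoding::Converter transcript above, while claimed-UTF-8 garbage degrades to U+FFFD instead of leaking invalid bytes downstream.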
