Logstash::Json#dump writes invalid JSON when source contains non-UTF8 strings #15833
Logstash information:

Please include the following information:

- Logstash version (e.g. `bin/logstash --version`): 7.x, 8.x, 8.12
- Plugins installed: Default (`bin/logstash-plugin list --verbose`)
- JVM (e.g. `java -version`): Bundled
- OS version (`uname -a` if on a Unix-like system): any

Description of the problem including expected versus actual behavior:

When an event contains a field with a string that has an encoding other than `UTF-8`, or that has a `UTF-8` encoding flag but contains byte sequences that are not valid UTF-8, the bytes of that string are serialized by the `LogStash::Json.dump` helper without accounting for their encoding. This is different from what we observe using Ruby's `JSON::dump` (launch Logstash with `-i pry` for an interactive console).

The behaviour also varies when the string cannot be encoded as UTF-8, such as with a `BINARY` string that is appropriately flagged as `BINARY`, and with a UTF-8-flagged string that contains non-UTF-8 data.
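
To make the three kinds of input concrete, here is a minimal plain-Ruby sketch that builds one string of each kind and checks whether it can be represented as UTF-8. It does not call `LogStash::Json.dump` and does not reproduce any console output from this report; the helper name `utf8_safe?` and the sample bytes are assumptions made for illustration.

```ruby
# Raw bytes for "café" as encoded in ISO-8859-1; the lone 0xE9 byte is not valid UTF-8.
raw = "caf\xE9".b

# 1. Valid in a non-UTF-8 encoding that has a conversion to UTF-8.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# 2. Arbitrary bytes, appropriately flagged as BINARY (ASCII-8BIT).
binary = "\xFF\xFE\x00\x01".b

# 3. Flagged as UTF-8, but the bytes are not valid UTF-8.
bad_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: true when the string can be losslessly represented as UTF-8.
def utf8_safe?(str)
  # Encoding a UTF-8 string to UTF-8 is a no-op, so validity is checked directly.
  return str.valid_encoding? if str.encoding == Encoding::UTF_8

  str.encode(Encoding::UTF_8) # raises a subclass of EncodingError on failure
  true
rescue EncodingError
  false
end

{ "latin1" => latin1, "binary" => binary, "bad_utf8" => bad_utf8 }.each do |name, str|
  puts "#{name}: encoding=#{str.encoding} " \
       "valid_encoding?=#{str.valid_encoding?} utf8_safe?=#{utf8_safe?(str)}"
end
```

Only the first string can be transcoded without loss; the other two need an explicit decision about what to do with bytes that have no UTF-8 representation.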

This is especially problematic when plugins supply an `[event][original]` that contains `BINARY` data, as plugins like the Elasticsearch Output use Logstash's JSON helper to serialize events, resulting in malformed strings being sent to and rejected by Elasticsearch.

As of now, we do not have best practices for `BINARY` data, or for encodings that don't have conversions for all code points. As a workaround, I wrote a script for the ruby filter that can coerce individual string fields, optionally stashing a base64-encoded version of the original bytes in another field when the input cannot be reliably transcoded: utf8-coerce.logstash-filter-ruby.rb
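
The general shape of such a coercion can be sketched in plain Ruby. This is not the linked script; the function name `coerce_to_utf8`, its return shape, and the choice to replace undecodable bytes with U+FFFD are all assumptions made for illustration.

```ruby
# Hypothetical helper (not the linked script). Returns a two-element array:
#   [utf8_string, base64_of_original_bytes_or_nil]
# The base64 value is only present when the input could not be transcoded losslessly.
def coerce_to_utf8(str)
  if str.encoding == Encoding::UTF_8
    # Already UTF-8 and well-formed: nothing to do.
    return [str, nil] if str.valid_encoding?
  else
    begin
      # Lossless transcode from a convertible encoding such as ISO-8859-1.
      return [str.encode(Encoding::UTF_8), nil]
    rescue EncodingError
      # Fall through to the lossy path below.
    end
  end

  # Keep the exact original bytes as strict base64 ("m0" emits no line breaks).
  original_b64 = [str.b].pack("m0")

  # Reinterpret the bytes as UTF-8 and replace invalid sequences with U+FFFD.
  lossy = str.b.force_encoding(Encoding::UTF_8).scrub

  [lossy, original_b64]
end

coerced, stash = coerce_to_utf8("\xFF\xFE\x00\x01".b)
puts "valid UTF-8 after coercion: #{coerced.valid_encoding?}"
puts "original bytes (base64): #{stash}"
```

Inside a ruby filter, the two return values could be written back with `event.set`, with the base64 copy going to a separate field so the original bytes stay recoverable.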