Retry on bad request (resource_already_exists_exception) should not happen #983

Open
2 tasks
applike-ss opened this issue Aug 30, 2022 · 6 comments

Comments

@applike-ss
Contributor

(check apply)

  • read the contribution guideline
  • (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.

Problem

We're seeing the following warning in Fluentd. It indicates that Fluentd's Elasticsearch output plugin tried to create a data stream that had already been created (likely by a different Fluentd pod in our Kubernetes cluster).

2022-08-30 09:54:49 +0000 [warn]: #0 Could not communicate to Elasticsearch, resetting connection and trying again. [400] {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"}],"type":"resource_already_exists_exception","reason":"data_stream [my-data-stream] already exists"},"status":400}
2022-08-30 09:54:49 +0000 [warn]: #0 Remaining retry: 2. Retry to communicate after 256 second(s).

Steps to replicate

This is a race condition, so it might be hard to replicate.

  • have an Elasticsearch cluster
  • configure Fluentd to push logs to it via the data stream output plugin
  • ensure the data stream does not exist yet
  • let multiple Fluentd instances start simultaneously and ingest logs (preferably from a fixed input file or dummy data?)

Expected Behavior or What you need to ask

When creating a resource, the plugin should check whether a failure is a resource_already_exists_exception (HTTP status 400 is returned, but a 400 is not always a resource_already_exists_exception).
If it gets this exception, it should either assume the resource does already exist or, if we are being paranoid, query for the resource it was trying to create.
Ultimately, there should be no error log for a resource that was supposed to be created but already exists.
...
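
The behavior requested above could look roughly like the following. This is only a minimal sketch, not the plugin's actual code: `already_exists_error?` and `ensure_created` are hypothetical names, and the lambdas stand in for whatever client calls the plugin really makes. It classifies the response by the error type in the body rather than by the 400 status alone, and optionally performs the "paranoid" follow-up query.

```ruby
require "json"

# Hypothetical helper: true only when the response is a 400 whose error type
# is resource_already_exists_exception (other 400s are still real errors).
def already_exists_error?(status, body)
  return false unless status == 400

  parsed = JSON.parse(body.to_s)
  error = parsed.is_a?(Hash) ? parsed["error"] : nil
  error.is_a?(Hash) && error["type"] == "resource_already_exists_exception"
rescue JSON::ParserError
  # A body that is not JSON cannot be classified, so treat it as a real error.
  false
end

# Hypothetical wrapper. create_call is a lambda returning [status, body];
# exists_call (optional) is a lambda that queries for the resource.
def ensure_created(create_call, exists_call: nil)
  status, body = create_call.call
  return true if status.between?(200, 299)

  if already_exists_error?(status, body)
    # Either trust the error message, or verify when a check is supplied.
    return exists_call.nil? || exists_call.call
  end

  false
end
```

With this shape, a caller would log a warning and retry only when `ensure_created` returns false.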

Using Fluentd and ES plugin versions

  • OS version - Debian GNU/Linux 11 (bullseye)
  • Bare Metal or within Docker or Kubernetes or others? - Docker (v1.15-debian-elasticsearch7-1)
  • Fluentd v0.12 or v0.14/v1.0 - 1.15.2
    • root@7d41daf401df:/home/fluent# fluentd --version
      fluentd 1.15.2
  • ES plugin 3.x.y/2.x.y or 1.x.y - 5.2.3
root@7d41daf401df:/home/fluent# fluent-gem list

*** LOCAL GEMS ***

abbrev (default: 0.1.0)
addressable (2.8.1)
base64 (default: 0.1.1)
benchmark (default: 0.2.0)
bigdecimal (default: 3.1.1)
bundler (default: 2.3.7, 2.2.24, 2.2.15)
cgi (default: 0.3.1)
concurrent-ruby (1.1.10)
cool.io (1.7.1)
csv (default: 3.2.2)
date (default: 3.2.2)
delegate (default: 0.2.0)
did_you_mean (default: 1.6.1)
digest (default: 3.1.0)
domain_name (0.5.20190701)
drb (default: 2.1.0)
elasticsearch (7.17.1)
elasticsearch-api (7.17.1)
elasticsearch-transport (7.17.1)
elasticsearch-xpack (7.17.1)
english (default: 0.7.1)
erb (default: 2.2.3)
error_highlight (default: 0.3.0)
etc (default: 1.3.0)
excon (0.92.4)
faraday (1.10.2)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.4)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
fcntl (default: 1.0.1)
ffi (1.15.5)
ffi-compiler (1.0.1)
fiddle (default: 1.1.0)
fileutils (default: 1.6.0)
find (default: 0.1.1)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.5.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.14)
fluent-plugin-elasticsearch (5.2.3)
fluent-plugin-grafana-loki (1.2.18)
fluent-plugin-grok-parser (2.6.2)
fluent-plugin-json-in-json-2 (1.0.2)
fluent-plugin-kubernetes_metadata_filter (2.13.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-parser-cri (0.1.1)
fluent-plugin-prometheus (2.0.3)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-systemd (1.0.5)
fluentd (1.15.2)
forwardable (default: 1.3.2)
getoptlong (default: 0.1.1)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.5)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.8.0)
io-console (default: 0.5.11)
io-nonblock (default: 0.1.0)
io-wait (default: 0.2.1)
ipaddr (default: 1.2.4)
irb (default: 1.4.1)
json (default: 2.6.1)
jsonpath (1.1.2)
kubeclient (4.9.3)
logger (default: 1.5.0)
lru_redux (1.1.0)
mime-types (3.4.1)
mime-types-data (3.2022.0105)
msgpack (1.5.6)
multi_json (1.15.0)
multipart-post (2.2.3)
mutex_m (default: 0.1.1)
net-http (default: 0.2.0)
net-protocol (default: 0.1.2)
netrc (0.11.0)
nkf (default: 0.1.1)
observer (default: 0.1.1)
oj (3.13.21)
open-uri (default: 0.2.0)
open3 (default: 0.1.1)
openssl (default: 3.0.0)
optparse (default: 0.2.0)
ostruct (default: 0.5.2)
pathname (default: 0.2.0)
pp (default: 0.3.0)
prettyprint (default: 0.1.1)
prometheus-client (4.0.0)
pstore (default: 0.1.1)
psych (default: 4.0.3)
public_suffix (5.0.0)
racc (default: 1.6.0)
rake (13.0.6)
rdoc (default: 6.4.0)
readline (default: 0.0.3)
readline-ext (default: 0.1.4)
recursive-open-struct (1.1.3)
reline (default: 0.3.0)
resolv (default: 0.2.1)
resolv-replace (default: 0.1.0)
rest-client (2.1.0)
rexml (3.2.5)
rinda (default: 0.1.1)
ruby2_keywords (default: 0.0.5)
securerandom (default: 0.1.1)
serverengine (2.3.0)
set (default: 1.0.2)
shellwords (default: 0.1.0)
sigdump (0.2.4)
singleton (default: 0.1.1)
stringio (default: 3.0.1)
strptime (0.2.5)
strscan (default: 3.0.1)
syslog (default: 0.1.0)
systemd-journal (1.4.2)
tempfile (default: 0.1.2)
time (default: 0.2.0)
timeout (default: 0.2.0)
tmpdir (default: 0.1.2)
tsort (default: 0.1.0)
tzinfo (2.0.5)
tzinfo-data (1.2022.3)
un (default: 0.2.0)
unf (0.1.4)
unf_ext (0.0.8.2)
uri (default: 0.11.0)
weakref (default: 0.1.1)
webrick (1.7.0)
yajl-ruby (1.4.3)
yaml (default: 0.2.0)
zlib (default: 2.1.1)
@cosmo0920
Collaborator

  • let multiple fluentds start simultaneously to ingest logs (preferably fixed input file or dummy data?)

The plugin cannot prevent duplicate creation attempts for the same index name, so it cannot prevent this issue.

@cosmo0920
Collaborator

cosmo0920 commented Oct 24, 2022

A possible operational workaround:

Please try setting up an aggregator node that collects logs from the multiple Fluentd instances, and send logs to Elasticsearch only from the aggregator.

ref: https://docs.fluentd.org/deployment/high-availability

@applike-ss
Contributor Author

While this may indeed ensure that the problem does not happen, running multiple Fluentd instances cannot really be wrong.
This seems to happen because we have the same application running on multiple nodes, which results in the same data stream being used.
These applications send their logs via fluent-bit to our Fluentd instances, and if they resolve the DNS name to different IPs, this is what happens.
IMHO this is a bug in the plugin and not in our setup.

@cosmo0920
Collaborator

Again, this cannot be prevented by Fluentd plugins.

Fluentd does not know the state of Elasticsearch indices until it has sent an HTTP request to the Elasticsearch cluster and received the response.
These requests are not synchronized between Fluentd instances, so there is always the possibility of a race condition when checking whether the indices exist.
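
To illustrate the race described here, the following sketch replays one possible interleaving deterministically. `FakeCluster` is a hypothetical in-memory stand-in, not the Elasticsearch API, and the two "writers" stand in for two Fluentd instances: both check for the data stream before either creates it, so the second create fails even though each writer checked first.

```ruby
# Hypothetical in-memory stand-in for a cluster; not the Elasticsearch API.
class FakeCluster
  class AlreadyExists < StandardError; end

  def initialize
    @streams = {}
  end

  def exists?(name)
    @streams.key?(name)
  end

  def create(name)
    raise AlreadyExists, "data_stream [#{name}] already exists" if exists?(name)

    @streams[name] = true
  end
end

cluster = FakeCluster.new
name = "my-data-stream"

# Both writers run their existence check before either one creates.
a_sees = cluster.exists?(name)
b_sees = cluster.exists?(name)

outcomes = []
[["a", a_sees], ["b", b_sees]].each do |writer, seen|
  next if seen # a writer that saw the stream would skip creation

  begin
    cluster.create(name)
    outcomes << [writer, :created]
  rescue FakeCluster::AlreadyExists
    # The existence check was stale by the time this create ran.
    outcomes << [writer, :already_exists]
  end
end
```

In this interleaving the check cannot help, so the only place left to absorb the conflict is the handling of the failed create call.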

@applike-ss
Contributor Author

The point I am trying to make is: we shouldn't be logging an error for a resource that already exists. We also shouldn't return an error to our caller if the resource already exists (which is what the exception is telling us, even though it is a 400, because ES does not expect us to try to create the resource multiple times).

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

@cosmo0920
Collaborator

I don't see why you would be against implementing a check for "this error says the resource already exists, so let's assume our create call was successful".

I'm not very interested in implementing this corner case myself, but we would welcome a patch that fixes it.
