Describe the bug
We are up against a really strange and frustrating problem. I have no experience with Fluentd at all, so I will try to give as complete a picture as possible.
We have deployed Fluentd as a DaemonSet in a Kubernetes cluster. Fluentd is configured to gather logs from multiple sources (Docker daemon, network, etc.) and send them to a hosted AWS Elasticsearch.
Along with the logging mentioned above, we have in-app mechanisms that log directly to Fluentd through a separate @type forward source created only for these in-app logs, which are then forwarded through a match with @type elasticsearch.
The problem is that this in-app log flow creates a steady but slow memory leak on the node it runs on. The even stranger thing is that the leak is not happening in userspace application memory: both the apps' and Fluentd's process memory remain stable. What constantly increases is kernel memory, resulting in constantly decreasing available memory on the node until memory starvation problems begin. Note that I am referring to non-cache kernel memory that is not freed on demand. The applications are not that logging-heavy; maximum throughput should be around 10 log lines/sec from all of them combined.
This is not happening with any of the other log configurations in Fluentd, where Docker, system, and Kubernetes logs are scraped. If I turn off this in-app mechanism, there is no memory leak!
I have installed different monitoring tools on the server, trying to see whether some other metric's trend correlates with the memory decrease. The only metric I found that matches closely is IPv4 TCP memory usage, which makes some sense, since TCP is how the in-app logs are sent to Fluentd and it is also kernel-related. However, although the trend is similar, the actual amounts do not match: in the screenshots attached below, for the same time period, system memory decreases by around 700 MB while TCP memory usage increases by only 30 MB. The trend, however, is a complete match!
Any help with this problem would be really appreciated! Feel free to ask any extra information that you might need.
Below are the details of my configuration and setup.
To Reproduce
A simple pod running a Node.js app that sends logs directly to Fluentd using the fluent-logger npm package is enough to cause the memory problem.
Expected behavior
I expect the kernel memory to remain stable when usage is also stable, as is the case with the rest of the logging configuration.
Your Environment
Fluentd or td-agent version: 1.11.4
Operating system: Debian GNU/Linux 9 (stretch)
Kernel version: 4.9.0-14-amd64
Kubernetes version: v1.16.15
Your Configuration
The Fluentd DaemonSet is deployed using the latest chart version (v11.3.0) from https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/Chart.yaml
Since there is a lot of configuration, I will only include the relevant parts that create the problem. If all of it is needed, let me know and I will paste it in a pastebin or something similar.
<source>
@type forward
port 24226
bind 0.0.0.0
@label @CENTAUR
</source>
<filter **>
@type record_transformer
<record>
env staging
</record>
</filter>
<filter **>
@type record_transformer
<record>
fl.host "#{Socket.gethostname}"
</record>
</filter>
<filter **>
@type record_transformer
<record>
fl.cfgVer "#{ENV['CONFIG_VERSION']}"
</record>
</filter>
# Exclude own namespace logs
# Exclude centaur related logs since they are handled through different flow
<filter kubernetes.**>
@type grep
<exclude>
key tag
pattern /^centaur.*/
</exclude>
</filter>
<filter kubernetes.var.log.containers.fluentd-elasticsearch**>
@type grep
<exclude>
key tag
pattern /.*/
</exclude>
</filter>
<label @CENTAUR>
@include ../2-filter/fl-host.conf
@include ../2-filter/fl-version.conf
<match centaur.metrics.measurement>
@id centaur_metrics_measurement
@type elasticsearch
logstash_prefix "centaur_metrics_measurement"
logstash_dateformat "%Y.%m"
time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
time_key "centaur_timestamp"
@log_level info
host "<HIDDEN_HOST>"
port "80"
scheme "http"
include_tag_key true
logstash_format true
reload_on_failure false
reload_connections false
reconnect_on_error true
log_es_400_reason true
default_elasticsearch_version 7
validate_client_version true
# See detailed transporter log
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-see-detailed-failure-log
with_transporter_log true
# Prevent Request size exceeded error during fluent -> ES data flow
# read more: https://github.com/uken/fluent-plugin-elasticsearch/issues/588
bulk_message_request_threshold 8M
<buffer>
# read more about buffering parameters: https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
@type file
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 10s
retry_timeout 30m
retry_max_interval 30
chunk_full_threshold 0.8
chunk_limit_size 15M
total_limit_size 96M
overflow_action block
compress gzip
</buffer>
# @log_level debug
</match>
<match centaur.metrics.performance>
@id centaur_metrics_performance
@type elasticsearch
logstash_prefix "centaur_metrics_performance"
logstash_dateformat "%Y.%m"
time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
time_key "centaur_timestamp"
@log_level info
host "<HIDDEN_HOST>"
port "80"
scheme "http"
include_tag_key true
logstash_format true
# Prevent reloading connections to AWS ES
# read more: https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254793259
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
reload_on_failure false
reload_connections false
reconnect_on_error true
log_es_400_reason true
# If you know that the ES major version you are using is 7, you can set 7 here.
# read more: https://github.com/uken/fluent-plugin-elasticsearch#fluentd-seems-to-hang-if-it-unable-to-connect-elasticsearch-why
default_elasticsearch_version 7
# Check Elasticsearch instance for an incompatible version
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
validate_client_version true
# See detailed transporter log
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-see-detailed-failure-log
with_transporter_log true
# Prevent Request size exceeded error during fluent -> ES data flow
# read more: https://github.com/uken/fluent-plugin-elasticsearch/issues/588
bulk_message_request_threshold 8M
<buffer>
# read more about buffering parameters: https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
@type file
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 10s
retry_timeout 30m
retry_max_interval 30
chunk_full_threshold 0.8
chunk_limit_size 15M
total_limit_size 96M
overflow_action block
compress gzip
</buffer>
</match>
<match centaur.metrics.**>
@id centaur_metrics
@type elasticsearch
logstash_prefix "centaur_metrics_generic"
logstash_dateformat "%Y.%m"
time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
time_key "centaur_timestamp"
@log_level info
host "<HIDDEN_HOST>"
port "80"
scheme "http"
include_tag_key true
logstash_format true
# Prevent reloading connections to AWS ES
# read more: https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254793259
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
reload_on_failure false
reload_connections false
reconnect_on_error true
log_es_400_reason true
# If you know that the ES major version you are using is 7, you can set 7 here.
# read more: https://github.com/uken/fluent-plugin-elasticsearch#fluentd-seems-to-hang-if-it-unable-to-connect-elasticsearch-why
default_elasticsearch_version 7
# Check Elasticsearch instance for an incompatible version
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
validate_client_version true
# See detailed transporter log
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-see-detailed-failure-log
with_transporter_log true
# Prevent Request size exceeded error during fluent -> ES data flow
# read more: https://github.com/uken/fluent-plugin-elasticsearch/issues/588
bulk_message_request_threshold 8M
<buffer>
# read more about buffering parameters: https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
@type file
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 10s
retry_timeout 30m
retry_max_interval 30
chunk_full_threshold 0.8
chunk_limit_size 15M
total_limit_size 96M
overflow_action block
compress gzip
</buffer>
</match>
<match centaur.logs>
@id centaur_logs
@type elasticsearch
logstash_prefix "centaur_logs"
logstash_dateformat "%Y.%m.%d"
time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
time_key "centaur_timestamp"
pipeline centaur-pipeline
@log_level info
host "<HIDDEN_HOST>"
port "80"
scheme "http"
include_tag_key true
logstash_format true
# Prevent reloading connections to AWS ES
# read more: https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254793259
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
reload_on_failure false
reload_connections false
reconnect_on_error true
log_es_400_reason true
# If you know that the ES major version you are using is 7, you can set 7 here.
# read more: https://github.com/uken/fluent-plugin-elasticsearch#fluentd-seems-to-hang-if-it-unable-to-connect-elasticsearch-why
default_elasticsearch_version 7
# Check Elasticsearch instance for an incompatible version
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
validate_client_version true
# See detailed transporter log
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-see-detailed-failure-log
with_transporter_log true
# Prevent Request size exceeded error during fluent -> ES data flow
# read more: https://github.com/uken/fluent-plugin-elasticsearch/issues/588
bulk_message_request_threshold 8M
<buffer>
# read more about buffering parameters: https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
@type file
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 10s
retry_timeout 30m
retry_max_interval 30
chunk_full_threshold 0.8
chunk_limit_size 15M
total_limit_size 96M
overflow_action block
compress gzip
</buffer>
</match>
<match centaur.**>
@id centaur_catch_all
@type elasticsearch
logstash_prefix "centaur_logs"
logstash_dateformat "%Y.%m.%d"
pipeline centaur-pipeline
@log_level info
host "<HIDDEN_HOST>"
port "80"
scheme "http"
include_tag_key true
logstash_format true
# Prevent reloading connections to AWS ES
# read more: https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/issues/15#issuecomment-254793259
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
reload_on_failure false
reload_connections false
reconnect_on_error true
log_es_400_reason true
# If you know that the ES major version you are using is 7, you can set 7 here.
# read more: https://github.com/uken/fluent-plugin-elasticsearch#fluentd-seems-to-hang-if-it-unable-to-connect-elasticsearch-why
default_elasticsearch_version 7
# Check Elasticsearch instance for an incompatible version
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-send-events-to-elasticsearch
validate_client_version true
# See detailed transporter log
# read more: https://github.com/uken/fluent-plugin-elasticsearch#cannot-see-detailed-failure-log
with_transporter_log true
# Prevent Request size exceeded error during fluent -> ES data flow
# read more: https://github.com/uken/fluent-plugin-elasticsearch/issues/588
bulk_message_request_threshold 8M
<buffer>
# read more about buffering parameters: https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
@type file
flush_mode interval
retry_type exponential_backoff
flush_thread_count 2
flush_interval 10s
retry_timeout 30m
retry_max_interval 30
chunk_full_threshold 0.8
chunk_limit_size 15M
total_limit_size 96M
overflow_action block
compress gzip
</buffer>
</match>
</label>
Additional context
We did not dig into Fluentd's extra metrics, since the service itself operated normally without any problematic behaviour. The problems shown were observed in the Node.js client apps.
What seems to be related is this issue found in Node.js and fixed in recent versions: nodejs/node#36650
After upgrading, the client behaviour seems to be normal again.