
Multiple issues with Fluent Bit - SIGSEGV, error deleting file at close, fail to drop enough chunks in order to place new data #8823

Open
mirko-lazarevic opened this issue May 14, 2024 · 1 comment

mirko-lazarevic commented May 14, 2024

Bug Report

Describe the bug

We are experiencing issues with our Fluent Bit deployments in a Kubernetes cluster: on clusters of 6 to 8 nodes, 1 or 2 pods enter a CrashLoopBackOff state.

To Reproduce

  • The issue is not easy to reproduce
  • Example log messages:

(SIGSEGV)

Fluent Bit v3.0.2
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/02 08:53:20] [ info] [fluent bit] version=3.0.2, commit=, pid=1
[2024/05/02 08:53:20] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/05/02 08:53:20] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/05/02 08:53:20] [ info] [cmetrics] version=0.7.3
[2024/05/02 08:53:20] [ info] [ctraces ] version=0.4.0
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] initializing
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] storage_strategy='filesystem' (memory + filesystem)
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] initializing
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] storage_strategy='memory' (memory only)
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 95.4M
[2024/05/02 08:53:20] [ info] [input:emitter:emitter_for_rewrite_tag.0] initializing
[2024/05/02 08:53:20] [ info] [input:emitter:emitter_for_rewrite_tag.0] storage_strategy='memory' (memory only)
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] https=1 host=kubernetes.default.svc port=443
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] local POD info OK
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] testing connectivity with Kubelet...
[2024/05/02 08:53:21] [ info] [filter:kubernetes:kubernetes.2] connectivity OK
[2024/05/02 08:53:22] [ info] [http_server] listen iface=0.0.0.0 tcp_port=8081
[2024/05/02 08:53:22] [ info] [sp] stream processor started
[2024/05/02 08:53:22] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/1-1714639982.798393729.flb
.
.
.
.
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:18] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:56:18] [ info] [input] pausing tail.0
[2024/05/02 08:56:18] [ info] [input] resume tail.0
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:18] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:56:18] [ info] [input] pausing tail.0
[2024/05/02 08:56:18] [ info] [input] resume tail.0
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inode=262825 link_rotated: /var/log/containers/xxxxxx-9a40478e6859b88a9ba7db65755da522de6a03465eab2f88e72ca0ec1456d930.log
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inode=1972490 handle rotation(): /var/log/containers/xxxxxxxxx-9a40478e6859b88a9ba7db65755da522de6a03465eab2f88e72ca0ec1456d930.log => /var/data/kubeletlogs/5018a8564e8120570150b0764d8d194d_c-58a88ecb-c183-4d86-a4d3-e760f2377aa6-cronjob-28577336-r84gx_359216b1-1c12-4b4d-9bca-7a21df8f70af/c-58a88ecb-c183-4d86-a4d3-e760f2377aa6-cronjob/0.log (deleted)
.
.
.
.
.
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972499 watch_fd=731
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972490 watch_fd=728
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972332 watch_fd=732
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972628 watch_fd=733
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972392 watch_fd=729
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972613 watch_fd=730
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:56:48] [error] [storage] [cio file] error deleting file at close tail.0:1-1714640207.838891586.flb
[2024/05/02 08:56:54] [engine] caught signal (SIGSEGV)

(Fail to drop enough chunks in order to place new data)

Fluent Bit v3.0.2
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/02 08:50:43] [ info] [fluent bit] version=3.0.2, commit=, pid=1
[2024/05/02 08:50:43] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/05/02 08:50:43] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/05/02 08:50:43] [ info] [cmetrics] version=0.7.3
[2024/05/02 08:50:43] [ info] [ctraces ] version=0.4.0
[2024/05/02 08:50:43] [ info] [input:tail:tail.0] initializing
[2024/05/02 08:50:43] [ info] [input:tail:tail.0] storage_strategy='filesystem' (memory + filesystem)
[2024/05/02 08:50:44] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] initializing
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] storage_strategy='memory' (memory only)
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 95.4M
[2024/05/02 08:50:44] [ info] [input:emitter:emitter_for_rewrite_tag.0] initializing
[2024/05/02 08:50:44] [ info] [input:emitter:emitter_for_rewrite_tag.0] storage_strategy='memory' (memory only)
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] https=1 host=kubernetes.default.svc port=443
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] local POD info OK
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] testing connectivity with Kubelet...
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] connectivity OK
[2024/05/02 08:50:45] [ info] [http_server] listen iface=0.0.0.0 tcp_port=8081
[2024/05/02 08:50:45] [ info] [sp] stream processor started
[2024/05/02 08:50:45] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/1-1714639842.499964038.flb
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1972446 watch_fd=1 name=/var/log/containers/xxxxxxxxxx-d9ae54902f2f65f8d214e2d68a5e2b3336062fb30ef411e05da8c6f1969980a2.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1972463 watch_fd=2 name=/var/log/containers/xxxxxxxxxx-445b3a6bb43b9b3ed3b836977091aaaa425cfc145581d8cb095208e767b9907b.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1860937 watch_fd=3 name=/var/log/containers/xxxxxxxxxx-b80c8b4cde44b2217a4b2814b3250a7545af53b3e9145fa27ff9e6f8f2a637b3.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1860893 watch_fd=4 name=/var/log/containers/xxxxxxxxxx-33a9979210e4c5ee454482a74429b50e129fea1872ac013185f233939319902f.log
.
.
.
.
[2024/05/02 08:51:41] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1966324 watch_fd=724 name=/var/log/containers/xxxxxxxx-48c64a8d9e6c1a24ff5abae5616eda485f4b8580073be41a41e6506e06d0a00a.log
[2024/05/02 08:51:42] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:42] [ info] [input] pausing tail.0
[2024/05/02 08:51:42] [ info] [input] resume tail.0
[2024/05/02 08:51:42] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:42] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:42] [ info] [input] pausing tail.0
[2024/05/02 08:51:42] [ info] [input] resume tail.0
[2024/05/02 08:51:42] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:43] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:43] [ info] [input] pausing tail.0
[2024/05/02 08:51:43] [ info] [input] resume tail.0
[2024/05/02 08:51:43] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:43] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:43] [ info] [input] pausing tail.0
[2024/05/02 08:51:43] [ info] [input] resume tail.0
.
.
.
.
[2024/05/02 08:52:33] [ info] [input:tail:tail.0] inode=262829 link_rotated: /var/log/containers/xxxxxxxxxxxx-79289f1aefefd7d290c024a47554831c0ab01c4975e47f21859ae1f2347efec6.log
[2024/05/02 08:52:33] [ info] [input:tail:tail.0] inode=1972760 handle rotation(): /var/log/containers/xxxxxxxxxxxx-79289f1aefefd7d290c024a47554831c0ab01c4975e47f21859ae1f2347efec6.log => /var/data/kubeletlogs/18d9f87e0d6a4a9d9d9c0966dd7b504d_c-46c15c32-872e-4c5f-b45b-f35e4d4e9c77-cronjob-28577332-624ml_c506bd27-ae63-44f2-8576-c8e1f29c1dc6/c-46c15c32-872e-4c5f-b45b-f35e4d4e9c77-cronjob/0.log (deleted)
.
.
.
.
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972760 watch_fd=725
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972736 watch_fd=726
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972789 watch_fd=727
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972759 watch_fd=728
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972678 watch_fd=729
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972793 watch_fd=730
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:52:42] [error] [storage] [cio file] error deleting file at close tail.0:1-1714639961.986152069.flb
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:52:58] [error] [storage] [cio file] error deleting file at close tail.0:1-1714639977.968069594.flb
[2024/05/02 08:53:02] [error] [input chunk] fail to drop enough chunks in order to place new data

Expected behavior

  • Fluent Bit should not crash with a segmentation fault (SIGSEGV)
  • Fluent Bit should not log "No such file or directory" errors when deleting chunk files at close
  • Fluent Bit should not fail with "fail to drop enough chunks in order to place new data"

Your Environment

  • Version used: Fluent Bit v3.0.2
  • Configuration:
[SERVICE]
      Flush                   1
      Log_Level               info
      Daemon                  off
      Parsers_File            parsers.conf
      Plugins_File            plugins.conf
      HTTP_Server             On
      HTTP_Listen             0.0.0.0
      HTTP_Port               8081
      Health_Check            On
      HC_Errors_Count         1
      HC_Retry_Failure_Count  1
      HC_Period               30
      storage.path            /fluent-bit/cache
      storage.max_chunks_up   192
      storage.metrics         On

[INPUT]
      Name                              tail
      Tag                               kube.*
      Path                              /var/log/containers/logger-heartbeat-*.log
      Path_Key                          file
      DB                                /var/log/fluent-bit/fluent-bit.DB
      Parser                            cri
      Skip_Long_Lines                   On
      Refresh_Interval                  10
      storage.type                      filesystem
      storage.pause_on_chunks_overlimit on

[FILTER]
        Name                  rewrite_tag
        Match_Regex           <some-regex>
        Emitter_Mem_Buf_Limit 384MB
        Rule                  $message <some-regex> new.tag true

[FILTER]
        Name          grep
        Match_Regex   <some-regex>
        Exclude       $message <some-regex>

[FILTER]
        Name                kubernetes
        Match               kube.*
        Buffer_Size         10MB
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Kubelet_Host        ${HOST_IP}

[FILTER]
        Name record_modifier
        Match *
        Record meta.cluster_name my-cluster
        Record node_name my-node

[FILTER]
        Name nest
        Match *
        Operation nest
        Wildcard meta.*
        Nest_under meta
        Remove_prefix meta.

[FILTER]
Name          lua
        Match_Regex   <some-regex>
        call          removeSTFields
        script        /fluent-bit/bin/script.lua

[FILTER]
Name    lua
        Match   new.tag.*
        call    removeUnwantedLogFields
        script  /fluent-bit/bin/script.lua

[OUTPUT]
        Name logger-agent
        Id logger-agent-output-1
        Match kube.*
        Retry_Limit 8

        # Connection
        Target_Host localhost
        Target_Port 8080
        Target_Path /v1/logs/ws

        # Authentication
        Authentication_Mode APIKey

        # Buffer storage
        storage.total_limit_size 5G

[OUTPUT]
        Name              logger-agent
        Id                logger-agent-output-2
        Match             new.tag.*
        Retry_Limit       8

        # Connection
        Target_Host localhost
        Target_Port 8080
        Target_Path /v1/logs/service

        # Authentication
        Authentication_Mode APIKey

        # Buffer storage
        storage.total_limit_size 5G
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.28
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: See config above.

Additional context

  • We observed the issue in an environment with high log volume (185 pods, each with 2 containers), where the Fluent Bit instance crashed after 3-5 minutes.
  • We observed the issue in an environment with low log volume, where the Fluent Bit instance crashed after approximately a day.
  • The output plugins are custom plugins written in Go; they send records via WebSocket to a configured endpoint.
mirko-lazarevic (Author) commented:

Does anyone have any insights into what might be going wrong here? I've noticed that Fluent Bit suddenly exits when it encounters the error [input chunk] fail to drop enough chunks in order to place new data. Is this behavior intentional? Shouldn't Fluent Bit shut down gracefully and release resources properly?

...
if (count != 0) {
    flb_error("[input chunk] fail to drop enough chunks in order to place new data");
    exit(0);
}
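
For illustration, here is a minimal sketch of what a non-fatal failure path could look like instead of the hard exit(0). The flb_engine_exit() call, the config argument, and the wrapper function are my assumptions for the sake of discussion, not the actual upstream code:

/* Hypothetical sketch, for discussion only: report the failure upward
 * and request a clean shutdown rather than terminating the process.
 * flb_engine_exit() and this function signature are assumed names. */
static int place_new_data_or_fail(struct flb_config *config, int count)
{
    if (count != 0) {
        flb_error("[input chunk] fail to drop enough chunks in order to place new data");

        /* Ask the engine to stop cleanly (releasing chunk files and
         * other resources) instead of calling exit(0) directly. */
        flb_engine_exit(config);

        /* Propagate the error so the caller can pause the input or
         * retry, rather than losing the whole process. */
        return -1;
    }
    return 0;
}

Something along those lines would at least let Fluent Bit release its chunk files and shut down through the normal engine path.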

cc: @agup006, @ScarletTanager
