
too many open files #1099

Closed
gricuk opened this issue Sep 15, 2022 · 8 comments · Fixed by #1729
Assignee: OverOrion
Labels: buffer, bug (Something isn't working), fluentd, pinned
Milestone: 4.7

Comments


gricuk commented Sep 15, 2022

We're getting a "too many open files" error in a few clusters with image ghcr.io/banzaicloud/fluentd:v1.14.5-alpine-1 and higher.

It happens randomly and I can't pin down the exact conditions, but could you please check the following PR: fluent/fluentd#3844. It was fixed in the fluentd 1.15.2 release.

For now we have downgraded fluentd to ghcr.io/banzaicloud/fluentd:v1.14.4-alpine-2 and everything is okay with this version (but I can't be sure, because as mentioned above it happens randomly and not all clusters have this error).
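
For anyone who wants the same workaround, pinning the image in the Logging resource looks roughly like this (a minimal sketch; the name and controlNamespace are placeholders):

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: example-logging         # placeholder name
spec:
  controlNamespace: logging     # placeholder namespace
  fluentd:
    image:
      repository: ghcr.io/banzaicloud/fluentd
      tag: v1.14.4-alpine-2     # last tag that has not shown the error for us
  fluentbit: {}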

Environment details:

  • Kubernetes version: v1.21.12+rke2r2
  • Cloud-provider/provisioner (e.g. AKS, GKE, EKS, PKE etc): RKE2
  • logging-operator version (e.g. 2.1.1): logging-operator-3.17.9
  • Install method (e.g. helm or static manifests): helm

/kind bug

@gricuk gricuk added the bug Something isn't working label Sep 15, 2022

1it commented Oct 14, 2022

It could be caused by chunk_limit_size: 8MB, which is set here. That value makes no sense for a file buffer; it should be at least 256MB (the Fluentd default for file buffers), since 8MB is the default for in-memory buffers, not file-based ones.
Because of that I ended up with about 32k buffer files in one instance and constant "too many open files" errors.
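
Until the operator default changes, one workaround is to raise the limit explicitly in the buffer section of an Output. A minimal sketch (the Output name, namespace and Elasticsearch endpoint are placeholders, and the buffer field names are taken from the operator's Output buffer spec as I understand it, so verify them against your version):

apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: es-output              # placeholder name
  namespace: logging           # placeholder namespace
spec:
  elasticsearch:
    host: elasticsearch.logging.svc   # placeholder endpoint
    port: 9200
    buffer:
      type: file
      # Larger chunks mean far fewer buffer files on disk;
      # 256MB matches fluentd's own default for file buffers.
      chunk_limit_size: 256MB
      total_limit_size: 8GB    # assumed cap, size it to your buffer volume
      flush_interval: 10s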


stale bot commented Apr 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

@stale stale bot added the wontfix This will not be worked on label Apr 12, 2023
@pepov pepov removed the wontfix This will not be worked on label Apr 13, 2023
@pepov pepov added this to the 4.2 milestone Apr 13, 2023

pepov commented Apr 13, 2023

Thanks for reporting this, and sorry for the late reply. If I remember correctly, fluentd previously used 256M for memory buffers by default (or at least it did in some old versions), which caused OOM issues; that was the reason we did this:
#751

I believe we can remove this explicit default and rely on fluentd's own settings, once we can verify that our supported fluentd images now handle it correctly.

Scheduling this for the next minor release.

@pepov pepov added the buffer label Apr 27, 2023
@pepov pepov mentioned this issue Apr 27, 2023

stale bot commented Jun 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

@stale stale bot added the wontfix This will not be worked on label Jun 26, 2023
@pepov pepov removed the wontfix This will not be worked on label Jun 26, 2023

pepov commented Jun 27, 2023

According to the docs, chunk_limit_size is now smarter: it defaults to 8MB for memory buffers and 256MB for file buffers:
https://docs.fluentd.org/configuration/buffer-section#buffering-parameters

We can remove our default setting and rely on fluentd's own defaults.
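
Concretely, an Output's buffer section could then leave chunk_limit_size unset and only pin the buffer type; an illustrative fragment (field names assumed from the operator's buffer spec):

    buffer:
      type: file
      # chunk_limit_size intentionally omitted: fluentd falls back to its own
      # defaults, 8MB for memory buffers and 256MB for file buffers
      flush_interval: 10s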

@pepov pepov modified the milestones: 4.3, 4.x Jul 4, 2023

giakinh0823 commented Aug 26, 2023

Here is my issue. I have 2 nodes in k8s:

  • Node1 (master + worker): this node fails with "too many open files"
  • Node2 (worker): the daemonset runs successfully on this node

Please help me. Thank you.

Here are my settings:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: default-logging-simple
  namespace: logging
spec:
  fluentd:
    image:
      pullPolicy: IfNotPresent
      repository: rancher/banzaicloud-fluentd
      tag: v1.11.5-alpine-9
    configReloaderImage:
      repository: jimmidyson/configmap-reload 
      tag: latest  # if it errors, change this version
      pullPolicy: IfNotPresent
    security:
      roleBasedAccessControlCreate: true
    resources:
      requests:
        cpu: 10m
        memory: 100M
      limits:
        cpu: 1
        memory: 3Gi
    bufferStorageVolume:
      pvc:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 40Gi
          storageClassName: longhorn
          volumeMode: Filesystem
    scaling:
      drain:
        enabled: true
      replicas: 1

  fluentbit:
    image:
      pullPolicy: Always
      repository: fluent/fluent-bit
      tag: 2.1.8
    security:
      roleBasedAccessControlCreate: true
    resources:
      requests:
        cpu: 10m
        memory: 100M
      limits:
        cpu: 1
        memory: 2Gi
    positiondb:
      hostPath:
        path: ""
    bufferStorageVolume:
      hostPath:
        path: ""
  controlNamespace: logging
  watchNamespaces: ["production"]


@giakinh0823

Thanks, but I fixed it.

I created another DaemonSet that raises the inotify limits below on every node:

sysctl -w fs.inotify.max_queued_events=524288
sysctl -w fs.inotify.max_user_instances=16383
sysctl -w fs.inotify.max_user_watches=524288

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    name: es-ds
  name: es-ds
  namespace: logging
spec:
  selector:
    matchLabels:
      name: es-ds
  template:
    metadata:
      labels:
        name: es-ds
    spec:
      containers:
      - name: es-ds
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            # raise mmap and inotify limits on the host node
            sysctl -w vm.max_map_count=262144
            sysctl -w fs.inotify.max_queued_events=524288
            sysctl -w fs.inotify.max_user_instances=16383
            sysctl -w fs.inotify.max_user_watches=524288
            echo "done"


pepov commented Aug 28, 2023

@giakinh0823 These last two comments are unrelated to the topic of the original issue (fluentd rather than fluent-bit); could you please create a separate issue for them?

@pepov pepov added the pinned label Sep 7, 2023
@pepov pepov added the fluentd label Oct 9, 2023
@pepov pepov modified the milestones: 4.x, 4.6 Jan 9, 2024
@pepov pepov modified the milestones: 4.6, 4.7 Mar 28, 2024
@OverOrion OverOrion self-assigned this Apr 29, 2024