
Allow multi-threading for netflow #15027

Open

SirBreadc opened this issue Mar 20, 2024 · 4 comments
Assignees: srebhan
Labels: feature request (Requests for new plugin and for new features to existing plugins)

Comments

@SirBreadc

Please direct all support questions to Slack or the forums. Thank you.

I am currently running the following telegraf configuration:

[agent]
  debug = false
  quiet = true
  ## Default data collection interval for all inputs
  #interval = "2s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  #round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 30000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 2000000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  #collection_jitter = "100ns"

  ## Collection offset is used to shift the collection by the given amount.
  ## This can be used to avoid many plugins querying constrained devices
  ## at the same time by manually scheduling them in time.
  # collection_offset = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  #flush_interval = "1s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  #flush_jitter = "0s"

  ## Collected metrics are rounded to the precision specified. Precision is
  ## specified as an interval with an integer + unit (e.g. 0s, 10ms, 2us, 4s).
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  ##
  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s:
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ##
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  precision = "0s"

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0h"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = "${HOST_HOSTNAME}"
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

# Netflow v5, Netflow v9 and IPFIX collector
[[inputs.netflow]]
  ## Address to listen for netflow, ipfix or sflow packets.
  ##   example: service_address = "udp://:2055"
  ##            service_address = "udp4://:2055"
  ##            service_address = "udp6://:2055"
  service_address = "udp4://:2055"
  ## Set the size of the operating system's receive buffer.
  ##   example: read_buffer_size = "64KiB"
  ## Uses the system's default if not set.
  # read_buffer_size = ""

  ## Protocol version to use for decoding.
  ## Available options are
  ##   "ipfix"      -- IPFIX / Netflow v10 protocol (also works for Netflow v9)
  ##   "netflow v5" -- Netflow v5 protocol
  ##   "netflow v9" -- Netflow v9 protocol (also works for IPFIX)
  ##   "sflow v5"   -- sFlow v5 protocol
  protocol = "ipfix"

  ## Private Enterprise Numbers (PEN) mappings for decoding
  ## This option allows to specify vendor-specific mapping files to use during
  ## decoding.
  private_enterprise_number_files = ["conf/custom_fields.csv"]

  ## Dump incoming packets to the log
  ## This can be helpful to debug parsing issues. Only active if
  ## Telegraf is in debug mode.
  dump_packets = false

# Configuration for sending metrics to InfluxDB 2.0
[[outputs.influxdb_v2]]
  urls = ["https://<domain>:8086"]

  ## Token for authentication.
  token = "{INFLUXADMINTOKEN}"

  ## Organization is the name of the organization you wish to write to.
  organization = "netflow"
  namepass = ["netflow"]

  ## Destination bucket to write into.
  bucket = "netflow-rt-24h-all"
  content_encoding = "gzip"
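For reference on the private_enterprise_number_files option above: the mapping files are plain CSV, one vendor-specific field per line, with the first column holding the enterprise number and element ID separated by a dot, followed by the field name and the decoding type (see the plugin's README for the authoritative format). A minimal sketch of what conf/custom_fields.csv might contain; the field names below are made up for illustration:

# PEN.element-ID, field name, decoding type
35632.1,example_vendor_field,string
35632.2,example_vendor_counter,uint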

I have a CentOS 7 VM spun up with 8 CPU cores and 32 GB RAM, and another VM with 16 CPU cores and 64 GB RAM. On each VM I am running an NGINX container which load balances my netflow across 16 different Telegraf nodes.

NGINX config:

events {}

stream {
    upstream telegraf {
        server 172.18.0.8:2055 max_fails=10;
        server 172.18.0.9:2055 max_fails=10;
        server 172.18.0.10:2055 max_fails=10;
        server 172.18.0.11:2055 max_fails=10;
        server 172.18.0.12:2055 max_fails=10;
        server 172.18.0.13:2055 max_fails=10;
        server 172.18.0.14:2055 max_fails=10;
        server 172.18.0.15:2055 max_fails=10;
        server 172.18.0.16:2055 max_fails=10;
        server 172.18.0.17:2055 max_fails=10;
        server 172.18.0.18:2055 max_fails=10;
        server 172.18.0.19:2055 max_fails=10;
        server 172.18.0.20:2055 max_fails=10;
        server 172.18.0.21:2055 max_fails=10;
        server 172.18.0.22:2055 max_fails=10;
        server 172.18.0.23:2055 max_fails=10;
    }

    server {
        listen 2055 udp;
        proxy_bind $remote_addr transparent;
        proxy_pass telegraf;
    }
}

I am noticing that each Telegraf container can only take about 25K flows per second, but I have some devices sending 45K flows per second. Has anyone else had a similar issue, or does anyone know how I should spec my Telegraf nodes? At the moment I am thinking I need a better load balancer to distribute those flows, as round robin doesn't seem to be working very well for NGINX.

The NetFlow types being ingested are IPFIX, v9 and NSEL.
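One approach that may help with the distribution problem above (an untested sketch, not a verified fix): the NGINX stream module's hash directive can pin each exporter to a single backend based on its source address, instead of spreading packets round-robin. For IPFIX and Netflow v9 this has the side benefit of keeping template packets and the data records that depend on them on the same collector. Only the upstream block changes:

stream {
    upstream telegraf {
        hash $remote_addr consistent;
        server 172.18.0.8:2055 max_fails=10;
        # ... remaining backends unchanged ...
    }
}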

@SirBreadc SirBreadc added the support Telegraf questions, may be directed to community site or slack label Mar 20, 2024
@telegraf-tiger
Contributor

Hello! I recommend posting this question in our Community Slack or Community Forums, we have a lot of talented community members there who could help answer your question more quickly. You can also learn more about Telegraf by enrolling at InfluxDB University for free!

Heads up, this issue will be automatically closed after 7 days of inactivity. Thank you!

@srebhan
Contributor

srebhan commented Mar 20, 2024

@SirBreadc currently, the handling of incoming packets is single-threaded in the netflow plugin, so effectively you are running 16 processing threads in parallel on the Telegraf side. That boils down to a processing time of approximately 1.5 ms per packet, which sounds about right...
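For illustration, here is a minimal sketch of the fan-out pattern such a change could use; this is hypothetical example code, not the plugin's actual implementation. A single UDP reader hands datagrams to a pool of decoder goroutines:

package main

// Hypothetical sketch, not the netflow plugin's actual code: a single
// UDP reader fans incoming datagrams out to a pool of decoder
// goroutines over a channel, so decoding can use all CPU cores.

import (
	"log"
	"net"
	"runtime"
	"sync"
)

func decode(buf []byte) {
	// Protocol decoding and metric emission would happen here. For
	// IPFIX / Netflow v9 the template cache is shared state, so the
	// real plugin would also need synchronised template access.
	_ = buf
}

func main() {
	conn, err := net.ListenPacket("udp4", ":2055")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	packets := make(chan []byte, 1024)
	var wg sync.WaitGroup

	// One decoder worker per CPU core.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for buf := range packets {
				decode(buf)
			}
		}()
	}

	// Single reader: receive datagrams and hand them to the workers.
	for {
		buf := make([]byte, 64*1024)
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Println(err)
			break
		}
		packets <- buf[:n]
	}
	close(packets)
	wg.Wait()
}

One reason this is not a trivial change in the real plugin: for IPFIX and Netflow v9 the template cache is shared between packets, so parallel decoders would need synchronised access to it.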

@powersj powersj added the waiting for response waiting for response from contributor label Mar 20, 2024
@SirBreadc
Author

@srebhan ok cool, so I guess in this case I just need to enable smarter load balancing to try and keep each Telegraf node at 20K flows per second, as I've just hit the upper limit. Is there any work or feature planned to make the netflow plugin multi-threaded? I feel like others might hit a similar issue, as NetFlow from a device with a 10G/40G link sending with a 1:1 sample rate will definitely send more than 20K flows per second. It would be useful if Influx could handle this instead of having to implement our own smart load balancing. (Note: load balancing UDP seems to be very limited with NGINX... or other tools I've been looking at.)
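(One possible stop-gap while the plugin is single-threaded, sketched under the assumption that the load balancer can target multiple ports: Telegraf allows multiple instances of the same input plugin, so several netflow listeners can run in one process, each with its own port and its own processing path. Untested sketch:)

[[inputs.netflow]]
  service_address = "udp4://:2055"
  protocol = "ipfix"

[[inputs.netflow]]
  service_address = "udp4://:2056"
  protocol = "ipfix"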

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 20, 2024
@srebhan
Contributor

srebhan commented Mar 21, 2024

@SirBreadc I can take a look at multi-threading, but it might take a while...

@srebhan srebhan self-assigned this Mar 21, 2024
@srebhan srebhan added feature request Requests for new plugin and for new features to existing plugins and removed support Telegraf questions, may be directed to community site or slack labels Mar 21, 2024
@srebhan srebhan changed the title Netflow Max Flows per second Allow multi-threading for netflow Mar 21, 2024