
feat: Ingestion TaskQueue #3136

Closed · wants to merge 73 commits

Conversation

@taimingl (Collaborator) commented Apr 2, 2024

Implements #3099.

Adds a TaskQueue that serves as an IngestBuffer to improve the responsiveness of the ingestion endpoints when ingestion requests spike.

Basic flow:

Producers: the ingestion endpoint handlers (/_json or /_bulk). Each handler

  1. constructs a Task from the request payload and buffers it into the TaskQueue's channel, then
  2. responds to the client and listens for upcoming requests.

The call chain (a code sketch follows the tree):

"/_json"
└── pub async fn send_task(task: IngestEntry)
    └── TaskQueue::send_task(&self, task: IngestEntry)
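A minimal sketch of this producer path, assuming the `async-channel` crate; only the names come from the tree above, while `IngestEntry`'s fields and the function bodies are illustrative assumptions:

```rust
use std::sync::Arc;

use async_channel::{SendError, Sender};

// Placeholder for the PR's buffered-request type; the real fields are in the PR.
#[derive(Clone)]
pub struct IngestEntry {
    pub payload: Vec<u8>,
}

pub struct TaskQueue {
    sender: Arc<Sender<IngestEntry>>,
}

impl TaskQueue {
    // Buffer the task into the bounded channel. The await only blocks
    // (applies backpressure) when the channel is at capacity; otherwise
    // the handler can respond to the client immediately.
    pub async fn send_task(&self, task: IngestEntry) -> Result<(), SendError<IngestEntry>> {
        self.sender.send(task).await
    }
}

// Hypothetical "/_json" handler body: construct the task from the request
// payload, buffer it, and return without waiting for ingestion to finish.
pub async fn send_task(queue: &TaskQueue, payload: Vec<u8>) -> Result<(), SendError<IngestEntry>> {
    queue.send_task(IngestEntry { payload }).await
}
```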

Consumers: async workers managed by the TaskQueue. Each worker

  1. takes Tasks out of the TaskQueue's channel in batches,
  2. persists the received tasks to disk,
  3. processes (ingests) each task, and
  4. updates the persisted disk file.

The call chain (a code sketch follows the tree):

Workers::new(count: usize, receiver: Arc<Receiver<IngestEntry>>)
└── fn init_worker(receiver: Arc<Receiver<IngestEntry>>)
    ├── async fn process_job(worker_id: String, receiver: Arc<Receiver<IngestEntry>>, store_sig_s: Sender<Option<Vec<IngestEntry>>>)
    │   └── async fn process_tasks(worker_id: &str, tasks: &Vec<IngestEntry>)
    └── async fn persist_job(worker_id: String, store_sig_r: Receiver<Option<Vec<IngestEntry>>>)
        └── fn persist_job_inner(path: &PathBuf, tasks: Option<Vec<IngestEntry>>)
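A sketch of the consumer loop, reusing the placeholder `IngestEntry` from the producer sketch; the constant values, error handling, and persist-signal protocol are assumptions that mirror steps 1-4 above:

```rust
use std::sync::Arc;
use std::time::Duration;

use async_channel::{Receiver, Sender, TryRecvError};

const WORKER_BATCH_PROCESSING_SIZE: usize = 100; // assumed value
const WORKER_DEFAULT_WAIT_TIME: u64 = 1; // seconds, assumed value

async fn process_job(
    worker_id: String,
    receiver: Arc<Receiver<IngestEntry>>,
    store_sig_s: Sender<Option<Vec<IngestEntry>>>,
) {
    loop {
        // 1. take tasks out of the channel in a batch
        let mut batch = Vec::with_capacity(WORKER_BATCH_PROCESSING_SIZE);
        while batch.len() < WORKER_BATCH_PROCESSING_SIZE {
            match receiver.try_recv() {
                Ok(task) => batch.push(task),
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Closed) => return, // queue shut down
            }
        }
        if batch.is_empty() {
            tokio::time::sleep(Duration::from_secs(WORKER_DEFAULT_WAIT_TIME)).await;
            continue;
        }
        // 2. hand the batch to the persist job so it reaches disk first
        let _ = store_sig_s.send(Some(batch.clone())).await;
        // 3. process (ingest) each task
        process_tasks(&worker_id, &batch).await;
        // 4. tell the persist job to clear the now-ingested batch
        let _ = store_sig_s.send(None).await;
    }
}

// Stub for the actual ingestion logic.
async fn process_tasks(worker_id: &str, tasks: &[IngestEntry]) {
    println!("{worker_id}: ingesting {} tasks", tasks.len());
}
```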

Components

TaskQueue: an MPMC queue over a single bounded channel, implemented with the `async-channel` crate

struct TaskQueue {
    sender: Arc<Sender<IngestEntry>>,
    workers: Arc<Workers>,
    current_size: usize,
}

Main methods (the worker-scaling check is sketched after the configuration list):

async fn send_task(&self, task: IngestEntry)
async fn should_add_more_workers(&self)

Configured by:

  1. DEFAULT_CHANNEL_CAP: bounds the channel's capacity
  2. DEFAULT_WORKER_CNT: default initial & incremental worker count
  3. SEARCHABLE_LATENCY: acceptable latency between when a request is accepted and when it becomes searchable
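Extending the `TaskQueue` sketch above, `should_add_more_workers` might be a capacity heuristic like the following; the constant values and the half-full threshold are assumptions, not the PR's code (`async-channel` does expose `Sender::len()`):

```rust
const DEFAULT_CHANNEL_CAP: usize = 4096; // assumed value
const DEFAULT_WORKER_CNT: usize = 3; // assumed value

impl TaskQueue {
    // Hypothetical heuristic: a channel that is more than half full means
    // the current workers are falling behind, so the caller should spawn
    // another increment of DEFAULT_WORKER_CNT workers.
    async fn should_add_more_workers(&self) -> bool {
        self.sender.len() > DEFAULT_CHANNEL_CAP / 2
    }
}
```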

Workers: consumer side of the TaskQueue

pub(super) struct Workers {
    pub receiver: Arc<Receiver<IngestEntry>>,
    pub handles: RwVec<Worker>,
}

Main methods:

async fn process_job(...)
async fn process_tasks(...)

Configured by:

  1. WORKER_DEFAULT_WAIT_TIME: default wait time between pulls from the channel
  2. WORKER_BATCH_PROCESSING_SIZE: max number of requests a worker processes in one batch
  3. WORKER_MAX_IDLE_TIME: max idle time in seconds before a worker shuts itself down (sketched below)
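The idle shutdown could be a check like this inside the worker loop; the constant's value is an assumption:

```rust
use std::time::{Duration, Instant};

const WORKER_MAX_IDLE_TIME: u64 = 600; // seconds, assumed value

// Inside the worker loop: remember when the last batch arrived and exit
// once the idle budget is spent, so workers added during a spike drain
// away once traffic subsides.
fn worker_should_exit(last_task_at: Instant) -> bool {
    last_task_at.elapsed() > Duration::from_secs(WORKER_MAX_IDLE_TIME)
}
```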

Other utility functions (a replay sketch in code follows the tree):

async fn replay_persisted_tasks()
├── fn decode_from_wal_file(wal_file: &PathBuf)
└── async fn process_tasks(worker_id: &str, tasks: &Vec<IngestEntry>)
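A sketch of the replay path, reusing the earlier placeholders; the directory argument, file handling, and encoding are assumptions (per the commits below, the real implementation also decompresses), so `decode_from_wal_file` is a stub:

```rust
use std::fs;
use std::path::PathBuf;

// On startup, walk the persisted WAL files, decode the buffered entries,
// and run them through the same processing path as live tasks.
async fn replay_persisted_tasks(dir: &PathBuf) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let wal_file = entry?.path();
        let tasks = decode_from_wal_file(&wal_file)?;
        process_tasks("replay-worker", &tasks).await;
        fs::remove_file(&wal_file)?; // replayed; the file is no longer needed
    }
    Ok(())
}

// Stub: the real implementation decompresses and deserializes here.
fn decode_from_wal_file(wal_file: &PathBuf) -> std::io::Result<Vec<IngestEntry>> {
    let bytes = fs::read(wal_file)?;
    Ok(vec![IngestEntry { payload: bytes }])
}
```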

taimingl added 20 commits May 5, 2024 14:59
- Memory: add `ZO_INGEST_BUFFER_QUEUE_CNT` to configure max_worker_count, which limits total memory usage
- Latency: track a running average of request size, used to estimate wait time and keep latency within the configured constraint
Add the current time as the timestamp for buffered ingestion requests. This value is used when a request without a timestamp in its payload is ingested, and it also helps preserve the order of the logs.
Compress serialized data when persisting requests to the WAL on disk, and decompress it when reading back.
@taimingl (Collaborator, Author) commented:
/ok-to-test sha=578578a
