Replication (snapshot and live) from one ElastiCache Redis to another is failing #107

Open
nfayzullaev opened this issue Apr 21, 2023 · 2 comments


@nfayzullaev

Hello,

I'm running the following command to replicate from one ElastiCache cluster to another:
JAVA_OPTS="-Xmx64g" ./riot-redis -i --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 1

Source cluster INFO output:

# Server
redis_version:3.2.4
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
gcc_version:0.0.0
process_id:1
run_id:8cd8a940be86564d937725bc7dc967c9e51a94d7
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
tcp_port:6379
uptime_in_seconds:39663067
uptime_in_days:459
hz:10
lru_clock:4366755
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:1463228
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:623807091890
repl_sync_enabled:1
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:17899.54
used_cpu_user:49092.53
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=63463817,expires=33322835,avg_ttl=0
db5:keys=45502,expires=1812,avg_ttl=0

Target cluster INFO output:

# Server
redis_version:6.2.6
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:0.0.0
process_id:1
run_id:1317126a7e9e3af4e759006a6aaafc43fa0913f5
tcp_port:6379
server_time_usec:1682088327009544
uptime_in_seconds:217224
uptime_in_days:2
hz:10
configured_hz:10
lru_clock:4366726
executable:-
config_file:-

# Clients
connected_clients:7
cluster_connections:0
maxclients:65000
client_recent_max_input_buffer:32
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:5273747832
used_memory_human:4.91G
used_memory_rss:5364830208
used_memory_rss_human:5.00G
used_memory_peak:5387511200
used_memory_peak_human:5.02G
used_memory_peak_perc:97.89%
used_memory_overhead:1478635424
used_memory_startup:9306856
used_memory_dataset:3795112408
used_memory_dataset_perc:72.09%
allocator_allocated:5278263360
allocator_active:5279645696
allocator_resident:5363998720
used_memory_lua:33792
used_memory_lua_human:33.00K
master_repl_offset:240700495402
second_repl_offset:157087304785
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:240699446827
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:18637.736549
used_cpu_user:27639.798232
used_cpu_sys_children:2.001466
used_cpu_user_children:13.251697
used_cpu_sys_main_thread:252.775360
used_cpu_user_main_thread:14353.886320

# Modules

# Errorstats
errorstat_ERR:count=2387

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=20264055,expires=10620831,avg_ttl=1481086241362

The process gets stuck at 31% and never completes.
A typical message is as follows:

14:35:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264349/63458641 (11:16:18 / 14:36:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264350/63458641 (11:17:18 / 14:37:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264351/63458641 (11:18:18 / 14:38:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264352/63458641 (11:19:18 / 14:39:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264353/63458641 (11:20:18 / 14:40:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264354/63458641 (11:21:18 / 14:41:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264355/63458641 (11:22:18 / 14:42:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264356/63458641 (11:23:18 / 14:43:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264357/63458641 (11:24:18 / 14:44:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264358/63458641 (11:25:18 / 24:20:45) 492.8/s
14:45:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264359/63458641 (11:26:18 / 24:22:53) 492.1/s
14:46:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264360/63458641 (11:27:18 / 24:25:01) 491.4/s
14:47:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264361/63458641 (11:28:18 / 24:27:09) 490.7/s
14:48:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264362/63458641 (11:29:18 / 24:29:17) 490.0/s
14:49:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264363/63458641 (11:30:18 / 24:31:25) 489.3/s
14:50:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264364/63458641 (11:31:18 / 24:33:33) 488.6/s
14:51:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264365/63458641 (11:32:18 / 24:35:40) 487.9/s
14:52:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
14:52:09.373 INFO io.lettuce.core.protocol.ReconnectionHandler	: Reconnected to target.cache.amazonaws.com:6379

Sometimes I see the following stack trace:

15:29:04.475 SEVERE org.springframework.batch.core.step.AbstractStep    : Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
        at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
        at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
        at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
        at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
        at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
        at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
        at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
        at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
        at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
        at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
        at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
        at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
        at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
        at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
        at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
        at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
        ... 21 more

15:29:04.475 INFO org.springframework.batch.core.step.AbstractStep      : Step: [scan-reader] executed in 1h34m57s282ms

During live replication (--mode live), I'm getting the following error:

17:47:45.573 WARNING com.redis.spring.batch.reader.KeyspaceNotificationItemReader       : Could not add key because queue is full
17:47:46.145 SEVERE org.springframework.batch.core.step.AbstractStep    : Encountered an error executing step live-reader in job live-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
        at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
        at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
        at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
        at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
        at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
        at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
        at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
        at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
        at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
        at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
        at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
        at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
        at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
        at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
        at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
        at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
        ... 21 more
Caused by: java.lang.OutOfMemoryError: Java heap space

Source & Target cluster both have 2 nodes in the replication group - each one with 16 CPUs & 64GB RAM.
Replication instance has 32 CPUs & 128GB RAM.

I have tried many options (different --batch sizes, --no-verify, --skip-policy=ALWAYS, multiple threads), but nothing works. Could you take a look and help me?
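
For illustration, one of the combinations I tried looked roughly like this (the values are examples I experimented with, not a known-good configuration, and exact option names may differ between RIOT versions):

JAVA_OPTS="-Xmx64g" ./riot-redis --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 50 --threads 4 --no-verify --skip-policy=ALWAYS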

Thank you,
Nodir

@CrazyFunker

@nfayzullaev did you manage to solve the issue?

We ran into a similar problem, but with TimeoutExceptions.
The tool works on a smaller DB (4 GB in staging), but fails around the 6 GB mark on our 15 GB production dataset.
We tried playing with multiple parameters, increasing and decreasing timeouts, etc., but no luck.
According to AWS metrics we are not exceeding any network bandwidth allowance.

./riot --info --stacktrace -h source.cache.amazonaws.com -p 6379 replicate --mode=SNAPSHOT -h target.cache.amazonaws.com -p 6379 --read-batch=10 --read-pool=8 --metrics --threads 4 --type=DUMP --timeout=30 --read-threads=2 --wait-timeout=4000 --write-pool=4

16:13:12.427 SEVERE org.springframework.batch.core.step.AbstractStep	: Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
	at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
	at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
	at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
	at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
	at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
	at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
	at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
	at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
	at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
	at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
	at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
	at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
	at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
	at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
	at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
	at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
	... 21 more

16:13:12.429 INFO org.springframework.batch.core.step.AbstractStep	: Step: [scan-reader] executed in 46m34s414ms
16:13:12.445 INFO org.springframework.batch.core.launch.support.SimpleJobLauncher	: Job: [SimpleJob: [name=scan-reader]] completeScanning  34% ━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  4792300/13695788 (0:46:35 / 1:26:32) 1714.6/s
16:13:12.964 INFO org.springframework.batch.core.step.AbstractStep	: Step: [snapshot-replication] executed in 46m36s439ms
16:13:13.068 INFO org.springframework.batch.core.job.SimpleStepHandler	: Executing step: [verification]

@hunkjun

hunkjun commented Sep 27, 2023

In my environment I ran into the same problem. Through testing I found that the cause is hot keys and large keys in the source DB, which prevent RIOT from keeping up with the keyspace notifications; that is what leads to the OOM.
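
For anyone else hitting this, a quick way to check for large keys on the source (assuming redis-cli can reach the ElastiCache endpoint) is redis-cli's built-in sampling scan; the -i option throttles the underlying SCAN so the source node is not overloaded:

# Sample the keyspace and report the biggest key of each type
redis-cli -h source.cache.amazonaws.com -p 6379 --bigkeys
# Same scan, but sleep 0.1 s every 100 SCAN commands to reduce load
redis-cli -h source.cache.amazonaws.com -p 6379 -i 0.1 --bigkeys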
