Introduce Valkey Over RDMA transport #477

Open
wants to merge 2 commits into base: unstable

Conversation

pizhenwei

Hi,

In June 2021, I created a PR for a Redis Over RDMA proposal. I then did some work to fully abstract the connection layer and make TLS dynamically loadable, so that since Redis 7.2.0 a new connection type can be built into Redis statically or as a separate shared library (loaded by Redis on startup).

Based on the new connection framework, I created a new PR. Several people (@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ) noticed, played with, and tested that PR. However, due to a lack of time and domain knowledge on the maintainers' side, it has been pending for about 2 years.

Changes in this PR:

  • introduce the Valkey Over RDMA specification. (same as the Redis one, and they should stay the same)
  • implement Valkey Over RDMA. (following the Valkey code style)

Finally, if this feature is accepted for merging, I volunteer to maintain it.

pizhenwei and others added 2 commits May 9, 2024 10:38
RDMA is the abbreviation of remote direct memory access. It is a
technology that enables computers in a network to exchange data in
main memory without involving the processor, cache, or operating
system of either computer. This gives RDMA better performance than
TCP: test results show Valkey Over RDMA achieves roughly 2.5x the
QPS with lower latency.

In recent years, RDMA has become popular in data centers; in
particular, the RoCE (RDMA over Converged Ethernet) architecture is
now widely used.

Introduce the Valkey Over RDMA protocol as a new transport for Valkey.
For now, we define 4 commands (a hypothetical wire layout is sketched
after this list):
- GetServerFeature & SetClientFeature: these two commands negotiate
  features for future extension. There is no feature definition in
  this version. Flow control and multi-buffer support may come later,
  which will need feature negotiation.
- Keepalive
- RegisterXferMemory: the heart of transferring the real payload.
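
To make the negotiation flow concrete, here is a minimal sketch of
what a control-command wire layout could look like. The field names,
sizes, and opcode values are illustrative assumptions, not the actual
layout defined in RDMA.md:

/* Hypothetical wire layout for the 4 control commands; for
 * illustration only, NOT the actual layout defined in RDMA.md. */
#include <stdint.h>

typedef enum {
    VALKEY_RDMA_GET_SERVER_FEATURE = 0,
    VALKEY_RDMA_SET_CLIENT_FEATURE = 1,
    VALKEY_RDMA_KEEPALIVE          = 2,
    VALKEY_RDMA_REGISTER_XFER_MEM  = 3,
} ValkeyRdmaOpcode;

typedef struct {
    uint16_t opcode;   /* one of ValkeyRdmaOpcode */
    uint16_t version;  /* protocol version */
    uint32_t features; /* feature bitmap: empty in this version,
                          reserved for flow control / multi-buffer */
    uint64_t addr;     /* RegisterXferMemory: RX buffer address */
    uint32_t rkey;     /* RegisterXferMemory: RX buffer rkey */
    uint32_t length;   /* RegisterXferMemory: RX buffer length */
} ValkeyRdmaCmd;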

The 'TX buffer' and 'RX buffer' are built on RDMA remote memory with
RDMA write/write-with-imm (a minimal verbs-level sketch follows the
list below). The design is similar to (but not the same as) mechanisms
introduced in several papers:
- Socksdirect: datacenter sockets can be fast and compatible
  <https://dl.acm.org/doi/10.1145/3341302.3342071>
- LITE Kernel RDMA Support for Datacenter Applications
  <https://dl.acm.org/doi/abs/10.1145/3132747.3132762>
- FaRM: Fast Remote Memory
  <https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf>
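
As a rough illustration of the data path described above, the
following sketch posts a payload to the peer's RX buffer with a single
RDMA write-with-imm verb via libibverbs. The rdma_tx_ctx structure and
rdma_push() helper are hypothetical names, and the remote address/rkey
and the registered local buffer are assumed to have been exchanged
during connection setup (e.g. via RegisterXferMemory):

#include <arpa/inet.h>        /* htonl */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hypothetical per-connection state; names are illustrative only. */
struct rdma_tx_ctx {
    struct ibv_qp *qp;      /* connected queue pair */
    struct ibv_mr *tx_mr;   /* local memory region covering tx_buf */
    char *tx_buf;           /* local TX buffer */
    uint64_t remote_addr;   /* peer RX buffer address (from setup) */
    uint32_t remote_rkey;   /* peer RX buffer rkey (from setup) */
};

/* Push 'len' payload bytes into the peer's RX buffer and signal it
 * with the immediate value, all in one verb and with no remote CPU
 * involvement on the data path. */
static int rdma_push(struct rdma_tx_ctx *ctx, const void *payload, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)ctx->tx_buf,
        .length = len,
        .lkey = ctx->tx_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memcpy(ctx->tx_buf, payload, len);

    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM; /* write + notify in one verb */
    wr.imm_data = htonl(len);               /* peer learns the payload length */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = ctx->remote_addr;
    wr.wr.rdma.rkey = ctx->remote_rkey;

    /* The peer consumes the immediate as a completion on a pre-posted
     * receive, then reads the payload from its RX buffer. */
    return ibv_post_send(ctx->qp, &wr, &bad_wr);
}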

Co-authored-by: Xinhao Kong <xinhao.kong@duke.edu>
Co-authored-by: Huaping Zhou <zhouhuaping.san@bytedance.com>
Co-authored-by: zhuo jiang <jiangzhuo.cs@bytedance.com>
Co-authored-by: Yiming Zhang <zhangyiming1201@bytedance.com>
Co-authored-by: Jianxi Ye <jianxi.ye@bytedance.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Main changes in this patch:
* implement *Valkey Over RDMA* protocol, see *Protocol* section in RDMA.md
* implement the server side as a connection module only; this means RDMA
  support can *NOT* be compiled as a built-in.
* add necessary information in RDMA.md
* support 'CONFIG SET/GET', for example 'CONFIG SET rdma.port 6380'; this
  can then be verified with 'rdma res show cm_id' and valkey-cli (with RDMA
  support, which is not implemented in this patch)
* the full listener list looks like:
    listener0:name=tcp,bind=*,bind=-::*,port=6379
    listener1:name=unix,bind=/var/run/valkey.sock
    listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
    listener3:name=tls,bind=*,bind=-::*,port=16379

valgrind test works fine:
valgrind --track-origins=yes --suppressions=./src/valgrind.sup
         --show-reachable=no --show-possibly-lost=no --leak-check=full
         --log-file=err.txt ./src/valkey-server --port 6379
         --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx
         --loglevel verbose --protected-mode no --server_cpulist 2
         --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3
         --appendonly no

performance test:
server side: ./src/valkey-server --port 6379 # TCP port 6379 has no conflict with RDMA port 6379
             --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx bind=yy.yy.yy.yy
             --loglevel verbose --protected-mode no --server_cpulist 2 --bio_cpulist 3
             --aof_rewrite_cpulist 3 --bgsave_cpulist 3 --appendonly no

build a valkey-benchmark with RDMA support (not implemented in this patch),
run on an x86 server (Intel Platinum 8260) with a RoCEv2 interface
(Mellanox ConnectX-5):
client side: ./src/valkey-benchmark -h xx.xx.xx.xx -p 6379 -c 30 -n 10000000 --threads 4
             -d 1024 -t ping,get,set --rdma
====== PING_INLINE ======
480561.28 requests per second, 0.060 msec avg latency.

====== PING_MBULK ======
540482.06 requests per second, 0.053 msec avg latency.

====== SET ======
399952.00 requests per second, 0.073 msec avg latency.

====== GET ======
443498.31 requests per second, 0.065 msec avg latency.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.81%. Comparing base (6cff0d6) to head (36bd4e5).
Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #477      +/-   ##
============================================
+ Coverage     68.91%   69.81%   +0.90%     
============================================
  Files           109      109              
  Lines         61792    61792              
============================================
+ Hits          42581    43138     +557     
+ Misses        19211    18654     -557     

see 21 files with indirect coverage changes

@pizhenwei
Author

This PR can be tested from the client side.

To build the client with RDMA support:

make BUILD_RDMA=yes -j16

To test with the following commands:

Config of redis: appendonly no, port 6379, rdma-port 6379,
                 server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma \
          --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3
For TCP:  ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

@madolson madolson added the major-decision-pending Major decision pending by TSC team label May 12, 2024
@hz-cheng

hz-cheng commented May 20, 2024

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP sockets, the use of RDMA can significantly enhance performance.

Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are shown in the following table. Apart from LRANGE_100 (which improves, though not substantially), in the other scenarios (PING, SET, GET) throughput increases by at least 76%, and the average (AVG) and P99 latencies are reduced by at least 40%.

| Metric | RDMA | TCP | RDMA/TCP |
| --- | --- | --- | --- |
| **PING_INLINE** | | | |
| Throughput (req/s) | 666577.81 | 366394.31 | 181.93% |
| Latency-AVG (msec) | 0.044 | 0.08 | 55.00% |
| Latency-P99 (msec) | 0.063 | 0.127 | 49.61% |
| **PING_MBULK** | | | |
| Throughput (req/s) | 688657.81 | 395397.56 | 174.17% |
| Latency-AVG (msec) | 0.042 | 0.073 | 57.53% |
| Latency-P99 (msec) | 0.063 | 0.119 | 52.94% |
| **SET** | | | |
| Throughput (req/s) | 434744.78 | 157726.22 | 275.63% |
| Latency-AVG (msec) | 0.068 | 0.188 | 36.17% |
| Latency-P99 (msec) | 0.111 | 0.183 | 60.66% |
| **GET** | | | |
| Throughput (req/s) | 562587.94 | 319478.59 | 176.10% |
| Latency-AVG (msec) | 0.052 | 0.091 | 57.14% |
| Latency-P99 (msec) | 0.079 | 0.151 | 52.32% |
| **LRANGE** | | | |
| Throughput (req/s) | 526260.38 | 211434.36 | 248.90% |
| Latency-AVG (msec) | 0.056 | 0.14 | 40.00% |
| Latency-P99 (msec) | 0.079 | 0.159 | 49.69% |
| **LRANGE_100** | | | |
| Throughput (req/s) | 57106.96 | 49498.34 | 115.37% |
| Latency-AVG (msec) | 0.427 | 0.499 | 85.57% |
| Latency-P99 (msec) | 4.207 | 13.367 | 31.47% |

@pizhenwei
Author

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. [...]

Hi @hz-cheng,

I noticed that you are the author of the Alibaba Cloud erdma driver for both the Linux kernel and rdma-core. Cooooooooool!

@hz-cheng

hz-cheng commented May 21, 2024

Furthermore, if necessary, I could try reaching out to the relevant colleagues to see whether we can offer some Alibaba Cloud ECS instances to the community for free, so that the community can use and test Valkey over RDMA, and also use them for future CI/CD purposes.

@baronwangr

Is there a corresponding client that enables RDMA?

@pizhenwei
Author

> Is there a corresponding client that enables RDMA?

Please see this comment.

@pizhenwei
Author

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. [...]

Hi @madolson,
The feedback from the cloud vendor (Alibaba Cloud) confirms the improvement, which means many end users will be able to benefit from this easily. Please let me know if you have any concerns about this feature.
