
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file #22

Open
bernardoaraujor opened this issue May 25, 2022 · 30 comments


@bernardoaraujor
Contributor

bernardoaraujor commented May 25, 2022

while testing in kubernetes mode, I am facing the following errors:

$ ./zombienet-linux -p kubernetes test tests/stps/relay.feature
...
    ✔ alice: is up within 600 secs (73ms)
[2022-05-25T18:53:22Z INFO  utils] Checking sTPS pre-conditions (account nonces and free balances).
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "check_pre_conditions" return is 0 within 600 secs (533ms)
[2022-05-25T18:53:22Z INFO  utils] Node 0: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:22Z INFO  utils] Node 2: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:23Z INFO  utils] Node 1: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:23Z INFO  utils] Node 3: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
[2022-05-25T18:53:23Z INFO  utils::sender] Node 0: signing 4096 transactions
rust process exited with code 1
child process exited
[2022-05-25T18:53:23Z INFO  utils::sender] Node 2: signing 4096 transactions
[2022-05-25T18:53:23Z INFO  utils::sender] Node 1: signing 4096 transactions
[2022-05-25T18:54:14Z INFO  utils::sender] Node 0: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 1: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 2: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 0: 350 txs sent in 1172 ms (298.38 /s)
[2022-05-25T18:54:16Z INFO  utils::sender] Node 1: 300 txs sent in 1060 ms (283.01 /s)
[2022-05-25T18:54:16Z INFO  utils::sender] Node 2: 200 txs sent in 1409 ms (141.94 /s)
[2022-05-25T18:54:17Z INFO  utils::sender] Node 1: 600 txs sent in 1035 ms (579.69 /s)
[2022-05-25T18:54:17Z INFO  utils::sender] Node 2: 400 txs sent in 1069 ms (374.12 /s)

A few patterns that caught my attention:

  • this does not happen in native mode.
  • the error message appears right after utils starts reading funded accounts from tests/stps/funded-accounts.json, although there's no RPC call in this step... it's just loading a JSON file locally.
  • the error message appears before any utils::sender starts signing the transactions to be dispatched, which is way before any actual RPC call happens.
  • there's one utils::sender per node. As I increase the number of nodes in the setup, it's always the last utils::sender that dies (e.g.: 5 nodes, no txs are dispatched for node 4)

@niklasad1 does this look like anything related to jsonrpsee to you?
all the points listed above tell me it's not, but maybe your experienced eyes can rule this out from a different perspective.

cc @ggwpez @pepoviola

@ggwpez
Member

ggwpez commented May 25, 2022

Reproduced it with 0.9.22 in local KIND cluster.
Needs investigation.

@niklasad1
Member

niklasad1 commented May 25, 2022

Yes, it's a jsonrpsee error for sure; the WebSocket handshake failed, but I can't tell why without more information.
How many connections are you guys opening to the server?

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

It would be good to get some logs from the polkadot node; an EOF error in the handshake looks weird, FWIW.

@ggwpez
Member

ggwpez commented May 25, 2022

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

Do we have to modify the source for that? It uses a docker image and such...
Is there no -l rpc=trace or equivalent? @niklasad1

@bernardoaraujor
Contributor Author

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

Do we have to modify the source for that? It uses a docker image and such...
Is there no -l rpc=trace or equivalent? @niklasad1

I think niklas meant using tracing-subscriber on our utils code

there's no docker involved in that

@niklasad1
Member

I think niklas meant using tracing-subscriber on our utils code

Yes, it's one line in the utils code

On the polkadot node side you could do:

./polkadot -l "rpc=trace,jsonrpsee=debug"

However, I think Wireshark is the way to go, as this setup seems quite complicated and probably performs a huge number of calls.

@bernardoaraujor

This comment was marked as outdated.

@niklasad1
Member

niklasad1 commented May 26, 2022

@bernardoaraujor

Those nodes are not even running jsonrpsee; they're running v0.9.21, i.e., the old jsonrpc library.
Can you please update them and then I can have a look at the logs?

I guess if you guys could provide some instructions on how to run this locally, then I could have a look.

@bernardoaraujor
Contributor Author

bernardoaraujor commented May 26, 2022

@bernardoaraujor

Those nodes are not even running jsonrpsee; they're running v0.9.21, i.e., the old jsonrpc library. Can you please update them and then I can have a look at the logs?

I guess if you guys could provide some instructions on how to run this locally, then I could have a look.

damn, sorry about that!

here's another set.
this time, only node 2 failed.

there are two ways to reproduce: with zombienet's native mode or kubernetes mode.

doing ./zombienet.sh should bootstrap the project.
in case you want to try kubernetes mode, first you need to ask @pepoviola to give you permissions on the zombienet cluster and run gcloud container clusters get-credentials parity-zombienet --zone europe-west3-b --project parity-zombienet

then you do ./zombienet-linux -p kubernetes test tests/stps/relay.feature

@bernardoaraujor
Contributor Author

another relevant point: if this happens with both 0.9.21 and 0.9.22, maybe it's not really jsonrpsee?

@niklasad1
Member

niklasad1 commented May 26, 2022

another relevant point: if this happens with both 0.9.21 and 0.9.22, maybe it's not really jsonrpsee?

Yeah it at least makes me calmer :)

The logs are pretty much useless on the node side, so you can turn them off; it could also be some nginx/proxy setting or something that rejects the connection. As far as I can see, it never reaches the jsonrpc server code in either v0.9.21 or v0.9.22.

I know pretty much nothing about zombienet or kubernetes :)

On the subxt side, you guys could try to open the WebSocket connection in a loop with some sleeps, to try it a couple of times before killing the node. But again, if you find some logs from subxt/jsonrpsee on the client side, that would be helpful.

@ggwpez
Member

ggwpez commented May 26, 2022

The RPC trace log is blank for the node that does not get the connection. Therefore it could also be the kubernetes/docker stuff not correctly forwarding the port.
To me it looks like the port of the container is not open.
both logs:
bad.log
good.txt

PS: I tried again with a Python script to connect to the IPs printed in the console.
For stps-0 and 1 this works, but not for stps-2. So this is probably a problem with zombienet.
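(The same check also works from a shell instead of Python — the ports below are just examples, substitute the ones zombienet prints for the port-forwards:)

# probe each locally forwarded WS port
for PORT in 34253 43669 35933; do
  nc -z -w 2 127.0.0.1 $PORT && echo "port $PORT open" || echo "port $PORT closed"
done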

@pepoviola
Contributor

PS: I tried again with a Python script to connect to the IPs printed in the console. For stps-0 and 1 this works, but not for stps-2. So this is probably a problem with zombienet.

Hi @ggwpez, I will double-check this. You can't connect to the RPC endpoint?

@bernardoaraujor
Contributor Author

in the Rust code, the error originates at the await? on this line, which is the instantiation of the subxt ClientBuilder for the utils::sender module:

	let api = ClientBuilder::new()
		.set_url(node.clone())
		.build()
		.await?
		.to_runtime_api::<runtime::RuntimeApi<DefaultConfig, PolkadotExtrinsicParams<DefaultConfig>>>(
		);

@niklasad1
Member

niklasad1 commented May 26, 2022

Yes, that's where subxt connects to the RPC endpoint under the hood.

So you could try something like this instead, to attempt the connection a couple of times in case the RPC endpoint is busy or something:

let mut i = 0;

// WsClientBuilder comes from jsonrpsee (the WebSocket client that subxt uses under the hood).
let client = loop {
	// Try to open the WebSocket connection.
	let err = match WsClientBuilder::default()
		.connection_timeout(std::time::Duration::from_secs(60))
		.build(&node)
		.await
	{
		Ok(c) => break c,
		Err(e) => e,
	};

	i += 1;

	// Whatever you think is a reasonable retry limit.
	if i > 10 {
		return Err(err.into());
	}
	tokio::time::sleep(std::time::Duration::from_secs(1)).await;
};

let api = ClientBuilder::new()
		.set_client(client)
		.build()
		.await?
		.to_runtime_api::<runtime::RuntimeApi<DefaultConfig, PolkadotExtrinsicParams<DefaultConfig>>>(
		);

But as it doesn't work for Oliver in Python either, it's most likely something flaky with the RPC endpoint configuration.

@ggwpez
Member

ggwpez commented May 26, 2022

So you could try something like this instead, to attempt the connection a couple of times in case the RPC endpoint is busy or something:

Yeah, I added a retry loop. The first time it gets the EOF error, and then just "connection refused". Nothing in the node log.

@bernardoaraujor
Contributor Author

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

I modified the sender like this:

 async fn main() -> Result<(), Box<dyn std::error::Error>> {
-       env_logger::init_from_env(
-               env_logger::Env::default().filter_or(env_logger::DEFAULT_FILTER_ENV, "info"),
-       );
+       tracing_subscriber::fmt().with_env_filter(tracing_subscriber::EnvFilter::from_default_env()).init();
 
        let cli = Cli::parse();
        match cli.command {

and then called zombienet with:

$ RUST_LOG="jsonrpsee=trace,info,subxt=debug,soketto=trace" ./zombienet-linux -p kubernetes test  tests/stps/relay.feature 

got these huge logs:
logs.txt

cc @niklasad1

@niklasad1
Member

@pepoviola

How are you fetching the WS ports from the "polkadot nodes" to connect to?

From the logs I can see that all nodes are responsive except ws://127.0.0.1:38049, which is surprising; the node itself says :9944, but I guess it's behind something else.

It would be good to know if the actual port is open, like doing netstat or nmap port scanning :)
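E.g. something along these lines (38049 being the unresponsive forwarded port from the logs above; this is just an illustration, not output from the cluster):

netstat -tlnp | grep 38049   # is anything listening on the local forwarded port?
nmap -p 38049 127.0.0.1      # does it accept TCP connections?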

@pepoviola
Contributor

pepoviola commented May 30, 2022

Hi @niklasad1,
Zombienet opens a port-forward between the pod port (9944) and the process using kubectl; that's why you see random ports. I tried on my end and it keeps looping with these logs:

[2022-05-30T16:34:32Z TRACE soketto] read 2 bytes
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: recv: (Text (fin 1) (rsv 000) (mask (0 0)) (len 37))
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: send: (Text (fin 1) (rsv 000) (mask (1 ad7145e8)) (len 123))
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: Sender flushing connection

Also, I just tried to connect using polkadotjs and it works.


It could be that the port-forwarding tunnel is down, but there should be a log about that.

Thanks!

@ggwpez
Member

ggwpez commented May 30, 2022

@pepoviola can I somehow print all currently active port forwardings with kubectl?

@pepoviola
Contributor

@pepoviola can I somehow print all currently active port forwardings with kubectl?

Yes, @ggwpez, you can run this command in another terminal:

ps -f | grep 'kubectl' | grep 'port-forward' | awk '{print "PID:" $2 " " $10 " " $11}'

And you get something like this

PID:43198 Pod/alice :9944
PID:43204 Pod/alice :9615
PID:43277 Pod/stps-0 :9944
PID:43280 Pod/stps-0 :9615
PID:43286 service/tempo-tempo-distributed-query-frontend :3100

And then you can see the open port for each PID

lsof -p <PID> |grep TCP |grep LISTEN

I can write a small bash script if you need.
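Roughly, it would just combine the two commands above — untested sketch:

# for each active kubectl port-forward, print its PID and the local port it listens on
for PID in $(ps -f | grep 'kubectl' | grep 'port-forward' | grep -v grep | awk '{print $2}'); do
  echo "PID: $PID"
  lsof -p "$PID" | grep TCP | grep LISTEN
done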
Thanks!

@bernardoaraujor
Contributor Author

bernardoaraujor commented Jun 1, 2022

I'm investigating from a different perspective.
I spawn a network with: $ ./zombienet-linux -p kubernetes spawn tests/stps/relay.json

Then I manually fill the PORT* variables (taken from zombienet's console logs) and execute the following shell script:

PORT0=34253
PORT1=43669
PORT2=35933

./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT0 --node-index 0 --total-nodes 3 &
./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT1 --node-index 1 --total-nodes 3 &
./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT2 --node-index 2 --total-nodes 3 &

the error never happens with this approach, which indicates there's something about the way zombienet executes the js-script feature with multiple nodes.

@bernardoaraujor
Contributor Author

@pepoviola sorry it seems 1.2.39 still hasn't solved the issue.

this includes the latest nonce changes by @ggwpez

    ✔ alice: is up within 600 secs (82ms)
[2022-06-10T13:03:40Z INFO  utils] Checking sTPS pre-conditions (account nonces and free balances).
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "check_pre_conditions" return is 0 within 600 secs (667ms)
[2022-06-10T13:03:41Z INFO  utils] Node 1: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
rust process exited with code 1
child process exited
creating new port-fw for Pod/stps-1, with map 37347:9944
[2022-06-10T13:03:41Z INFO  utils] Node 0: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-06-10T13:03:41Z INFO  utils] Node 2: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
rust process exited with code 1
child process exited
creating new port-fw for Pod/stps-2, with map 38287:9944
[2022-06-10T13:03:41Z INFO  utils::sender] Node 0: signing 5461 transactions
[2022-06-10T13:03:51Z INFO  utils::sender] Node 0: sending 5461 transactions in chunks of 50
[2022-06-10T13:03:52Z INFO  utils::sender] Node 0: 850 txs sent in 1016 ms (836.18 /s)
[2022-06-10T13:03:53Z INFO  utils::sender] Node 0: 900 txs sent in 1002 ms (898.01 /s)
[2022-06-10T13:03:54Z INFO  utils::sender] Node 0: 900 txs sent in 1016 ms (885.66 /s)

@ggwpez
Member

ggwpez commented Jun 10, 2022

Okay thanks for testing @bernardoaraujor !
One thing that I forgot to add is the automatic re-connect logic; I have it locally but haven't pushed it yet.
AFAIK @pepoviola is now re-starting the port forwarding if it breaks, so we just need to add that re-connection logic.

@bernardoaraujor
Contributor Author

although #28 was merged, the problem remains

@pepoviola
Contributor

although #28 was merged, the problem remains

Hi @bernardoaraujor, I just ran the test and it works as expected.

[2022-06-13T20:46:23Z INFO  utils] Calculating TPS on finalized blocks.
[2022-06-13T20:46:23Z INFO  utils::shared] Attempt #1: Connecting to ws://127.0.0.1:49941
[2022-06-13T20:46:30Z INFO  utils::tps] TPS on block 7: 143.2374
[2022-06-13T20:46:31Z INFO  utils::tps] TPS on block 8: 412.1081
[2022-06-13T20:46:32Z INFO  utils::tps] TPS on block 9: 299.63297
[2022-06-13T20:46:33Z INFO  utils::tps] TPS on block 10: 269.73395
[2022-06-13T20:46:35Z INFO  utils::tps] TPS on block 13: 148.16219
[2022-06-13T20:46:36Z INFO  utils::tps] TPS on block 14: 156.10036
[2022-06-13T20:46:37Z INFO  utils::tps] TPS on block 15: 147.37903
[2022-06-13T20:46:37Z INFO  utils::tps] TPS on block 16: 163.18878
[2022-06-13T20:46:38Z INFO  utils::tps] TPS on block 17: 179.31447
[2022-06-13T20:46:39Z INFO  utils::tps] TPS on block 18: 397.03125
[2022-06-13T20:46:40Z INFO  utils::tps] TPS on block 19: 402.2564
[2022-06-13T20:46:40Z INFO  utils::tps] average TPS: 247.10406
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "calculate_tps,3" return is 0 within 600 secs (17943ms)

	 Node's logs are available in /var/folders/rz/1cyx7hfj31qgb98d8_cg7jwh0000gn/T/zombie-5d5238ed6350cd19da0404799c8fc9dd_-17306-UgKICx49LC80/logs

	 Deleting network


  4 passing (9m)

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.
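The rough idea is to run the sender as a pod inside the cluster, so it reaches the nodes' port 9944 directly over the cluster network instead of going through the kubectl port-forwards. As a sketch only (the image name and the in-cluster node address are placeholders, not what we will actually use):

kubectl run stps-sender --image=<utils-image> --restart=Never -- \
  send-balance-transfers --node-url ws://stps-0:9944 --node-index 0 --total-nodes 3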

Thanks!

@bernardoaraujor
Contributor Author

bernardoaraujor commented Jun 13, 2022

Hi @bernardoaraujor, I just ran the test and it works as expected.

I got it to work a few times too, but it seemed to be just luck. If I try repeatedly, it actually fails more often than not.
Are you successful on every attempt?

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.

awesome!

@bernardoaraujor
Contributor Author

after many repeated tests on v0.9.30, I noticed this issue completely disappeared.

so I incrementally tested every version from v0.9.22 (the last version we tested a few months ago) up to v0.9.30, and found that the error is present on v0.9.26 and disappears from v0.9.27 onwards.

I'm tempted to close this issue as the problem is gone, but it would be interesting to have some theories on what could have been the root cause.

@niklasad1 does this info about the versions bring any insights? Were there any meaningful changes to jsonrpsee between v0.9.26 and v0.9.27?

@niklasad1
Member

niklasad1 commented Oct 27, 2022

@niklasad1 does this info about the versions bring any insights? Were there any meaningful changes to jsonrpsee between v0.9.26 and v0.9.27?

The major thing is that v0.9.27 fixed a CORS bug in the jsonrpsee-ws-server; nothing else really sticks out.
Great 👍

@bernardoaraujor
Contributor Author

reopening, as my investigation in October was a false alarm.

the issue persists; it's just not reproducible in a deterministic way.

@bernardoaraujor
Contributor Author

@pepoviola about this:

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.

Is this already possible on zombienet? I think this is the only viable solution.
