
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file #22

Open
bernardoaraujor opened this issue May 25, 2022 · 30 comments


@bernardoaraujor
Contributor

bernardoaraujor commented May 25, 2022

while testing in kubernetes mode, I am facing the following errors:

$ ./zombienet-linux -p kubernetes test tests/stps/relay.feature
...
    ✔ alice: is up within 600 secs (73ms)
[2022-05-25T18:53:22Z INFO  utils] Checking sTPS pre-conditions (account nonces and free balances).
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "check_pre_conditions" return is 0 within 600 secs (533ms)
[2022-05-25T18:53:22Z INFO  utils] Node 0: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:22Z INFO  utils] Node 2: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:23Z INFO  utils] Node 1: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-05-25T18:53:23Z INFO  utils] Node 3: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
[2022-05-25T18:53:23Z INFO  utils::sender] Node 0: signing 4096 transactions
rust process exited with code 1
child process exited
[2022-05-25T18:53:23Z INFO  utils::sender] Node 2: signing 4096 transactions
[2022-05-25T18:53:23Z INFO  utils::sender] Node 1: signing 4096 transactions
[2022-05-25T18:54:14Z INFO  utils::sender] Node 0: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 1: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 2: sending 4096 transactions in chunks of 50
[2022-05-25T18:54:15Z INFO  utils::sender] Node 0: 350 txs sent in 1172 ms (298.38 /s)
[2022-05-25T18:54:16Z INFO  utils::sender] Node 1: 300 txs sent in 1060 ms (283.01 /s)
[2022-05-25T18:54:16Z INFO  utils::sender] Node 2: 200 txs sent in 1409 ms (141.94 /s)
[2022-05-25T18:54:17Z INFO  utils::sender] Node 1: 600 txs sent in 1035 ms (579.69 /s)
[2022-05-25T18:54:17Z INFO  utils::sender] Node 2: 400 txs sent in 1069 ms (374.12 /s)

A few patterns that caught my attention:

  • this does not happen in native mode.
  • the error message appears right after utils starts reading funded accounts from tests/stps/funded-accounts.json, although there's no RPC call in this step... it's just loading a JSON file locally.
  • the error message appears before any utils::sender starts signing the transactions to be dispatched, which is way before any actual RPC call happens.
  • there's one utils::sender per node. As I increase the number of nodes in the setup, it's always the last utils::sender that dies (e.g.: 5 nodes, no txs are dispatched for node 4)

@niklasad1 does this look like anything related to jsonrpsee to you?
all the points listed above tell me it's not, but maybe your experienced eyes can rule this out from a different perspective.

cc @ggwpez @pepoviola

@ggwpez
Member

ggwpez commented May 25, 2022

Reproduced it with 0.9.22 in local KIND cluster.
Needs investigation.

@niklasad1
Member

niklasad1 commented May 25, 2022

Yes, it's a jsonrpsee error for sure; the WebSocket handshake failed, but I can't tell why without more information.
How many connections are you guys opening to the server?

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

It would be good to get some logs from the polkadot node; an EOF error in the handshake looks weird, FWIW.

@ggwpez
Member

ggwpez commented May 25, 2022

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

Do we have to modify the source for that? It uses a docker image and such...
Is there no -l rpc=trace or equivalent? @niklasad1

@bernardoaraujor
Contributor Author

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

Do we have to modify the source for that? It uses a docker image and such...
Is there no -l rpc=trace or equivalent? @niklasad1

I think niklas meant using tracing-subscriber on our utils code

there's no docker involved in that

@niklasad1
Member

I think niklas meant using tracing-subscriber on our utils code

Yes, it's one line in the utils code

On the polkadot node side you could do:

./polkadot -l "rpc=trace,jsonrpsee=debug"

However, I think Wireshark is the way to go, as this setup seems quite complicated and probably performs a huge number of calls.

@bernardoaraujor

This comment was marked as outdated.

@niklasad1
Member

niklasad1 commented May 26, 2022

@bernardoaraujor

Those nodes are not even running jsonrpsee; they're running v0.9.21, i.e., the old jsonrpc library.
Can you please update them and then I can have a look at the logs?

I guess if you guys could provide some instructions on how to run this locally, then I could have a look.

@bernardoaraujor
Contributor Author

bernardoaraujor commented May 26, 2022

@bernardoaraujor

Those nodes are not even running jsonrpsee; they're running v0.9.21, i.e., the old jsonrpc library. Can you please update them and then I can have a look at the logs?

I guess if you guys could provide some instructions on how to run this locally, then I could have a look.

damn, sorry about that!

here's another set.
this time, only node 2 failed.

there are two ways to reproduce: with zombienet's native mode or kubernetes mode.

doing ./zombienet.sh should bootstrap the project.
in case you want to try kubernetes mode, first you need to ask @pepoviola to give you permissions on the zombienet cluster and run gcloud container clusters get-credentials parity-zombienet --zone europe-west3-b --project parity-zombienet

then you do ./zombienet-linux -p kubernetes test tests/stps/relay.feature

@bernardoaraujor
Contributor Author

another relevant point: if this happens with both 0.9.21 and 0.9.22, maybe it's not really jsonrpsee?

@niklasad1
Member

niklasad1 commented May 26, 2022

another relevant point: if this happens with both 0.9.21 and 0.9.22, maybe it's not really jsonrpsee?

Yeah it at least makes me calmer :)

The logs are pretty much useless on the node side, so you can turn them off; it could also be some nginx/proxy setting or something that rejects the connection. As far as I can see, it never reaches the jsonrpc server code in either v0.9.21 or v0.9.22.

I know pretty much nothing about zombienet or kubernetes :)

On the subxt side, you guys could try to open the WebSocket connection in a loop with some sleeps, to try it a couple of times before killing the node. But again, if you find some logs from subxt/jsonrpsee on the client side, that would be helpful.

@ggwpez
Member

ggwpez commented May 26, 2022

The RPC trace log is blank for the node that does not get the connection. Therefore it could also be the kubernetes/docker stuff not correctly forwarding the port.
To me it looks like the port of the container is not open.
both logs:
bad.log
good.txt

PS: I tried again with a Python script to connect to the IPs printed in the console.
For stps-0 and 1 this works, but not for stps-2. So this is probably a problem with zombienet.
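(The same check also works from a shell instead of Python — the ports below are just examples, substitute the ones zombienet prints for the port-forwards:)

# probe each locally forwarded WS port
for PORT in 34253 43669 35933; do
  nc -z -w 2 127.0.0.1 $PORT && echo "port $PORT open" || echo "port $PORT closed"
done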

@pepoviola
Contributor

PS: I tried again with a Python script to connect to the IPs printed in the console. For stps-0 and 1 this works, but not for stps-2. So this is probably a problem with zombienet.

Hi @ggwpez, I will double-check this. You can't connect to the RPC endpoint?

@bernardoaraujor
Contributor Author

in the Rust code, the error originates at the await? on this line, which is the instantiation of the subxt ClientBuilder for the utils::sender module:

	let api = ClientBuilder::new()
		.set_url(node.clone())
		.build()
		.await?
		.to_runtime_api::<runtime::RuntimeApi<DefaultConfig, PolkadotExtrinsicParams<DefaultConfig>>>(
		);

@niklasad1
Member

niklasad1 commented May 26, 2022

Yes, that's where subxt connects to the RPC endpoint under the hood.

So you could try something like this instead, to attempt the connection a couple of times in case the RPC endpoint is busy or something:

let mut i = 0;

// WsClientBuilder comes from jsonrpsee (the WebSocket client that subxt uses under the hood).
let client = loop {
	// Try to open the WebSocket connection.
	let err = match WsClientBuilder::default()
		.connection_timeout(std::time::Duration::from_secs(60))
		.build(&node)
		.await
	{
		Ok(c) => break c,
		Err(e) => e,
	};

	i += 1;

	// Whatever you think is a reasonable retry limit.
	if i > 10 {
		return Err(err.into());
	}
	tokio::time::sleep(std::time::Duration::from_secs(1)).await;
};

let api = ClientBuilder::new()
		.set_client(client)
		.build()
		.await?
		.to_runtime_api::<runtime::RuntimeApi<DefaultConfig, PolkadotExtrinsicParams<DefaultConfig>>>(
		);

But as it doesn't work for Oliver in Python either, it's most likely something flaky with the RPC endpoint configuration.

@ggwpez
Member

ggwpez commented May 26, 2022

So you could try something like this instead, to attempt the connection a couple of times in case the RPC endpoint is busy or something:

Yeah, I added a retry loop. The first time it gets the EOF error, and then just "connection refused". Nothing in the node log.

@bernardoaraujor
Contributor Author

You can replace env_logger with tracing-subscriber to get the logs from jsonrpsee on the client side.

I modified the sender like this:

 async fn main() -> Result<(), Box<dyn std::error::Error>> {
-       env_logger::init_from_env(
-               env_logger::Env::default().filter_or(env_logger::DEFAULT_FILTER_ENV, "info"),
-       );
+       tracing_subscriber::fmt().with_env_filter(tracing_subscriber::EnvFilter::from_default_env()).init();
 
        let cli = Cli::parse();
        match cli.command {

and then called zombienet with:

$ RUST_LOG="jsonrpsee=trace,info,subxt=debug,soketto=trace" ./zombienet-linux -p kubernetes test  tests/stps/relay.feature 

got these huge logs:
logs.txt

cc @niklasad1

@niklasad1
Member

@pepoviola

How are you fetching the WS ports from the "polkadot nodes" to connect to?

From the logs I can see that all nodes are responsive except ws://127.0.0.1:38049, which is surprising; the node itself says :9944, but I guess it's behind something else.

It would be good to know if the actual port is open, like doing netstat or nmap port scanning :)
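E.g. something along these lines (38049 being the unresponsive forwarded port from the logs above; this is just an illustration, not output from the cluster):

netstat -tlnp | grep 38049   # is anything listening on the local forwarded port?
nmap -p 38049 127.0.0.1      # does it accept TCP connections?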

@pepoviola
Contributor

pepoviola commented May 30, 2022

Hi @niklasad1,
Zombienet opens a port-forward between the pod port (9944) and the process using kubectl; that's why you see random ports. I tried on my end and it keeps looping with these logs:

[2022-05-30T16:34:32Z TRACE soketto] read 2 bytes
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: recv: (Text (fin 1) (rsv 000) (mask (0 0)) (len 37))
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: send: (Text (fin 1) (rsv 000) (mask (1 ad7145e8)) (len 123))
[2022-05-30T16:34:32Z TRACE soketto::connection] 726989bd: Sender flushing connection

Also, I just tried to connect using polkadotjs and it works.


It could be that the port-forwarding tunnel is down, but there should be a log about that.

Thanks!

@ggwpez
Member

ggwpez commented May 30, 2022

@pepoviola can I somehow print all currently active port forwardings with kubectl?

@pepoviola
Contributor

@pepoviola can I somehow print all currently active port forwardings with kubectl?

Yes, @ggwpez, you can run this command in another terminal:

ps -f | grep 'kubectl' | grep 'port-forward' | awk '{print "PID:" $2 " " $10 " " $11}'

And you get something like this

PID:43198 Pod/alice :9944
PID:43204 Pod/alice :9615
PID:43277 Pod/stps-0 :9944
PID:43280 Pod/stps-0 :9615
PID:43286 service/tempo-tempo-distributed-query-frontend :3100

And then you can see the open port for each PID

lsof -p <PID> |grep TCP |grep LISTEN

I can write a small bash script if you need.
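Roughly, it would just combine the two commands above — untested sketch:

# for each active kubectl port-forward, print its PID and the local port it listens on
for PID in $(ps -f | grep 'kubectl' | grep 'port-forward' | grep -v grep | awk '{print $2}'); do
  echo "PID: $PID"
  lsof -p "$PID" | grep TCP | grep LISTEN
done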
Thanks!

@bernardoaraujor
Contributor Author

bernardoaraujor commented Jun 1, 2022

I'm investigating from a different perspective.
I spawn a network with: $ ./zombienet-linux -p kubernetes spawn tests/stps/relay.json

Then I manually fill the PORT* variables (taken from zombienet's console logs) and execute the following shell script:

PORT0=34253
PORT1=43669
PORT2=35933

./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT0 --node-index 0 --total-nodes 3 &
./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT1 --node-index 1 --total-nodes 3 &
./utils/target/release/utils send-balance-transfers --node-url ws://127.0.0.1:$PORT2 --node-index 2 --total-nodes 3 &

the error never happens with this approach, which indicates there's something about the way zombienet executes the js-script feature with multiple nodes.

@bernardoaraujor
Contributor Author

@pepoviola sorry it seems 1.2.39 still hasn't solved the issue.

this includes the latest nonce changes by @ggwpez

    ✔ alice: is up within 600 secs (82ms)
[2022-06-10T13:03:40Z INFO  utils] Checking sTPS pre-conditions (account nonces and free balances).
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "check_pre_conditions" return is 0 within 600 secs (667ms)
[2022-06-10T13:03:41Z INFO  utils] Node 1: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
rust process exited with code 1
child process exited
creating new port-fw for Pod/stps-1, with map 37347:9944
[2022-06-10T13:03:41Z INFO  utils] Node 0: Reading funded accounts from: "tests/stps/funded-accounts.json"
[2022-06-10T13:03:41Z INFO  utils] Node 2: Reading funded accounts from: "tests/stps/funded-accounts.json"
Error: Rpc(Transport(Error in the WebSocket handshake: i/o error: unexpected end of file

Caused by:
    0: i/o error: unexpected end of file
    1: unexpected end of file))
rust process exited with code 1
child process exited
creating new port-fw for Pod/stps-2, with map 38287:9944
[2022-06-10T13:03:41Z INFO  utils::sender] Node 0: signing 5461 transactions
[2022-06-10T13:03:51Z INFO  utils::sender] Node 0: sending 5461 transactions in chunks of 50
[2022-06-10T13:03:52Z INFO  utils::sender] Node 0: 850 txs sent in 1016 ms (836.18 /s)
[2022-06-10T13:03:53Z INFO  utils::sender] Node 0: 900 txs sent in 1002 ms (898.01 /s)
[2022-06-10T13:03:54Z INFO  utils::sender] Node 0: 900 txs sent in 1016 ms (885.66 /s)

@ggwpez
Member

ggwpez commented Jun 10, 2022

Okay thanks for testing @bernardoaraujor !
One thing that I forgot to add is the automatic re-connect logic; I have it locally but haven't pushed it yet.
AFAIK @pepoviola is now re-starting the port forwarding if it breaks, so we just need to add that re-connection logic.

@bernardoaraujor
Contributor Author

although #28 was merged, the problem remains

@pepoviola
Contributor

although #28 was merged, the problem remains

Hi @bernardoaraujor, I just ran the test and it works as expected.

[2022-06-13T20:46:23Z INFO  utils] Calculating TPS on finalized blocks.
[2022-06-13T20:46:23Z INFO  utils::shared] Attempt #1: Connecting to ws://127.0.0.1:49941
[2022-06-13T20:46:30Z INFO  utils::tps] TPS on block 7: 143.2374
[2022-06-13T20:46:31Z INFO  utils::tps] TPS on block 8: 412.1081
[2022-06-13T20:46:32Z INFO  utils::tps] TPS on block 9: 299.63297
[2022-06-13T20:46:33Z INFO  utils::tps] TPS on block 10: 269.73395
[2022-06-13T20:46:35Z INFO  utils::tps] TPS on block 13: 148.16219
[2022-06-13T20:46:36Z INFO  utils::tps] TPS on block 14: 156.10036
[2022-06-13T20:46:37Z INFO  utils::tps] TPS on block 15: 147.37903
[2022-06-13T20:46:37Z INFO  utils::tps] TPS on block 16: 163.18878
[2022-06-13T20:46:38Z INFO  utils::tps] TPS on block 17: 179.31447
[2022-06-13T20:46:39Z INFO  utils::tps] TPS on block 18: 397.03125
[2022-06-13T20:46:40Z INFO  utils::tps] TPS on block 19: 402.2564
[2022-06-13T20:46:40Z INFO  utils::tps] average TPS: 247.10406
rust process exited with code 0
    ✔ alice: js-script ./utils.js with "calculate_tps,3" return is 0 within 600 secs (17943ms)

	 Node's logs are available in /var/folders/rz/1cyx7hfj31qgb98d8_cg7jwh0000gn/T/zombie-5d5238ed6350cd19da0404799c8fc9dd_-17306-UgKICx49LC80/logs

	 Deleting network


  4 passing (9m)

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.
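The rough idea is to run the sender as a pod inside the cluster, so it reaches the nodes' port 9944 directly over the cluster network instead of going through the kubectl port-forwards. As a sketch only (the image name and the in-cluster node address are placeholders, not what we will actually use):

kubectl run stps-sender --image=<utils-image> --restart=Never -- \
  send-balance-transfers --node-url ws://stps-0:9944 --node-index 0 --total-nodes 3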

Thanks!

@bernardoaraujor
Contributor Author

bernardoaraujor commented Jun 13, 2022

Hi @bernardoaraujor, I just ran the test and it works as expected.

I got it to work a few times too, but it seemed to be just luck. If I try repeatedly, it actually fails more often than not.
Are you successful on every attempt?

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.

awesome!

@bernardoaraujor
Contributor Author

after many repeated tests on v0.9.30, I noticed this issue completely disappeared.

so I incrementally tested every version from v0.9.22 (the last version we tested a few months ago) up to v0.9.30, and found that the error is present on v0.9.26 and disappears from v0.9.27 onwards.

I'm tempted to close this issue as the problem is gone, but it would be interesting to have some theories on what could have been the root cause.

@niklasad1 does this info about the versions bring any insights? Were there any meaningful changes to jsonrpsee between v0.9.26 and v0.9.27?

@niklasad1
Member

niklasad1 commented Oct 27, 2022

@niklasad1 does this info about the versions bring any insights? Were there any meaningful changes to jsonrpsee between v0.9.26 and v0.9.27?

The major thing is that v0.9.27 fixed a CORS bug in the jsonrpsee-ws-server; nothing else really sticks out.
Great 👍

@bernardoaraujor
Contributor Author

reopening, as my investigation in October was a false alarm.

the issue persists; it's just not reproducible in a deterministic way.

@bernardoaraujor
Contributor Author

@pepoviola about this:

But I think (and I talked about this with @ggwpez) that moving the sender to a pod in the cluster will remove all the friction we are having here. I haven't made much progress yet, but I will keep posting the progress here.

Is this already possible on zombienet? I think this is the only viable solution.
