
Can't add new command when connection is in closed state #2321

Closed
tuananh opened this issue Nov 13, 2017 · 44 comments

tuananh commented Nov 13, 2017

This happened to me today when I upgraded to 0.14.0. I'm not sure when the bug was introduced.

Also, I saw this along with the above message: This socket has been ended by the other party

The previously working version seems to be 0.13.0.

Node 8.9.1, knex 0.14.0, mysql2 1.4.2

elhigu commented Nov 13, 2017

Which DB? How can this be reproduced? Has it worked in some earlier version?

Also, please try all knex versions to pinpoint the release where this behaviour changed.

tuananh commented Nov 13, 2017

Updated.

elhigu commented Nov 13, 2017

> Which DB? How can this be reproduced? Has it worked in some earlier version?

OK, so it seems to be a new bug. Could you provide code to reproduce this problem?

tuananh commented Nov 13, 2017

I'm not sure how it happens, but when it does, a simple knex.select(1) will fail.

@leebenson

@elhigu - per #1833 (comment), I'm seeing this error when I have long-running connections (typically 1-3 days old) that have perhaps timed out or just disconnected.

I'm also seeing:

{ Error: This socket has been ended by the other party
at TLSSocket.writeAfterFIN [as write] (net.js:355:12)
at Connection.connection.write (/app/node_modules/mysql2/lib/connection.js:389:20)
at Connection.writePacket (/app/node_modules/mysql2/lib/connection.js:306:10)
at Query.start (/app/node_modules/mysql2/lib/commands/query.js:43:14)
at Query.Command.execute (/app/node_modules/mysql2/lib/commands/command.js:40:20)
at Connection.handlePacket (/app/node_modules/mysql2/lib/connection.js:515:28)
at Connection.addCommand (/app/node_modules/mysql2/lib/connection.js:536:10)
at Connection.query (/app/node_modules/mysql2/lib/connection.js:621:15)
at /app/node_modules/knex/lib/dialects/mysql/index.js:156:18
at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:303:9)
at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:483:18)
at new Promise (/app/node_modules/bluebird/js/release/promise.js:79:10)
at Client_MySQL2._query (/app/node_modules/knex/lib/dialects/mysql/index.js:150:12)
at Client_MySQL2.query (/app/node_modules/knex/lib/client.js:211:17)
at Runner.<anonymous> (/app/node_modules/knex/lib/runner.js:149:36)
at Runner.tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Runner.query (/app/node_modules/bluebird/js/release/method.js:15:34)
at /app/node_modules/knex/lib/runner.js:61:21
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at /app/node_modules/bluebird/js/release/using.js:185:26
at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:512:31)
at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:569:18)
at Promise._settlePromise0 (/app/node_modules/bluebird/js/release/promise.js:614:10)
at Promise._settlePromises (/app/node_modules/bluebird/js/release/promise.js:693:18)
at Promise._fulfill (/app/node_modules/bluebird/js/release/promise.js:638:18)
at PromiseArray._resolve (/app/node_modules/bluebird/js/release/promise_array.js:126:19)
at PromiseArray._promiseFulfilled (/app/node_modules/bluebird/js/release/promise_array.js:144:14) code: 'EPIPE', fatal: true }

... just like @tuananh

tuananh commented Nov 13, 2017

> I'm seeing this error when I have long-running connections (typically 1-3 days old) that have perhaps timed out or just disconnected.

Yes, it seems to be the same in my case as well.

rkaw92 commented Nov 14, 2017

Hi, just today we've observed (by placing a few good ol' console.logs) that the factory.validate function which knex.js provides to generic-pool is never called, which might explain how already-dead connections are still being dispensed from the resource pool. Supposedly, generic-pool will validate resources before dispensing, and if they seem broken, create and dispense a new one (in this case, a connection) as a replacement. However, this only happens if the right options are passed to generic-pool: https://github.com/coopernurse/node-pool/blob/master/lib/Pool.js#L222

As seen in the code above, the pool only actually runs the validator function if the testOnBorrow option has been enabled. It is false by default, and knex.js does not seem to override it when creating the pool.
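
For illustration, a minimal sketch (not knex's actual code; createDbConnection is a hypothetical factory) of how generic-pool only consults factory.validate when testOnBorrow is enabled:

const genericPool = require('generic-pool');

const factory = {
  create: () => createDbConnection(),   // hypothetical: opens a raw DB connection
  destroy: (conn) => conn.end(),        // closes it again
  // Only consulted on acquire when the pool was created with testOnBorrow: true
  validate: (conn) => Promise.resolve(!conn.__knex__disposed),
};

const pool = genericPool.createPool(factory, {
  min: 2,
  max: 10,
  testOnBorrow: true, // without this flag, validate() is skipped and dead connections can be handed out
});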

elhigu commented Nov 15, 2017

@rkaw92 Sounds good; now we just need a test case and to make knex always enable that setting. I will do that once I get another bug that I'm working on fixed first.

elhigu commented Nov 15, 2017

I debugged issue #2325 last night; it is also pool related, and I already have a test case for it. I probably need to review the whole pool code and check why connection error exceptions are not passed to knex and why dead connections are not evicted.

rkaw92 commented Nov 15, 2017

@elhigu As far as I can tell, knex does react to connection errors by setting the __knex__disposed property - it's just that the pool isn't using the validator so it doesn't know to evict those "tagged" connections.

Now, I'm also tracking this pg pull request which fixes some pg-specific issues regarding connection loss. If you were testing with pg, you might have run into this, instead. No idea about other drivers, though.

@leebenson

Thanks for tracking this @rkaw92, appreciated. FWIW, I'm using mysql2, so the errors are present with other drivers.

Hoping we can figure this out soon in our build. At least once every day or two, the API process needs restarting because of this; order forms, member dashboards, everything goes down. I wound up putting in a hard process.exit(1) so that Kubernetes rebuilds the lost Pod, and those few minutes of respawn time are mounting up. I certainly appreciate your help, @rkaw92!

rkaw92 commented Nov 16, 2017

@leebenson One thing to try without code modification is to pass testOnBorrow to knex, like so:

const db = require('knex')({
  client: '...',
  connection: '...',
  pool: { testOnBorrow: true }
})

This should tell you if it's caused by the connection recycling bug. Of course, it's just a temporary workaround, but it would be helpful to see whether it's this or another sneaky one.

If successful, then when a connection fails you're going to see this printed to the console:

Knex:warning - Connection Error: Connection ended unexpectedly

And the next thing should be the pool allocating a fresh connection for you.

@leebenson

Good to know, thanks. Should min and max values still work with testOnBorrow: true?

rkaw92 commented Nov 16, 2017

Yeah, you can pass other pool options, too.

Yunoo commented Nov 16, 2017

@rkaw92 I have been testing this workaround since yesterday morning and it seems to work fine. Before that, I had to restart API processes at least once a day, otherwise the database connection would eventually close without reconnecting.

@leebenson

Ah, drats. Seeing this in the logs, even with testOnBorrow: true:

This socket has been ended by the other party

rkaw92 commented Nov 17, 2017 via email

@leebenson

Thanks, I'll add the extra debug checks. And of course, sockets do get closed. But the expectation of a pool is that it anticipates sockets being closed and reconnects automatically/silently to reattempt the query. Isn't that the point of a pool? If something drops, handle the logic to reconnect?

Throwing an error every time a long-running socket disconnects is definitely not expected. Connections will drop multiple times per day; that much is expected. What's not expected is that I have to design for that scenario in code. I'm not even sure how that'd work. Wrap each SQL call in a block that traps the error and retries the same SQL query? Implement timeouts? Handle retry logic? I figured all of that was the domain of a connection pool.

rkaw92 commented Nov 17, 2017

@leebenson You're correct in saying that the job of the pool is to handle reconnection. However, it does this at very specific points: before handing out the client (connection) object to the user. When the client is in the user's hands, there are several reasons why it can't transparently reconnect or retry queries:

  1. It represents a real connection to the database - that is, a stream. A stream's lifecycle ends when it closes and a new stream is required, so the old one must be returned for recycling. (That a knex client object maps 1:1 to a real connection is this library's design choice and I won't go into a for/against about it, but it seems to be the standard in DB libraries.)
  2. SQL connections are stateful. There is an amount of state (settings such as ANSI_QUOTES for MySQL; transaction state) associated with each, and knex does not do state tracking internally (that would require understanding dialect specifics). One exception is transaction state - it's for this reason that .transaction() manages getting/returning clients to the pool.
  3. SQL queries on their own are not retriable in the general case - imagine an application that increases a customer's account balance via an UPDATE query. Without application-specific knowledge of how to de-duplicate such an operation, it would become quite profitable to pull out a network cable every now and then...

Assuming the pool now works (i.e. the connections it gives out are healthy at the time of dispensing from the pool), all that remains is handling errors that occur mid-transaction. A common strategy is to retry the entire transaction and make the transaction block self-sufficient (i.e. it assumes no previous state). It doesn't seem too different from retrying transactions that fail to COMMIT because of serialization failures (in SERIALIZABLE), either.
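
As a rough illustration of that strategy (a sketch only; the attempt limit and the decision about which errors are safe to retry are assumptions, not knex features):

async function runWithRetry(knex, work, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      // The transaction callback must be self-sufficient: it may run more than once.
      return await knex.transaction((trx) => work(trx));
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // In a real application, inspect err here and rethrow immediately unless it is
      // a transient failure (lost connection, serialization failure).
    }
  }
}

// usage: await runWithRetry(db, (trx) => trx('accounts').where({ id: 1 }).increment('balance', 10));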

Of course, if the periods between acquiring and releasing a connection from the pool in your code are supposed to be very brief (i.e. no long-lived transactions with prolonged idle periods mid-transaction), and you're still getting socket errors, something might be going on (you'd expect connections to time out when there's no activity, not in the middle of it!). Still, some load-balancing solutions (HAProxy?) like to limit the max total duration of a connection and can be set to not care about idleness.

Definitely check whether the connections you're getting are the same old broken ones, or new ones that just disconnect because of network conditions. Also, determining whether the DB is behind some kind of proxy might tell you something about who to blame for the disconnects.

@leebenson

Thanks for the detailed insight, @rkaw92.

In this particular case, the MySQL instance is hosted on Google Cloud. No proxies are being used. Something is definitely wrong somewhere -- I'm seeing This socket has been ended by the other party both in long-running production (Docker instances hosted via a Kubernetes cluster, connected across the Google Cloud network) and on my local machine connecting to the public IP, where the connection has barely been open for 5 minutes.

With testOnBorrow: true, resubmitting the query a few times seems to re-connect the server and execute successfully eventually, but I'm running into this far too often for this to be normal. 1 in every 3 SQL statements should not be failing. I've never encountered this before in any other library, and I'm not really sure what I should be doing to guard against it.

Are there any code examples for handling timeouts and testing whether a connection is still alive before executing SQL?

I also noticed this in the generic-pool readme:

> testOnBorrow: boolean: should the pool validate resources before giving them to clients. Requires that either factory.validate or factory.validateAsync to be specified

Are any of those specified by Knex (like a default select 1)? Should I be doing something manually there?

I appreciate your help with this. Knex is a really awesome query builder, but this connection stuff has me stumped.

elhigu commented Nov 17, 2017

Indeed, it looks like a validate rejection doesn't automatically try to find a working connection... I already have a test case for this, but still need to figure out a workaround.

rkaw92 commented Nov 17, 2017 via email

rkaw92 commented Nov 17, 2017

@leebenson Knex passes a pool validator function which checks a flag and rejects the connection if the flag has been set. In theory, when the connection breaks, the underlying driver should asynchronously set the flag so that it's picked up when the connection is being acquired from the pool.
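
Roughly, the mechanism looks like this (a paraphrased sketch of the idea, not knex's exact internals; the function name is made up, the property name comes from the earlier comment):

function trackDisposal(connection) {
  // When the driver reports an error on the socket, tag the connection so the
  // pool validator (which checks __knex__disposed) rejects and evicts it later.
  connection.on('error', (err) => {
    connection.__knex__disposed = err;
  });
  return connection;
}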

I'm starting to think this may be mysql2-related: sidorares/node-mysql2#447

If mysql2 suffers from an error-event-eating bug, then this would be the result. Whether it does, I currently have no way of telling. It may be worthwhile to pursue the mysql driver issue mentioned above and, at the same time, simply retry the queries/transactions for the time being. I do not believe you should (or need to) be testing connection readiness manually, in any case.

tswayne commented Nov 17, 2017

FWIW, I have also verified this bug with the standard mysql driver (as well as mysql2), and it was fixed by bumping the version back down. It also occurred with long-running connections, but I can reproduce it by restarting MySQL while the app using knex is running. @elhigu thanks for the quick response to the issue!

elhigu commented Nov 19, 2017

@rkaw92 @tswayne I have now written a test for this and tried the testOnBorrow: true setting again. For some reason the pool validate functionality seems to work correctly, and I can no longer reproduce the problem where connections closed by the DB are returned from the pool.

The validate factory method does indeed seem to do the reasonable thing: if a resource is bad, it evicts it from the pool, tries to create a new resource to replace it, and ultimately returns the fresh resource. I'll be releasing 0.14.1 in a few hours.

elhigu commented Nov 19, 2017

Hmm... looks like there is still something fishy going on. Running the test case alone worked fine, but running all tests still gives Error: select 1 - This socket has been ended by the other party, so maybe I won't be releasing in a few hours after all... gotta keep on debugging a bit more.

Edit: actually, this last problem seems like a timing issue. The DB error event of the closed connection had not been emitted yet when I tried to make new queries on killed connections, so validate did pass but the connection didn't work. I don't think there is anything we can do to make that part work better (well, we could make an async check, but that would be too wasteful); a 2ms delay was enough locally to get the error events registered before making new queries.

elhigu commented Nov 19, 2017

Please test this with 0.14.1 and let's reopen if there are still problems.

@leebenson

Same issue with 0.14.1:

This socket has been ended by the other party

elhigu commented Nov 25, 2017

@leebenson please open a new issue, share a test case or a guide on how to reproduce it, and say which mysql driver you are using. The only way that could happen that I can think of is that for some reason the connection-closed event is not recognised... but according to the test case that I wrote, it is. There may be a window of a few milliseconds between the connection closing and the pool noticing it, so if you manage to hit that window with a query you would get that error.

@leebenson

@elhigu - this is my connection config:

const connection = knex({
    client: "mysql2",
    connection: {
        database: process.env.MYSQL_DATABASE,
        host: process.env.MYSQL_HOST,
        password: process.env.MYSQL_PASSWORD,
        ssl: {
            ca: getCert("server-ca"), // `getCert` is a local function to get the UTF-8 cert string
            cert: getCert("client-cert"),
            key: getCert("client-key"),
        },
        user: process.env.MYSQL_USER,
    },
    pool: {
        max: 5,
        min: 2,
        testOnBorrow: true,
    },
});

I'm accessing the DB very infrequently - literally 2-5 SQL commands per day, max. It's used as part of an API server that is processing sales via a new (limited access) order form.

The API server is loaded from a Docker instance running on Google Container Engine. MySQL is running as a Google-managed DB and is connected to directly over public IPv4.

I see This socket has been ended by the other party literally once per day. In the error logs, the error appears twice (i.e. the customer tries to order, sees an error, and tries again). On the third attempt, it works. The only parallel I see is that the pool is set to min: 2 -- no idea if this is just a coincidence. I'm presuming this is because connections have been idle for a while... but I can't tell why new connections aren't being created (or even tested).

I'll need to switch to another lib as a priority because the issue is surfacing daily at point-of-sale and effectively blocking the API. I'll try to get some test cases up when I have some time to play with it. But I think if I'm facing this issue, others must be too... I'm not doing anything exciting, just connecting to the MySQL DB and issuing very infrequent SQL against it.

elhigu commented Nov 25, 2017

@leebenson for 0.14.1 I wrote a test case which killed connections and then tried to use them. I had that exact error message on 0.14.0 when testOnBorrow was set. In 0.14.1 I fixed that problem... so if you cannot provide a test case, I cannot know whether the problem is now in your setup or happening generally to everyone. On 0.14.1 testOnBorrow is set automatically, so you can at least try dropping that option.

@leebenson

Will set up some test cases as soon as I can. Out of interest, how did you kill the connection in your test? Did you do that via knex, or did you somehow (more abruptly) pull the plug? Did you also test timeouts? Thanks for your help.

elhigu commented Nov 26, 2017

@leebenson I just told MySQL to kill the connection (https://github.com/tgriesser/knex/blob/master/test/tape/pool.js#L105). There are also some Docker-based tests where the actual server process is shut down. No, I haven't tested MySQL timeouts... maybe I could check whether they can be set for each connection separately.

@leebenson

@elhigu - after a bit of local testing, it seems that Google Cloud SQL simply times out after 10/15 minutes of idle usage.

Apologies if this has been answered elsewhere (I couldn't see anything obvious in the docs), but is there a way to set an 'idle' connection timeout so that an 'old' connection won't even be attempted?

elhigu commented Dec 2, 2017

@leebenson it actually really isn't even attempted. It is attempted only in the sense that:

  1. the pool tries to validate the connection (it checks one internal variable)
  2. if there has been an error, it discards the old resource and calls create
  3. after getting a new connection, the query is sent.

There are also parameters for setting an idle timeout (see https://github.com/coopernurse/node-pool). I believe that in addition to the idle timeout config (defaults to 30s) one must also set evictionRunIntervalMillis. I haven't used those, so I don't know if they work the way I believe they do.
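
For example, something along these lines might work (untested; whether knex forwards these generic-pool options unchanged, and the exact values, are assumptions):

const db = require('knex')({
  client: 'mysql2',
  connection: { /* ... */ },
  pool: {
    min: 2,
    max: 5,
    // Evict connections that have sat idle longer than the server-side timeout...
    idleTimeoutMillis: 5 * 60 * 1000,
    // ...but eviction only happens if an eviction run interval is also configured.
    evictionRunIntervalMillis: 60 * 1000,
  },
});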

@leebenson

Thanks @elhigu. There's a thread I've commented on here that @rkaw92 mentioned earlier, which I think is the real underlying issue. The problem seems to be the mysql2 driver not evicting connections.

A workaround suggested in that thread (using sequelize) is to attempt a select 1 before every SQL query, which is obviously far too expensive at scale.

I'm baffled that default logic for handling socket timeouts seems to have been overlooked in the driver. This doesn't just affect Knex; any ORM that uses mysql2 is affected.

In low-traffic environments that use Google Cloud SQL and other managed cloud DBs, which typically have an idle timeout, it'll result in This socket has been ended by the other party every time that 10-15 minute window occurs without a SQL command.

I'm amazed this hasn't been reported more.

@leebenson

> There are also parameters for setting an idle timeout. I believe that in addition to the idle timeout config (defaults to 30s) one must also set evictionRunIntervalMillis. I haven't used those, so I don't know if they work the way I believe they do.

I'll test this today, thanks @elhigu.

elhigu commented Dec 2, 2017

> Thanks @elhigu. There's a thread I've commented on here that @rkaw92 mentioned earlier, which I think is the real underlying issue. The problem seems to be the mysql2 driver not evicting connections.

That might be the case sometimes, but in my tests mysql2 was emitting a "connection was closed" error (or something like that) when I killed the connection in the knex tests here: https://github.com/tgriesser/knex/blob/master/test/tape/pool.js#L105

I'll try to reproduce and see whether that really is a problem in mysql2. If it is, maybe knex with the mysql2 driver should set some defaults for generic-pool so that the pool evicts the connection before MySQL closes it.

@leebenson

> That might be the case sometimes, but in my tests mysql2 was emitting a "connection was closed" error (or something like that) when I killed the connection in the knex tests here: https://github.com/tgriesser/knex/blob/master/test/tape/pool.js#L105

I'm still wondering if there's a distinction between a connection that is explicitly terminated, and one where the connected socket simply times out.

elhigu commented Dec 2, 2017

I just tried setting an idle connection timeout:

Mikaels-MacBook-Pro-2:objection.js mikaelle$ cat test-mysql2.js 
async function main() {
  const  mysql = require('mysql2/promise');
  const connection = await mysql.createConnection({host:'localhost', user: 'root', database: 'knex_test'});
  connection.on('error', e => console.log(e));
  await connection.execute('set global interactive_timeout=3;');
  await connection.execute('set global wait_timeout=3;');
  const [rows, fields] = await connection.execute('SELECT CONNECTION_ID();');
  console.log('PROCESS', rows);
}

main();
Mikaels-MacBook-Pro-2:objection.js mikaelle$ time node test-mysql2.js 
PROCESS [ BinaryRow { 'CONNECTION_ID()': 23 } ]
{ Error: Connection lost: The server closed the connection.
    at Socket.<anonymous> (/Users/mikaelle/Projects/Vincit/yessql/objection.js/node_modules/mysql2/lib/connection.js:113:35)
    at emitNone (events.js:111:20)
    at Socket.emit (events.js:208:7)
    at endReadableNT (_stream_readable.js:1056:12)
    at _combinedTickCallback (internal/process/next_tick.js:138:11)
    at process._tickCallback (internal/process/next_tick.js:180:9) fatal: true, code: 'PROTOCOL_CONNECTION_LOST' }

real	0m3.152s
user	0m0.120s
sys	0m0.025s
Mikaels-MacBook-Pro-2:objection.js mikaelle$ 

Seems to work as expected. Maybe the linked issue exists only when using the pool from the mysql driver.

@leebenson

FWIW, I switched to Sequelize and the problem disappeared. Not sure if they're setting evictionRunIntervalMillis under the hood, but they have a pool function that looks like this to detect whether the connection has dropped.

cyorobert commented Dec 19, 2017

Any update on this issue, @elhigu? I just updated to 0.14.2 in production and began seeing the warning Knex:warning - Connection Error: Error: Connection lost: The server closed the connection. come up every couple of minutes when I run transactions.

Oddly enough, this did not occur locally for me running Node 7.10.1 on a MacBook Pro. Is there a reason why running transactions would cause the server to close connections?

I ended up downgrading to 0.13.0, which resolved the issue.

elhigu commented Dec 29, 2017

@cyorobert do you have some code example that I could use to reproduce the behaviour?

@RobertYoung

I found that this was being caused by importing a large file. I fixed it by setting "max_allowed_packet=32505856".
