
lib/client.js this.activeQuery sometimes null #3174

Open
rhodey opened this issue Mar 20, 2024 · 14 comments

@rhodey

rhodey commented Mar 20, 2024

Hello All, thanks to all the contributors who keep this project going, I have an issue to report

Before today I had run into this issue in prod approx 6 times total in ~12 months, but today alone, for some reason, it has turned up another six times.

We have approx 24 services with 200 tasks running on AWS ECS, and 90% of these services make use of this pg library in one way or another. Another maybe important note is that they all use pg.Pool.

So the stack trace we get in prod from multiple services is:

TypeError: Cannot read properties of null (reading 'name')
at Client._handleParseComplete (/app/node_modules/pg/lib/client.js:380:26)
at Connection.emit (node:events:517:28)
at Connection.emit (node:domain:489:12)
at /app/node_modules/pg/lib/connection.js:116:12
at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:40:17)
at Socket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)
at Socket.emit (node:events:517:28)
at Socket.emit (node:domain:489:12)
at addChunk (node:internal/streams/readable:335:12)
at readableAddChunk (node:internal/streams/readable:308:9)

Most of that stack trace does not say a lot, but I think it is pointing to this line:

if (this.activeQuery.name) {

And it must be saying that this.activeQuery is null

Now I could open a PR and modify this function:

  _handleParseComplete(msg) {
    // if a prepared statement has a name and properly parses
    // we track that its already been executed so we don't parse
    // it again on the same client
    if (this.activeQuery.name) {
      this.connection.parsedStatements[this.activeQuery.name] = this.activeQuery.text
    }
  }

And simply change it to:

  _handleParseComplete(msg) {
    // if a prepared statement has a name and properly parses
    // we track that its already been executed so we don't parse
    // it again on the same client
    if (this.activeQuery && this.activeQuery.name) {
      this.connection.parsedStatements[this.activeQuery.name] = this.activeQuery.text
    }
  }

There is a chance that is all that is needed, but someone more knowledgeable about the library would know whether this is maybe a symptom of a larger problem. Anyone online care to comment?

I am concerned about patching the library and testing it out on our prod stack because I don't want my edit to cause the lib to somehow be confused and make a serious error.

Can anyone comment on whether this edit seems like a safe thing to try? We are not able to reproduce this outside of prod.

I can say that none of the services affected today use transactions, maybe that is helpful info.

pg version 8.10.0

Thanks!!

@brianc
Owner

brianc commented Mar 20, 2024

First of all: sorry that's happening. That's troubling to hear. Do you have any self-contained way to reproduce the issue, or more about your environment you could describe? Having a way to reproduce this would be wonderful, as I'm sure it's an easy fix when it comes down to it.

That code is fine to completely skip or comment out on a fork of your own. I doubt you're even using the name config property on a query, and if you are, this simply disables the query's "only do the parse step once" part of the extended query protocol. It's usually a very small optimization to use named queries/statements. I'm not sure that's the only place that assumes activeQuery is not null (it shouldn't be null if you're receiving a parse complete message, because that means you're in the middle of the extended query protocol and definitely in the middle of an active query). Anyways... tl;dr - feel free to make it on your side. I really wish there was some way we could repro that here, as it's something that deserves a fix if it's a problem in our code.

Are you using pg-bouncer?
What version of postgres are you using, btw?
Do you know what query is triggering this?

I'll take a look at the code a bit more in the coming weeks & keep this issue in mind. I'm in the process of spinning down my full-time job right now so I can focus more on open source, starting early April!

@rhodey
Author

rhodey commented Mar 20, 2024

Hello again and thanks for the quick reply!!

After further review of the past 24 hours we have also found another seemingly related trace:

TypeError: Cannot read properties of null (reading 'handleCommandComplete')
at Client._handleCommandComplete (/app/node_modules/pg/lib/client.js:382:22)
at Connection.emit (node:events:517:28)
at Connection.emit (node:domain:489:12)
at /app/node_modules/pg/lib/connection.js:117:12
at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:40:17)
at Socket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)
at Socket.emit (node:events:517:28)
at Socket.emit (node:domain:489:12)
at addChunk (node:internal/streams/readable:335:12)
at readableAddChunk (node:internal/streams/readable:308:9)

This seems to be pointing to:

this.activeQuery.handleCommandComplete(msg, this.connection)

Another optimistic edit would be to change this:

  _handleCommandComplete(msg) {
    // delegate commandComplete to active query
    this.activeQuery.handleCommandComplete(msg, this.connection)
  }

To this:

  _handleCommandComplete(msg) {
    // delegate commandComplete to active query
    if (this.activeQuery) {
      this.activeQuery.handleCommandComplete(msg, this.connection)
    }
  }

More about our environment:

We are Timescale customers and we use their managed db as a service
The Postgres version of our managed db is 15.6
The Timescale version of our managed db is 2.14.2

In our prod environment we do make use of pgbouncer; we are at version 1.21.0-p2
And we also know that Timescale makes use of "pgproxy" however I cannot say what version they are using
So lots of things in the middle

Additionally we are not using the pg native option

Thanks again!

@rhodey
Author

rhodey commented Mar 20, 2024

Do you know what query is triggering this?

Many of our services use a mix of SELECT, UPDATE, and INSERT; however, I can say that one of the services affected today is 100% SELECT statements and the service is not using prepared statements.

@brianc
Owner

brianc commented Mar 20, 2024

Thanks! The only thing I can think of that's triggering this is that postgres response messages are coming in "out of order" and the client is getting confused. Not sure if pgbouncer or pgproxy is the culprit or something else, but this is how the (extended) protocol works (mostly found here):

client sends -> parse - includes query text
postgres response <- parseComplete
client sends -> bind - includes parameters
postgres responds <- bindComplete
client sends -> describe
postgres responds <- rowDescription - includes name and type of each row going to be returned
client sends -> execute
postgres response <- rowData
postgres response <- rowData
postgres response <- rowData
postgres response <- commandComplete
client sends -> sync
postgres response <- readyForQuery

Only when the pg client receives readyForQuery does it remove its this.activeQuery property from itself. You can put the if (this.activeQuery) in the commandComplete section as well as the other section above, but you won't have it send a sync message, which might result in a stalled connection. I think you could try adding the if check, and then if there is no active query, maybe log out some information and then skip the step.
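
For example, a fork-side sketch of what that could look like for the parse-complete handler (the queryQueue reference is only there for extra logging context and assumes the current client internals):

  _handleParseComplete(msg) {
    if (!this.activeQuery) {
      // unexpected: parseComplete should only arrive mid-query; log some
      // context and skip the step instead of crashing the process
      console.error('pg: parseComplete received with no activeQuery, queued queries:', this.queryQueue.length)
      return
    }
    // if a prepared statement has a name and properly parses
    // we track that its already been executed so we don't parse
    // it again on the same client
    if (this.activeQuery.name) {
      this.connection.parsedStatements[this.activeQuery.name] = this.activeQuery.text
    }
  }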

My hunch (might be wildly wrong) is that one of the proxies or pg-bouncer is getting the postgres response messages out of order or sent to the wrong client? Particularly when you say this:

the service is not using prepared statements.

If a service isn't using prepared statements then it's never even getting into the extended query protocol, and those clients should never receive a parseComplete or bindComplete message. In the case of a query without parameters there are no parse/bind steps at all. 🤔

@brianc
Owner

brianc commented Mar 20, 2024

I think something I could work on with ya at some point is at least making this not a catastrophic, uncatchable error (which is likely crashing the node process) but instead having the client emit an error you can handle. This way you could catch it and then (I would highly recommend) close the client out & throw it away, as it somehow got its internal state machine poisoned with what is likely out-of-order messaging (which is v strange)
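
For what it's worth, a pool-level error handler today looks roughly like the sketch below. It won't catch the TypeError above (that's exactly what the change would fix), but it is the shape the handling would take:

const { Pool } = require('pg')

const pool = new Pool()

// errors raised while a client sits idle in the pool surface here instead of
// crashing the process; errors during an active pool.query() reject the promise
pool.on('error', (err, client) => {
  console.error('idle pooled client errored and will be removed from the pool:', err.message)
})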

@rhodey
Author

rhodey commented Mar 20, 2024

OK I've already forked the repo and I've started down the path of making some edits to the lib which we can deploy to prod

This way you could catch it and then (I would highly recommend) close the client out & throw it away

About this: we use pg.Pool in every service which uses pg. Is it possible to tell the pool to destroy/evict a specific connection?

Thanks!!

@rhodey
Author

rhodey commented Mar 20, 2024

OK, I've modified lib/client.js to check that this.activeQuery is non-null before every access, and I think this code is probably good to deploy to the SELECT-only prod service. About the "stalled connection": we use query_timeout = 30 * 1000 with all our pg.Pool instances, so I would think that worst case the stall lasts for 30 seconds and then triggers the timeout. Sound right?
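
For reference, our pool setup is roughly the sketch below (connection details here are placeholders):

const { Pool } = require('pg')

// query_timeout is passed through to each client; a query whose connection
// stalls should reject with a timeout error after ~30s instead of hanging
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  query_timeout: 30 * 1000,
})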

@charmander
Collaborator

Note that “prepared statements” in the sense of the extended query protocol also includes any query with a parameter.

Like the other times this error has come up in issues: I would definitely not suggest ignoring the problem by checking for activeQuery === null without investigating what’s actually going wrong, because what you’re concerned about:

I am concerned about patching the library and testing it out on our prod stack because I don't want my edit to cause the lib to somehow be confused and make a serious error.

is absolutely a possibility.

  • Are you using pg-cursor/pg-query-stream or any other query objects?
  • Can you show how you’re using the pool? Any manual connection management with connect and release, or is it all pool.query?

@rhodey
Author

rhodey commented Mar 20, 2024

Hey @charmander thanks for writing back!

OK, I was not aware that parameterized queries count towards prepared statements; we do use 99% parameterized queries

We have a support plan with Timescale and I can point them towards this issue to see if maybe they can help server-side

So since we are in the context of the extended query protocol, it does make sense to me that the naive approach of skipping over the this.activeQuery access if it is null (falsey) could lead to some bad results, even if we only trial this change with the SELECT-only service

Like results/rows for one query are returned to the wrong caller

If you have any recommendations for further instrumentation/debugging we can add to our stack to investigate this, we'll do what we can to get it in place

Thanks again!!

@rhodey
Author

rhodey commented Mar 20, 2024

Could it be of any use to trial the pg native feature with our SELECT-only service?

@brianc
Owner

brianc commented Mar 21, 2024

Could it be of any use to trial the pg native feature with our SELECT-only service?

You could try that out - it might not have the problem. It might throw away messages out of order internally or be more lax w/ them in some way? Note: node-libpq doesn't compile on node 20 yet - that's also on my list for next month!

About this: we use pg.Pool in every service which uses pg. Is it possible to tell the pool to destroy/evict a specific connection?

Two ways to do this:

  1. happens automatically if you use pool.query and the client used to internally execute the query throws an error
  2. if you are checking out and checking in clients manually to the pool (for example if you're doing transactions) you can use this. Basically pass true to client.release(true) to destroy the client - a minimal sketch follows below.
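
A minimal sketch of option 2 (the withClient helper is made up for illustration):

const { Pool } = require('pg')

const pool = new Pool()

// hypothetical helper: run some work with a checked-out client and evict the
// client from the pool if the work throws
async function withClient(work) {
  const client = await pool.connect()
  try {
    const result = await work(client)
    client.release() // success: hand the connection back to the pool
    return result
  } catch (err) {
    client.release(true) // error: destroy the client instead of reusing it
    throw err
  }
}

// usage: await withClient((client) => client.query('SELECT 1'))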

OK, I was not aware that parameterized queries count towards prepared statements; we do use 99% parameterized queries

Sorry yeah I should have been more clear - any query which contains parameters goes through the "extended query protocol."

If you have any recommendations for further instrumentation/debugging we can add to our stack to investigate this, we'll do what we can to get it in place

Inside client.js there's a function that attaches a handler to the connection for each message. You can also attach a generic listener for con.on('message') and handle all messages. If you kept a queue of, say, the last 50 messages a client received, and when you ran into the issue outlined here you logged out the last 50 message names (you don't even need the entirety of the message), you would have an ordered list of the last things that came in. It might be useful to see if there is a readyForQuery message getting sent before a parseComplete or commandComplete message. I still have a hunch something somewhere upstream is sending messages out of order.
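
A rough sketch of that instrumentation (client.connection is an internal, undocumented property, so treat this as temporary debugging code only):

const { Pool } = require('pg')

const pool = new Pool()

// remember the last 50 protocol message names seen on a client and dump them
// if that client later errors
function instrument(client) {
  const recent = []
  client.connection.on('message', (msg) => {
    recent.push(msg.name)
    if (recent.length > 50) recent.shift()
  })
  client.on('error', (err) => {
    console.error('client error:', err.message, 'last messages:', recent.join(' -> '))
  })
}

// attach to every new client the pool creates
pool.on('connect', instrument)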

I gotta go to bed - got my eyes dilated today and man alive it sucks to read. I'll be back w/ more during the day tomorrow. Sucks so bad we can't reproduce this in any isolated way - would really love to figure out what this is and fix it.

@rhodey
Author

rhodey commented Mar 21, 2024

Hey @brianc, oh my, yes, I remember the last time I had my eyes dilated and I had to find my way onto the subway afterwards, it was not easy! Last night just before I passed out, something @charmander raised about parameterized queries made me remember the new prepared statement support in pgbouncer.

max_prepared_statements
When this is set to a non-zero value PgBouncer tracks protocol-level named prepared statements related commands sent by the client in transaction and statement pooling mode. PgBouncer makes sure that any statement prepared by a client is available on the backing server connection. Even when the statement was originally prepared on another server connection.

This tracking and rewriting of prepared statement commands does not work for SQL-level prepared statement commands, so PREPARE, EXECUTE and DEALLOCATE are forwarded straight to Postgres. The exception to this rule are the DEALLOCATE ALL and DISCARD ALL commands, these do work as expected and will clear the prepared statements that PgBouncer tracked for the client that sends this command.

https://www.pgbouncer.org/config.html#max_prepared_statements

In prod we are using pgbouncer 1.21.0, so the max_prepared_statements config param is supported, but we have not specified a value for this in our config, and the docs say the default value is zero (disabled). For pgbouncer we use the "transaction" pooling mode, and when it was first recommended to us to use pgbouncer we asked about parameterized queries and it was explained to us that parameterized queries don't count towards prepared statements, but it seems maybe that was not correct.

It seems likely to me from both of your responses that we should probably give max_prepared_statements >= 1 a try. I think we will try = 200 as the docs suggest, and it makes sense knowing what I know about our number of unique queries.
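
For reference, the pgbouncer.ini change we have in mind is roughly the following (other settings omitted):

[pgbouncer]
pool_mode = transaction
; track and rewrite protocol-level prepared statements across server
; connections (the default of 0 leaves this disabled)
max_prepared_statements = 200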

Just a question about this bit

This tracking and rewriting of prepared statement commands does not work for SQL-level prepared statement commands, so PREPARE, EXECUTE and DEALLOCATE are forwarded straight to Postgres.

I think what they are saying is that if the client lib is using the binary protocol to prepare statements then pgbouncer will rewrite and support them, but if the library is using text strings this pgbouncer feature will not work with transaction pooling mode. Do you read this the same way, and can you say if the library is using the binary protocol or text strings?

About rolling this out I am going to spend the day doing some testing locally with max_prepared_statements >= 1 to make sure that existing functionality is not made worse and then if all goes well we can plan to deploy this change to prod after 7 or 8 PM tonight.

Thanks!!

@brianc
Owner

brianc commented Mar 21, 2024

text strings this pgbouncer feature will not work with transaction pooling mode. Do you read this the same way, and can you say if the library is using the binary protocol or text strings?

yeah I am pretty sure what that's saying is if you put a PREPARE bla bla bla within your sql text, pgbouncer won't do anything with it -- which makes sense, because that would require pgbouncer fully parsing every packet, pulling out the sql text, and then parsing the sql text with a postgres-compatible sql parser. Further evidence of that is the binary protocol doesn't have a message for DEALLOCATE.

As far as I know there's not a lot of difference between a prepared statement and a parameterized query. From my understanding every time you execute a parameterized query you technically create a prepared statement without a name so every parameterized query replaces the unnamed prepared statement. If you do this

await client.query({
  text: 'SELECT $1::text',
  values: ['hello world'],
})

you end up telling postgres

1) parse this query text: "select $1::text", remember this statement is called ''
2) bind 'hello world' to parameter 1 
3) tell me what the result is going to look like 
4) give me the result

However, if you do this

await client.query({
  text: 'SELECT $1::text',
  values: ['hello world'],
  name: 'hello world query'
})

you end up telling postgres

1) parse this query text: "select $1::text", remember this statement is called 'hello world query'
2) bind 'hello world' to parameter 1
3) tell me what the result is going to look like
4) give me the result

The next time you execute a query named hello world query on the same client, node-postgres looks up internally whether it has already executed a query named hello world query with the same text; if it has, it sends slightly fewer packets to postgres.

For example this:

await client.query({
  text: 'SELECT $1::text',
  values: ['goodbye for now'],
  name: 'hello world query'
})

1) bind 'goodbye for now' to parameter 1 of the statement called 'hello world query'
2) tell me what the result is going to look like
3) give me the result

Effectively you skip the parse step (which is where the query is planned, for the most part, AFAIK). This saves 1 network roundtrip and perhaps a millisecond or two for query planning. I wouldn't recommend using this all the time, but I've used it in the past when we had a couple of extremely large query texts (like 1000 lines long) that we executed regularly, so we could skip the round-trip of sending 100kb of query text over the network hundreds of times a second.

In my experience it's a micro-optimization that doesn't really do much...but anyways I'm just getting into the weeds here to explain how things work, and to explain why I often conflate the terms "prepared statement" and "parameterized query" - they're very, very similar from a protocol perspective.
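
Putting that together, a self-contained sketch of the named-statement reuse described above (connection settings are assumed to come from the usual PG* environment variables):

const { Client } = require('pg')

async function main() {
  const client = new Client()
  await client.connect()

  // first run: parse + bind + describe + execute; the parsed statement is
  // remembered under the name 'hello world query'
  await client.query({
    name: 'hello world query',
    text: 'SELECT $1::text AS msg',
    values: ['hello world'],
  })

  // same name + same text on the same client: the parse step is skipped
  const { rows } = await client.query({
    name: 'hello world query',
    text: 'SELECT $1::text AS msg',
    values: ['goodbye for now'],
  })
  console.log(rows[0].msg) // goodbye for now

  await client.end()
}

main().catch(console.error)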

@rhodey
Author

rhodey commented Mar 21, 2024

@brianc thanks a lot for this explanation; the way you describe it makes good sense to me.

I spent this morning trying to create a POC which would reproduce the bug locally; however, I have yet to reproduce it. Maybe you have some ideas on what can be changed to increase the chance of reproducing it:

https://github.com/rhodey/pgbouncetest

What the test does

The idea is to create a large number of different parameterized queries and exec them in parallel while connected to pgbouncer with max_prepared_statements disabled. What I think is happening in prod is that ConnectionA is getting prepared statement messages which are meant for a query on ConnectionB. Do you agree?

My laptop is relatively low performance w/ 4 CPUs and I was unable to have any of the children exit with an error. I had a co-worker with a 24-CPU machine run this test for 5 minutes and they also had no child processes crash.

Note that pgbouncer max_db_connections is set to 2, which I believe is what you'd want to maximize the chance of one connection seeing prepared statement messages from another connection.
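
To illustrate, the load pattern the test generates is roughly the sketch below (this is not the actual pgbouncetest code; the query shape and counts are made up):

const { Pool } = require('pg')

// point this at pgbouncer running in transaction pooling mode with
// max_db_connections = 2 and max_prepared_statements left disabled
const pool = new Pool({ connectionString: process.env.PGBOUNCER_URL })

// vary the query text so each one parses as a distinct statement
function runQuery(i) {
  const text = `SELECT $1::int + ${i % 50} AS n`
  return pool.query({ text, values: [i] })
}

async function main() {
  for (;;) {
    // keep many distinct parameterized queries in flight at once, multiplexed
    // by pgbouncer onto only two server connections
    await Promise.all(Array.from({ length: 100 }, (_, i) => runQuery(i)))
  }
}

main().catch((err) => {
  console.error(err)
  process.exit(1)
})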

I would be really happy to reproduce this issue locally and then have the max_prepared_statements = 200 config prevent the problem; then it would be a no-brainer to deploy this to prod, but so far that hasn't happened.

I think if today goes by and we are still unable to reproduce the issue locally before tonight, we may still end up deploying a modified pgbouncer config with max_prepared_statements = 200.

Actually in prod today we have yet to see a single instance of either of these errors turn up, not since last night March 20 22:26 Eastern Time

One thing about this test: based on how the code has been designed, it is unlikely that any of the children get one or more rows returned from the random queries. I did not think it was important to have the queries return data, but perhaps you would know better and I could refactor.

Thanks again!!
