
Use separate PostgreSQL schemas for sprocs #4412

Merged
mwest1066 merged 30 commits into master from sproc-schemas on Jul 7, 2021
Conversation

mwest1066 (Member)

This PR changes sql-db so that it can use a local DB schema for all queries. We use this to store a separate set of sprocs (stored procedures) for each invocation of the DB. The startup sequence (sketched in SQL below) is:

  1. Start with only the default public schema.
  2. Run migrations, which will modify the global public tables.
  3. Set the default schema to a random string (with fallback to public).
  4. Run sproc init, which will create all the sprocs from scratch in the random schema.
  5. For all future SQL calls, use the random schema so that we get our local copy of the sprocs.
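
In rough SQL terms, steps 3-5 look like this (the schema name here is made up; the real names are generated per invocation):

```sql
-- Step 3: create a per-invocation schema and make it the default,
-- keeping public as a fallback so the global tables stay visible.
CREATE SCHEMA sprocs_1234abcd;
SET search_path TO sprocs_1234abcd, public;

-- Step 4: an unqualified CREATE lands in the first search_path entry,
-- so sproc init populates the new schema from scratch.
CREATE FUNCTION my_sproc() RETURNS void AS $$ BEGIN END $$ LANGUAGE plpgsql;

-- Step 5: unqualified calls now resolve to the local copy first.
SELECT my_sproc();
```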

This is intended to avoid problems where one server startup modifies sprocs that are currently in use by another server. Although Postgres has transactional DDL, this can still cause deadlocks via explicit locks; I think this is responsible for #3824.

Note that this PR requires #4411 to work, because the changes are in PrairieLib and so we need to be using it directly to pick up the new code.

@nwalters512 (Contributor) left a comment

This makes sense to me; I left a few comments, but I don't think they're blocking. I can't see any way this could have a negative impact (famous last words, right?).

I don't see anything that specifically instructs Postgres to create sprocs in the random schema. Will Postgres automatically create them in the first element of search_path?

Review comments on: prairielib/lib/sql-db.js, server.js
@mwest1066 (Member, Author)

> I don't see anything that specifically instructs Postgres to create sprocs in the random schema. Will Postgres automatically create them in the first element of search_path?

Yep. This is also why it's important to run migrations (which might create tables) before setting the random schema.

See https://www.postgresql.org/docs/current/ddl-schemas.html#DDL-SCHEMAS-PATH where it says:

> The first schema in the search path that exists is the default location for creating new objects.

I also checked that we really are creating the sprocs in the random schema, as expected.
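
For anyone who wants to verify this locally, a quick check along these lines (names made up) shows unqualified CREATEs landing in the first search_path entry:

```sql
CREATE SCHEMA demo_random;
SET search_path TO demo_random, public;

-- No schema qualifier, so this lands in demo_random.
CREATE FUNCTION demo_func() RETURNS integer AS 'SELECT 1' LANGUAGE sql;

-- Confirm where it was created.
SELECT n.nspname
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE p.proname = 'demo_func';
-- => demo_random
```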

@mwest1066 (Member, Author)

An alternative I considered is to explicitly schema-qualify every object we create. For example, we could write

```sql
CREATE FUNCTION ${schemaName}.variants_insert
```

for all our sprocs to make sure they are created in exactly the schema we want. However, this is pretty invasive, and I believe relying on the implicit "create in the first search_path entry" behavior is safe.

@nwalters512 (Contributor)

Yep, as long as that behavior is documented, I'm good with relying on it! Thanks for the link to the docs.

@mwest1066 (Member, Author)

@nwalters512 Another question is whether we should do any automatic cleanup of either:

  1. The old sprocs in the public schema.
  2. The new schemas that we are creating every time we restart the server.

My current thought is to defer (1) until after we've deployed this PR and are happy with things; we could then add a migration to clean up the public sprocs.

For (2), we can either ignore it for now, relying on the fact that extra schemas lying around don't really hurt anything, or we can add code to automatically delete any old schemas for our current server name. The schemas are named like i_329c8aef93_2021-06-21-SDLFK, where the first part is the EC2 instance ID, so on startup we can delete any schemas that start with our own instance ID; a hypothetical sketch follows. We always run all servers with unique IDs, although it would be possible to override this, which would lead to problems if we auto-delete.
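
Something along these lines (the instance-ID prefix is taken from the example name above):

```sql
-- Hypothetical startup cleanup: drop leftover schemas created by
-- earlier invocations on this same instance.
DO $$
DECLARE
    s text;
BEGIN
    FOR s IN
        SELECT nspname
        FROM pg_namespace
        WHERE nspname LIKE 'i\_329c8aef93\_%'  -- our own instance-ID prefix
          AND nspname <> current_schema        -- keep the schema we just created
    LOOP
        EXECUTE format('DROP SCHEMA %I CASCADE', s);
    END LOOP;
END $$;
```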

Thoughts?

@nwalters512 (Contributor)

Yeah, I'm happy to ignore the ones in the public schema almost indefinitely (though we can of course clean them up in the future).

I believe (2) gets trickier once we throw autoscaling and generally transient servers into the mix: we're not guaranteed to ever restart a server on an instance once the process is killed or dies. I'm inclined to instead use the date somehow, e.g. deleting all schemas whose embedded creation date is more than 30 days old. That would of course cause issues if a server happens to stay alive for more than 30 days. I'll have to think more about this!
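
Since the creation date is embedded in the schema name, an age-based sweep could be as simple as this (a sketch, assuming the naming scheme above):

```sql
-- Find schemas whose embedded creation date is more than 30 days old,
-- e.g. i_329c8aef93_2021-06-21-SDLFK -> 2021-06-21.
SELECT nspname
FROM pg_namespace
WHERE nspname LIKE 'i\_%'
  AND substring(nspname FROM '\d{4}-\d{2}-\d{2}')::date
      < now() - interval '30 days';
```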

@mwest1066 (Member, Author)

> Yeah, I'm happy to ignore the ones in the public schema almost indefinitely (though we can of course clean them up in the future).

Sounds good.

> I believe (2) gets trickier once we throw autoscaling and generally transient servers into the mix: we're not guaranteed to ever restart a server on an instance once the process is killed or dies. I'm inclined to instead use the date somehow, e.g. deleting all schemas whose embedded creation date is more than 30 days old. That would of course cause issues if a server happens to stay alive for more than 30 days. I'll have to think more about this!

That's a good point that autoscaling means instances never come back. However, for autoscaling we will soon need a table like server_loads (which we have for the grader hosts) but for webservers, say webserver_loads, so we can track how many webservers there are and their load levels. Each server updates its row every 10 seconds while it's alive, so it's easy to tell when servers are dead. We already use this for grader hosts to kill them off; the same logic would let us clean up old schemas.
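
A hypothetical shape for that table (names made up, mirroring the existing server_loads):

```sql
-- Hypothetical heartbeat table for webservers, analogous to server_loads.
CREATE TABLE webserver_loads (
    instance_id text PRIMARY KEY,
    schema_name text NOT NULL,                      -- random schema this server created
    heartbeat_at timestamptz NOT NULL DEFAULT now() -- refreshed every 10 seconds
);

-- A reaper could then drop schemas whose owning server has stopped
-- heartbeating, e.g. rows where heartbeat_at < now() - interval '1 minute'.
```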

@mwest1066 (Member, Author)

@nwalters512 I updated a bunch of things and I believe it's ready to go. Mind taking another quick look?

@nwalters512 (Contributor) left a comment

Awesome! Fingers crossed that this makes deploys even more robust 🤞

Review comments on: docs/dev-guide.md, prairielib/lib/sql-db.js, sprocs/array_and_number.sql
mwest1066 and others added 4 commits on June 30, 2021 (co-authored by Nathan Walters <nathan@prairielearn.com>)
@mwest1066 (Member, Author)

Commit 4a966d1 above changes all sprocs from the old pattern:

```sql
DROP FUNCTION IF EXISTS my_func()
CREATE OR REPLACE FUNCTION my_func()
```

to the new pattern:

```sql
CREATE FUNCTION my_func()
```

The reason for this is that the old pattern causes a bug when deploying into an existing multi-server environment. The problem is:

  1. Consider running with an existing prod server and a chunk server, both of which are using the pre-existing code with all sprocs in the public schema.
  2. Deploy this new random-schema code to prod.
  3. The prod server runs code like DROP FUNCTION IF EXISTS my_func(), which deletes the copy of my_func() in the public schema (sketched below).
  4. The prod server creates a new my_func() in its random schema.
  5. The chunk server is still using only the public schema, so when it tries to call my_func() it dies with a SQL error of "cache lookup failed for function".
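
Concretely, steps 3 and 4 play out like this (schema name made up):

```sql
-- The newly deployed prod server has switched its search_path:
SET search_path TO sprocs_1234abcd, public;

-- Name resolution follows search_path, so this finds and drops the only
-- existing my_func(): the copy in public that the chunk server still uses.
DROP FUNCTION IF EXISTS my_func();

-- The replacement is created in sprocs_1234abcd, invisible to servers
-- whose search_path is still just public.
CREATE OR REPLACE FUNCTION my_func() RETURNS void AS $$ BEGIN END $$ LANGUAGE plpgsql;
```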

The same problem happens for sprocs without a DROP FUNCTION, because they all use CREATE OR REPLACE FUNCTION, which is equivalent to a DROP FUNCTION followed by a CREATE FUNCTION. In the case where the existing function is in the public schema and we have search_path = <random_schema>,public, running CREATE OR REPLACE FUNCTION removes the public function and creates a new one in the random schema.

To fix this, we could either:

  1. On server startup, first use just the public schema to run migrations, then set search_path to only the random schema (with no fallback to public) to run all the DROP/CREATE code for sprocs. Finally, set search_path to <random_schema>,public for actual running.
  2. Change all the sprocs to just do plain CREATE with no DROP or OR REPLACE.

I went with (2): even though it's a much more invasive code change, it makes the resulting code much simpler, with a simpler startup procedure and cleaner individual sproc definitions.

@mwest1066 (Member, Author) commented on Jul 1, 2021

In the process of doing 4a966d1 I also discovered that we were using an explicit public schema here:

```sql
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
```

and here:

```sql
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
```

I removed both of these schema references so that we will default to the current random schema.
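
For reference, assuming the usual first/last aggregate pattern, the fixed definition now looks roughly like this (a sketch, not the exact file contents):

```sql
-- Unqualified CREATE, so the function lands in the current (random)
-- schema instead of being pinned to public.
CREATE FUNCTION first_agg(anyelement, anyelement) RETURNS anyelement
    LANGUAGE sql IMMUTABLE STRICT
    AS 'SELECT $1';

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);
```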

I personally don't like these sprocs because their names (first and last) are much too generic and could easily cause confusion or conflicts with built-in functions. I'm leaving them as-is for now, however.

@mwest1066 mwest1066 merged commit d5c37b3 into master Jul 7, 2021
@mwest1066 mwest1066 deleted the sproc-schemas branch July 7, 2021 21:18
@github-actions github-actions bot locked and limited conversation to collaborators Jul 7, 2021