Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use lagoon api for environment checks and rbac #236

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shreddedbacon
Copy link
Member

@shreddedbacon shreddedbacon commented Jun 21, 2023

WIP

This changes the ssh-portal to use Lagoon API for environment checks and RBAC instead of talking to the database directly, and evaluating the users token to see if they have permission to access an environment.

This eliminates the need to include features like BLOCK_DEVELOPER_SSH and leverages the built in API RBAC for evaluating a user has access to an environment. Other benefits of this, are if in the future Lagoon introduces custom roles, or other methods of restricting access, changes are not required to the ssh-portal.

Acknowledging that this does use the Lagoon API, it will add some delay to SSH connections because it is hitting the API instead of direct DB connections, but the recent API authentication changes should mean this delay is minimal, even on large Lagoon cores.

Utilising the machinery library also means if there are API schema changes, we can catch them easier instead of having to worry if the database queries ever change with no easy visibility.

Relies on

@smlx
Copy link
Member

smlx commented Jun 22, 2023

FWIW I'm not a fan of this approach for a couple of reasons:

  • this adds coupling to, and dependency on, the Lagoon API, Lagoon API Redis etc.
  • it calls through the Lagoon API -> Keycloak API -> Keycloak JS engine (where the auth logic actually occurs). Does the Lagoon API need to be in this chain?

I think better approaches to achieve similar outcomes to this PR could be:

  • add the auth logic to the ssh-portal Keycloak client. That way it could directly call Keycloak API -> Keycloak JS engine (auth logic), and the auth policy is still centralised (as it is in this PR).
  • move the lagoondb package into the machinery module so the SQL queries can be maintained in one place.

Is there a particular issue this PR is addressing?

@shreddedbacon
Copy link
Member Author

FWIW I'm not a fan of this approach for a couple of reasons:

  • this adds coupling to, and dependency on, the Lagoon API, Lagoon API Redis etc.
  • it calls through the Lagoon API -> Keycloak API -> Keycloak JS engine (where the auth logic actually occurs). Does the Lagoon API need to be in this chain?

It's tied to the API DB already, why is it this really any different to using the API instead? I get there is more in the dependency chain, but we're relying on both Keycloak and the API DB currently, where the API also relies on both of things being running.

I think better approaches to achieve similar outcomes to this PR could be:

  • add the auth logic to the ssh-portal Keycloak client. That way it could directly call Keycloak API -> Keycloak JS engine (auth logic), and the auth policy is still centralised (as it is in this PR).
  • move the lagoondb package into the machinery module so the SQL queries can be maintained in one place.

The problem with moving the lagoondb package to machinery is that the API handles all the database queries using knex. So we still have to maintain the lagoondb queries in machinery too if the schema changes in knex. I think it is better to let the API do what it needs to do with DB queries, and we use the standard machinery library to interact with the API.

Once we start to introduce custom roles in Lagoon, the JS logic won't work there, so it will go away. Which would mean another change to the ssh-portal to remove that functionality and replace it with how the API works.

Is there a particular issue this PR is addressing?

Yes, making the ssh-portal adhere to the APIs RBAC model instead of using its own.

@smlx
Copy link
Member

smlx commented Jun 22, 2023

FWIW I'm not a fan of this approach for a couple of reasons:

  • this adds coupling to, and dependency on, the Lagoon API, Lagoon API Redis etc.
  • it calls through the Lagoon API -> Keycloak API -> Keycloak JS engine (where the auth logic actually occurs). Does the Lagoon API need to be in this chain?

It's tied to the API DB already, why is it this really any different to using the API instead?

The API and API redis are two separate services which are coupled in this PR, and two more points of failure. Redis is also not HA.

The problem with moving the lagoondb package to machinery is that the API handles all the database queries using knex. So we still have to maintain the lagoondb queries in machinery too if the schema changes in knex. I think it is better to let the API do what it needs to do with DB queries, and we use the standard machinery library to interact with the API.

Ok, fair point. But does this mean that every microservice interaction that wants to pull data from the Lagoon DB has to go via HTTP and GraphQL to the API (i.e. very inefficiently)?

Once we start to introduce custom roles in Lagoon, the JS logic won't work there, so it will go away. Which would mean another change to the ssh-portal to remove that functionality and replace it with how the API works.

Is this logic going to live in the Lagoon API itself? Or another policy service? Maybe this change should wait until custom roles are implemented?

Is there a particular issue this PR is addressing?

Yes, making the ssh-portal adhere to the APIs RBAC model instead of using its own.

Doesn't the RBAC model currently live in Keycloak, rather than the Lagoon API?


I guess one of the implicit features of the ssh-portal was that it avoided touching the Lagoon API, so this feels like a step backwards to me.

Another question: what sort of token will the ssh-portal use for accessing the Lagoon API?

@shreddedbacon
Copy link
Member Author

FWIW I'm not a fan of this approach for a couple of reasons:

  • this adds coupling to, and dependency on, the Lagoon API, Lagoon API Redis etc.
  • it calls through the Lagoon API -> Keycloak API -> Keycloak JS engine (where the auth logic actually occurs). Does the Lagoon API need to be in this chain?

It's tied to the API DB already, why is it this really any different to using the API instead?

The API and API redis are two separate services which are coupled in this PR, and two more points of failure. Redis is also not HA.

Yeah fair on the additional failure points, but I think they are acceptable mainly because of other reasons outlined in replies.

redis is not HA, but we could figure out how to install a HA redis somehow fairly easily I would hope to reduce that problem. But talking about HA, API DB is not HA, Keycloak isn't either. These are also on our hitlist for HA at some stage.

The problem with moving the lagoondb package to machinery is that the API handles all the database queries using knex. So we still have to maintain the lagoondb queries in machinery too if the schema changes in knex. I think it is better to let the API do what it needs to do with DB queries, and we use the standard machinery library to interact with the API.

Ok, fair point. But does this mean that every microservice interaction that wants to pull data from the Lagoon DB has to go via HTTP and GraphQL to the API (i.e. very inefficiently)?

Yes, it would all go via the API, within the confines of the cluster we can use the internal service names though.

I get the point you're making here though, but Lagoon and all of its current services have not been built to have the services handle their own transactions outside of the API. ssh-portal is the only one that does everything outside of the API, which means everyone needs to then understand that the ssh-portal handles this on its own. If Lagoon was build where every microservice handled its own DB and RBAC transactions, then it would be a different story.

If we were to implement custom roles in Lagoon today, then ssh-portal would need to know how to handle these, so we'd basically have to duplicate effort around RBAC in the API and ssh-portal to support them. Where if it was all in the API, the scope is smaller.

Once we start to introduce custom roles in Lagoon, the JS logic won't work there, so it will go away. Which would mean another change to the ssh-portal to remove that functionality and replace it with how the API works.

Is this logic going to live in the Lagoon API itself? Or another policy service? Maybe this change should wait until custom roles are implemented?

I think it would live in the API, as this makes the most sense. Adding a dedicated policy service would introduce a whole host of problems that I don't think we really want to deal with. If the API did it all, everything in this PR would just continue to work without any changes, or minimal ones if at all.

Ideally, I'd like the API to have a query like canUserSSHToEnvironment(user: $id, environment: $envID){ bool }. That way we wouldn't have to do the multiple steps like we do now with:

  • generating a token as the requested user
  • then either
    • validating the roles/groups/project ids
    • executing a query as the user.

And as we start to move away from doing permission checks using the JS logic, this becomes easier to implement.

Is there a particular issue this PR is addressing?

Yes, making the ssh-portal adhere to the APIs RBAC model instead of using its own.

Doesn't the RBAC model currently live in Keycloak, rather than the Lagoon API?

Yes, but nothing leverages it outside of the API anywhere else at this stage, and eventually will be gone when we remove the JS logic checks, so the API still is the place it happens. The only direct keycloak interactions are by users logging in, or auth-server/ssh-portal generating tokens on behalf of a user.

I guess one of the implicit features of the ssh-portal was that it avoided touching the Lagoon API, so this feels like a step backwards to me.

Sure, it is a bit of a step backwards if you look at how other microservices are built. But in terms of how Lagoon is currently structured, ssh-portal breaks the standard model we have. When all existing services use the lagoon API, then introducing one single service that does its own DB queries, and its own RBAC checks, is totally different to everything else we do. It means everyone needs to be absolutely aware that any changes to permissions need to make sure that the ssh-portal isn't using them.

The main problem is the permissions separation that the ssh-portal has, which is solved by the API. But if we're going to solve that with the API, then moving all the direct DB queries to API queries just makes it all share the same system.

Another question: what sort of token will the ssh-portal use for accessing the Lagoon API?

In this PR it generates a 60 second internal admin token using the JWT secret like other services within lagoon-core do. This token does have admin permission, but is has a limited duration, and the queries it uses are in code. We could shorten the duration the token is created with too.

@smlx
Copy link
Member

smlx commented Jun 23, 2023

Ok, thanks for the detailed explanation. I understand the motivation for this much better now.

The main thing I'm still concerned about is the admin token. Is there any way we could scope it down a bit?

@shreddedbacon
Copy link
Member Author

The only thing I'm still concerned about is the admin token. Is there any way we could scope it down a bit?

At this stage admin tokens aren't able to be scoped explicitly, I know this isn't great. Using a short duration and only the queries defined in this code base limits its scope to an extent though.

@smlx
Copy link
Member

smlx commented Jun 23, 2023

This whole architecture is still quite concerning TBH. We are putting internal service APIs (e.g. userBySSHFingerprint) into the public GraphQL Lagoon API interface when they do not need to be publicly facing. Should there instead be a internal "service API" for this kind of stuff?

@shreddedbacon
Copy link
Member Author

This whole architecture is still quite concerning TBH. We are putting internal service APIs (e.g. userBySSHFingerprint) into the public GraphQL Lagoon API interface when they do not need to be publicly facing. Should there instead be a internal "service API" for this kind of stuff?

Noted. But no there is no plan for an internal only API at this stage.

@shreddedbacon
Copy link
Member Author

Mainly that this API is also usable by administrators. Not just internal services.

@smlx
Copy link
Member

smlx commented Jun 23, 2023

But no there is no plan for an internal only API at this stage.

It already exists - this is what ssh-portal-api is right now.

@smlx
Copy link
Member

smlx commented Jun 23, 2023

Would it make sense for the Lagoon API to replace the ssh-portal-api by listening to NATS directly?

https://github.com/nats-io/nats.js

@shreddedbacon
Copy link
Member Author

Would it make sense for the Lagoon API to replace the ssh-portal-api by listening to NATS directly?

https://github.com/nats-io/nats.js

We plan on moving moving to nats instead of rabbitmq, so yes we could do that at some stage.

@smlx
Copy link
Member

smlx commented Jun 30, 2023

When all existing services use the lagoon API, then introducing one single service that does its own DB queries, and its own RBAC checks, is totally different to everything else we do.

I just realized that this is incorrect. The existing SSH service (that ssh-portal is replacing) does its own SQL queries directly against the DB: https://github.com/uselagoon/lagoon/blob/main/services/ssh/etc/libnss-mysql.cfg

@shreddedbacon
Copy link
Member Author

shreddedbacon commented Jun 30, 2023

When all existing services use the lagoon API, then introducing one single service that does its own DB queries, and its own RBAC checks, is totally different to everything else we do.

I just realized that this is incorrect. The existing SSH service (that ssh-portal is replacing) does its own SQL queries directly against the DB: uselagoon/lagoon@main/services/ssh/etc/libnss-mysql.cfg

The service we plan on deprecating? I'll edit my previous statement to be

When all existing services (except the existing ssh service which we plan on deprecating) use the lagoon API, then introducing one extra service (in addition to the existing ssh service which we plan on deprecating) that does its own DB queries, and its own RBAC checks, is totally different to everything else we do (except for the existing ssh service which we plan on deprecating).

@smlx
Copy link
Member

smlx commented Jul 3, 2023

The service we plan on deprecating?

Okay, but in that case this PR trades planned-to-be-deprecated direct DB access for planned-to-be-deprecated admin token usage.

The former has a drawback for Lagoon developers because:

everyone needs to then understand that the ssh-portal handles this on its own.

However the latter has the drawback that it fundamentally weakens the security of the entire Lagoon system, uses a deprecated OAuth2 profile, and has been directly responsible for at least one production outage that I'm aware of.

I agree that having the authz policy in ssh-portal-api is not ideal and needs to change going forward. However I don't think it's a good idea to revert back to admin token usage to do it before a replacement has been designed (let alone implemented).

I see that you just opened uselagoon/lagoon#3479. Maybe it is time to do some detailed technical design for the new "service authn/authz" policy system to ensure that it will meet the needs of organizations / custom roles / service auth? Then we can move this PR forward towards this new system?

@shreddedbacon
Copy link
Member Author

The service we plan on deprecating?

Okay, but in that case this PR trades planned-to-be-deprecated direct DB access for planned-to-be-deprecated admin token usage.

There is no plan to deprecate the admin token usage for internal services anytime soon, and even if we did, this PR still works with a small change to the token fetching method to get one via keycloak.

The former has a drawback for Lagoon developers because:

everyone needs to then understand that the ssh-portal handles this on its own.

Well no, because all the other newer/refactored internal services do this exact same method of generating a token currently. All the existing javascript based services do things a little bit different but still use an admin token in some way.

However the latter has the drawback that it fundamentally weakens the security of the entire Lagoon system, uses a deprecated OAuth2 profile, and has been directly responsible for at least one production outage that I'm aware of.

This is a topic for not in this PR.

I agree that having the authz policy in ssh-portal-api is not ideal and needs to change going forward. However I don't think it's a good idea to revert back to admin token usage to do it before a replacement has been designed (let alone implemented).

This is a draft. Not a "this is what we're doing right now". If I had known you would take this hard a stance against it, I would have delayed the creation of it. I just wanted something up and ready to go if we needed it.

I see that you just opened uselagoon/lagoon#3479. Maybe it is time to do some detailed technical design for the new "service authn/authz" policy system to ensure that it will meet the needs of organizations / custom roles / service auth? Then we can move this PR forward towards this new system?

Yes, I opened that because we've got people asking for custom roles. No matter the solution, the current way that the ssh-portal does stuff is going to change, but closer to using the API rather than handling oauth etc on its own. Whether it uses service auth to generate a token or what, who knows.

As it is, I hate that we have to run this SSH portal with the current BLOCK_DEVELOPER_SSH hack in it. I just have to make this extremely clear that it is only in place because the alternative is this PR which you hate.

@shreddedbacon
Copy link
Member Author

I'm fine with leaving this here until we decide on something better for Lagoon auth, so this ends the discussion for now.

@smlx
Copy link
Member

smlx commented Jul 3, 2023

As it is, I hate that we have to run this SSH portal with the current BLOCK_DEVELOPER_SSH hack in it. I just have to make this extremely clear that it is only in place because the alternative is this PR which you hate.

That's very strong, and not accurate in my case. I'm not tied to the existing way this works (which I agree is totally a hack). I just want to discuss and understand all the tradeoffs that are being made in either case so we can figure out the best way forward.

@shreddedbacon
Copy link
Member Author

As it is, I hate that we have to run this SSH portal with the current BLOCK_DEVELOPER_SSH hack in it. I just have to make this extremely clear that it is only in place because the alternative is this PR which you hate.

That's very strong, and not accurate in my case. I'm not tied to the existing way this works (which I agree is totally a hack). I just want to discuss and understand all the tradeoffs that are being made in either case so we can figure out the best way forward.

Yeah sorry that that came across as quite strong, it wasn't my intention. I meant it in a sense that there is a dislike of both implementation choices here. I want to see ssh-portal rolled out more broadly, but at the same time I don't want it to be this snowflake service that does things differently to the rest of Lagoon when it comes to determining permissions for a user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants