refactor: use lagoon api for environment checks and rbac #236
Conversation
FWIW I'm not a fan of this approach for a couple of reasons:
I think better approaches to achieve similar outcomes to this PR could be:
Is there a particular issue this PR is addressing?
It's tied to the API DB already, so why is this really any different from using the API instead? I get that there is more in the dependency chain, but we're relying on both Keycloak and the API DB currently, and the API also relies on both of those things being running.
The problem with moving the lagoondb package to machinery is that the API handles all the database queries using knex. So we still have to maintain the lagoondb queries in machinery too if the schema changes in knex. I think it is better to let the API do what it needs to do with DB queries, and we use the standard machinery library to interact with the API. Once we start to introduce custom roles in Lagoon, the JS logic won't work there, so it will go away. Which would mean another change to the ssh-portal to remove that functionality and replace it with how the API works.
Yes, making the ssh-portal adhere to the API's RBAC model instead of using its own.
The API and API redis are two separate services which are coupled in this PR, and two more points of failure. Redis is also not HA.
Ok, fair point. But does this mean that every microservice interaction that wants to pull data from the Lagoon DB has to go via HTTP and GraphQL to the API (i.e. very inefficiently)?
Is this logic going to live in the Lagoon API itself? Or another policy service? Maybe this change should wait until custom roles are implemented?
Doesn't the RBAC model currently live in Keycloak, rather than the Lagoon API? Another question: what sort of token will the ssh-portal use to query the API?
Yeah, fair on the additional failure points, but I think they are acceptable, mainly for the reasons outlined in my other replies. Redis is not HA, but I would hope we could fairly easily figure out how to install an HA redis to reduce that problem. And talking about HA, the API DB is not HA, and Keycloak isn't either. These are also on our hitlist for HA at some stage.
Yes, it would all go via the API, though within the confines of the cluster we can use the internal service names. I get the point you're making here, but Lagoon and all of its current services have not been built to have services handle their own transactions outside of the API. ssh-portal is the only one that does everything outside of the API, which means everyone needs to understand that the ssh-portal handles this on its own. If Lagoon were built so that every microservice handled its own DB and RBAC transactions, then it would be a different story. If we were to implement custom roles in Lagoon today, ssh-portal would need to know how to handle them, so we'd basically have to duplicate RBAC effort in both the API and ssh-portal to support them. Whereas if it was all in the API, the scope is smaller.
I think it would live in the API, as this makes the most sense. Adding a dedicated policy service would introduce a whole host of problems that I don't think we really want to deal with. If the API did it all, everything in this PR would just continue to work without any changes, or minimal ones if at all. Ideally, I'd like the API to have a query like
And as we start to move away from doing permission checks using the JS logic, this becomes easier to implement.
Yes, but nothing leverages it outside of the API anywhere else at this stage, and eventually will be gone when we remove the JS logic checks, so the API still is the place it happens. The only direct keycloak interactions are by users logging in, or auth-server/ssh-portal generating tokens on behalf of a user.
Sure, it is a bit of a step backwards if you look at how other microservices are built. But in terms of how Lagoon is currently structured, ssh-portal breaks the standard model we have. When all existing services use the lagoon API, then introducing one single service that does its own DB queries, and its own RBAC checks, is totally different to everything else we do. It means everyone needs to be absolutely aware that any changes to permissions need to make sure that the ssh-portal isn't using them. The main problem is the permissions separation that the ssh-portal has, which is solved by the API. But if we're going to solve that with the API, then moving all the direct DB queries to API queries just makes it all share the same system.
In this PR it generates a 60 second internal admin token using the JWT secret, like other services within lagoon-core do. This token does have admin permission, but it has a limited duration, and the queries it uses are defined in code. We could shorten the duration the token is created with too.
Ok, thanks for the detailed explanation. I understand the motivation for this much better now. The main thing I'm still concerned about is the admin token. Is there any way we could scope it down a bit?
At this stage admin tokens aren't able to be scoped explicitly; I know this isn't great. Using a short duration, with only the queries defined in this code base, limits its scope to an extent though.
This whole architecture is still quite concerning TBH. We are putting internal service APIs (e.g.
Noted. But no, there is no plan for an internal-only API at this stage.
Mainly that this API is also usable by administrators, not just internal services.
It already exists - this is what
Would it make sense for the Lagoon API to replace the
We plan on moving to nats instead of rabbitmq, so yes, we could do that at some stage.
I just realized that this is incorrect. The existing SSH service (that ssh-portal is replacing) does its own SQL queries directly against the DB: https://github.com/uselagoon/lagoon/blob/main/services/ssh/etc/libnss-mysql.cfg
The service we plan on deprecating? I'll edit my previous statement to be
Okay, but in that case this PR trades planned-to-be-deprecated direct DB access for planned-to-be-deprecated admin token usage. The former has a drawback for Lagoon developers because:
However the latter has the drawback that it fundamentally weakens the security of the entire Lagoon system, uses a deprecated OAuth2 profile, and has been directly responsible for at least one production outage that I'm aware of. I agree that having the authz policy in ssh-portal-api is not ideal and needs to change going forward. However I don't think it's a good idea to revert back to admin token usage to do it before a replacement has been designed (let alone implemented). I see that you just opened uselagoon/lagoon#3479. Maybe it is time to do some detailed technical design for the new "service authn/authz" policy system to ensure that it will meet the needs of organizations / custom roles / service auth? Then we can move this PR forward towards this new system?
There is no plan to deprecate admin token usage for internal services anytime soon, and even if we did, this PR would still work with a small change to the token fetching method to get one via Keycloak.
Well no, because all the other newer/refactored internal services currently use this exact same method of generating a token. All the existing JavaScript-based services do things a little differently, but still use an admin token in some way.
This is a topic for another discussion, not this PR.
This is a draft, not a "this is what we're doing right now". If I had known you would take such a hard stance against it, I would have delayed creating it. I just wanted something up and ready to go if we needed it.
Yes, I opened that because we've got people asking for custom roles. No matter the solution, the current way the ssh-portal does things is going to change, moving closer to using the API rather than handling oauth etc. on its own. Whether it uses service auth to generate a token or something else, who knows. As it is, I hate that we have to run this SSH portal with the current BLOCK_DEVELOPER_SSH hack in it. I just have to make it extremely clear that it is only in place because the alternative is this PR, which you hate.
I'm fine with leaving this here until we decide on something better for Lagoon auth, so this ends the discussion for now.
That's very strong, and not accurate in my case. I'm not tied to the existing way this works (which I agree is totally a hack). I just want to discuss and understand all the tradeoffs that are being made in either case so we can figure out the best way forward.
Yeah, sorry that came across as quite strong; it wasn't my intention. I meant it in the sense that there is a dislike of both implementation choices here. I want to see ssh-portal rolled out more broadly, but at the same time I don't want it to be this snowflake service that does things differently to the rest of Lagoon when it comes to determining permissions for a user.
WIP
This changes the ssh-portal to use the Lagoon API for environment checks and RBAC, instead of talking to the database directly, by evaluating the user's token to see if they have permission to access an environment.
This eliminates the need for features like BLOCK_DEVELOPER_SSH, and leverages the built-in API RBAC for evaluating whether a user has access to an environment. Another benefit is that if Lagoon introduces custom roles, or other methods of restricting access, in the future, no changes are required to the ssh-portal.
Acknowledging that this does use the Lagoon API: it will add some delay to SSH connections because it is hitting the API instead of making direct DB connections, but the recent API authentication changes should mean this delay is minimal, even on large Lagoon cores.
Utilising the machinery library also means that if there are API schema changes, we can catch them more easily, instead of having to worry about whether the database queries ever change without easy visibility.
Relies on