
Issue #21 Fix client_secret special characters #27

Merged
merged 23 commits into master on May 7, 2020

Conversation

stikkireddy
Contributor

client_secret had special characters which were not escaped when creating the management token and platform token. This caused the secret to be treated as invalid when making an API call. This solution swaps the manually constructed API calls for the MSFT autorest adal library for authentication. This should resolve issue #21. I have run the integration test as well as a test with an SP password of:
value = "!@##@%#$^$#&%^*^(^*(&()l}$GWolISz+kyXy"

…ing azure go adal library to fetch management token and platform token. Updated integration test to use new env variables
@stikkireddy stikkireddy self-assigned this May 1, 2020
return dbClient, err
}

func retry(numAttempts int, sleep time.Duration, do func() error) error {
Contributor Author


This retry should help fix #33: above, we retry listing the node types for about 1 minute, which should verify that the APIs are ready for requests. This is a hack, but I am internally trying to identify the root cause of why the APIs are not initialized properly.

@sdebruyn
Contributor

sdebruyn commented May 4, 2020

After testing this a couple of times... this PR fixes #21, but I still get #33:

Error: status 400: err Response from server {"error_code":"INVALID_PARAMETER_VALUE","message":"Delegate unexpected exception during listing node types: com.databricks.backend.manager.util.UnknownWorkerEnvironmentException: Unknown worker environment WorkerEnvId(workerenv-2102767251748222)"}

@lawrencegripper
Contributor

lawrencegripper commented May 5, 2020

I'd expect increasing the retry count would resolve the issue in #33. Just running a quick test now to see how long it usually takes to resolve itself in the previous hacky provider.

Hmm, looks like roughly 30 seconds. The current code is only going to wait about 30 secs, so it's likely just a little too short on some occasions, assuming the delay is a bit variable depending on how long some resource takes to be created.

Transcript started, output file is ./cluster.create.log
Starting create
> sleep 10
CLI Response: Error: b'{"error_code":"BAD_REQUEST","message":"Current organization 8479926402891530 does not have any associated worker environments"}'
PS>TerminatingError(): "Failed to execute Databricks CLI. Error response."
Create failed. Retrying after wait...
05/05/2020 15:34:26
> sleep 10
CLI Response: Error: b'{"error_code":"BAD_REQUEST","message":"Current organization 8479926402891530 does not have any associated worker environments"}'
PS>TerminatingError(): "Failed to execute Databricks CLI. Error response."
Create failed. Retrying after wait...
05/05/2020 15:34:37

> sleep 10
Found ClusterID from Terraform state: 0505-153445-lops296
cluster_id

…ry 10 seconds. This should resolve clusters api not starting properly after workspace creation. currently a hack for now till the near future.
@stikkireddy
Contributor Author

@sdebruyn hey, I just made some changes to use a much longer timeout of 30 minutes; can you please run again and verify that this works?

Unfortunately, I have not gotten much better news regarding the root cause of the APIs not working correctly after workspace creation. At this point I don't know if the problem is caused by the Databricks deployment itself or if the azurerm provider is not waiting long enough.

And I can't seem to reproduce the issue any more with this change made. I was able to run 5/5 times after destroying the environment without hitting that issue.

@codecov-io

codecov-io commented May 5, 2020

Codecov Report

Merging #27 into master will decrease coverage by 0.19%.
The diff coverage is 32.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #27      +/-   ##
==========================================
- Coverage   49.05%   48.85%   -0.20%     
==========================================
  Files          44       44              
  Lines        5853     5872      +19     
==========================================
- Hits         2871     2869       -2     
- Misses       2931     2954      +23     
+ Partials       51       49       -2     
Flag Coverage Δ
#client 89.77% <100.00%> (ø)
#provider 37.32% <31.74%> (-0.20%) ⬇️
Impacted Files Coverage Δ
databricks/resource_databricks_scim_user.go 26.08% <0.00%> (-0.44%) ⬇️
databricks/utils.go 0.00% <ø> (-42.17%) ⬇️
databricks/provider.go 60.00% <16.07%> (-7.97%) ⬇️
databricks/azure_auth.go 21.27% <46.15%> (ø)
client/service/client.go 70.00% <100.00%> (ø)
databricks/resource_databricks_secret_scope.go 40.74% <100.00%> (ø)

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

It worked on the first try now :) databricks_cluster.cluster: Creation complete after 3m35s [id=0506-080854-hums438] Thanks! Awesome work with this provider!

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

I don't know if it's related, but just a heads-up, since moving to this branch, I am getting the following errors when I try to create secret scopes:

Error: status 503: err Response from server {"error_code": "TEMPORARILY_UNAVAILABLE", "message": "Authentication is temporarily unavailable. Please try again later."}

I thought it was a temporary issue, but it's been there for a few days now. This does not happen with the code that is currently on master.

@stikkireddy
Contributor Author

@sdebruyn is this only right after you make a workspace or at all times?

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

@stikkireddy at all times, even hours after creating the workspace and the cluster

@stikkireddy
Contributor Author

Hmm, let me look into that and see if I can recreate this scenario. @sdebruyn, what region are you deploying into?

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

@stikkireddy eastus2

@lawrencegripper
Contributor

lawrencegripper commented May 6, 2020

Would it be possible to merge the auth fixes and continue to look at the root cause of #33 in a separate PR? This would be super useful to unblock us, as the temporary fix, while not ideal, does work.

It would be great to get a longer-term fix in at some point in the future.

… auth. Still using adal based auth but still generating databricks personal access token for SP for 1 hour and using that to provision resources. This is because AAD tokens have issues when dealing with secrets apis.
…der deployment much easier for the acceptance testing.
…igure the azure provider to only fetch sp credentials from env. This makes testing much easier.
…les & roles where azure does not have that capability. Also when creating a scim user, there is a weird behavior that makes tests fail where some times the api accidentally creates entitlements of "allow-cluster-create" even though entitlements are empty. Thus the update was added right after to create a put/update request.
…rallel test because parallel tests were causing issues with consistent outcome.
… added terraform-acc-azure and aws targets. Added readme for integration testing
@stikkireddy
Contributor Author

Hey @sdebruyn, please take a look at the new changes. I have reverted from the AAD token to using the PAT token. I realized there are still some issues in using AAD tokens with the secrets, scopes, and scope ACLs APIs, so I reverted to just using the 1 hr PAT tokens (the tokens within Databricks). I also fixed up the acceptance tests and split the AWS and Azure tests. The acceptance tests were failing for users & groups due to roles not existing on Azure (AWS IAM roles). This should all be fixed now. Please give this a try; the retry functionality in the Azure auth and the revert to using PAT tokens should resolve issues #21 and #33 and also avoid the following for secrets:
Error: status 503: err Response from server {"error_code": "TEMPORARILY_UNAVAILABLE", "message": "Authentication is temporarily unavailable. Please try again later."}

TL;DR:
I ran the existing acceptance tests on AWS and Azure and they seem to be working. Integrating them with Travis is still a WIP. For now, please let me know if this branch resolves #21, #33 & TEMPORARILY_UNAVAILABLE for secrets.

@lawrencegripper this should resolve the issues. At least after the changes I cannot seem to be able to reproduce the problem.
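The AAD-to-PAT exchange described above can be sketched as a call to the Databricks token-create endpoint; the path and header names below follow the public Azure Databricks docs for service-principal auth, but treat them as assumptions rather than the exact code in this PR:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newPATRequest builds the request that trades an AAD service-principal
// token for a short-lived Databricks PAT. Endpoint and header names are
// assumptions based on the public Azure Databricks documentation.
func newPATRequest(workspaceURL, aadToken, mgmtToken, resourceID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]interface{}{
		"lifetime_seconds": 3600, // the 1 hr PAT mentioned above
		"comment":          "terraform-provider-databricks",
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", workspaceURL+"/api/2.0/token/create", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+aadToken)
	req.Header.Set("X-Databricks-Azure-SP-Management-Token", mgmtToken)
	req.Header.Set("X-Databricks-Azure-Workspace-Resource-Id", resourceID)
	return req, nil
}

func main() {
	req, err := newPATRequest("https://eastus2.azuredatabricks.net", "aad-token", "mgmt-token", "workspace-resource-id")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The returned PAT is then used as the bearer token for the secrets, scopes, and scope ACLs APIs, side-stepping the AAD-token issues mentioned above.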

@sdebruyn
Contributor

sdebruyn commented May 7, 2020

Now all of the following succeeds:

  • cluster
  • token
  • secret scopes
  • secrets

But then it fails when creating the mounts.

Logs at https://gist.github.com/sdebruyn/e86699686b120b444b2d0be4dd10c922

@sdebruyn
Contributor

sdebruyn commented May 7, 2020

I think this might not be related to this PR, but to my service principal instead. I'll look into it further in the coming hours.

@stikkireddy
Contributor Author

That seems to be an auth-level issue; you may need to make your SP a Storage Blob Data Owner. Hmm, not sure what it is at the moment. But I will merge this, as your mount failure seems to be an auth issue (403), and @lawrencegripper is going to work on better logging for the mounts and help Terraform propagate the error a bit better.

@stikkireddy
Contributor Author

This closes #33 and #21.

@stikkireddy stikkireddy merged commit 6bce373 into master May 7, 2020
@sdebruyn
Contributor

sdebruyn commented May 7, 2020

Correct, one of the permissions was missing :)

@stikkireddy stikkireddy deleted the issue-21 branch May 7, 2020 10:17
@nfx nfx added the azure Occurring on Azure cloud label Feb 23, 2022
@mgyucht mgyucht mentioned this pull request Oct 25, 2023