
Issue #21 Fix client_secret special characters #27

Merged
merged 23 commits into master on May 7, 2020

Conversation

stikkireddy
Contributor

client_secret had special characters which were not escaped when creating the management token and platform token. This caused the secret to be treated as invalid when making an API call. This solution swaps the manually constructed API calls for the MSFT autorest adal library for authentication. This should resolve issue #21. I have run the integration test as well as a test with an SP password of:
value = "!@##@%#$^$#&%^*^(^*(&()l}$GWolISz+kyXy"

…ing azure go adal library to fetch management token and platform token. Updated integration test to use new env variables
@stikkireddy stikkireddy self-assigned this May 1, 2020
return dbClient, err
}

func retry(numAttempts int, sleep time.Duration, do func() error) error {
Contributor Author


This retry should help fix #33: above, we retry listing the node types for about 1 minute, which should verify that the APIs are ready for requests. This is a hack, but I am internally trying to identify the root cause of why the APIs are not initialized properly.

@sdebruyn
Contributor

sdebruyn commented May 4, 2020

After testing this a couple of times... this PR fixes #21, but I still get #33:

Error: status 400: err Response from server {"error_code":"INVALID_PARAMETER_VALUE","message":"Delegate unexpected exception during listing node types: com.databricks.backend.manager.util.UnknownWorkerEnvironmentException: Unknown worker environment WorkerEnvId(workerenv-2102767251748222)"}

@lawrencegripper
Contributor

lawrencegripper commented May 5, 2020

I'd expect increasing the retry count would resolve the issue in #33. Just running a quick test now to see how long it usually takes to resolve itself in the previous hacky provider.

Hmm, looks like roughly 30 seconds. The current code is only going to wait about 30 secs, so it's likely just a little too short on some occasions, assuming the delay is a bit variable depending on how long some resource takes to be created.

Transcript started, output file is ./cluster.create.log
Starting create
> sleep 10
CLI Response: Error: b'{"error_code":"BAD_REQUEST","message":"Current organization 8479926402891530 does not have any associated worker environments"}'
PS>TerminatingError(): "Failed to execute Databricks CLI. Error response."
Create failed. Retrying after wait...
05/05/2020 15:34:26
> sleep 10
CLI Response: Error: b'{"error_code":"BAD_REQUEST","message":"Current organization 8479926402891530 does not have any associated worker environments"}'
PS>TerminatingError(): "Failed to execute Databricks CLI. Error response."
Create failed. Retrying after wait...
05/05/2020 15:34:37

> sleep 10
Found ClusterID from Terraform state: 0505-153445-lops296
cluster_id

…ry 10 seconds. This should resolve clusters api not starting properly after workspace creation. currently a hack for now till the near future.
@stikkireddy
Contributor Author

@sdebruyn hey, I just made some changes to use a much longer timeout of 30 minutes; can you please run again and verify that this works?

Unfortunately, I have not gotten much better news regarding the root cause of the APIs not working correctly after workspace creation. At this point I don't know if the problem is caused by the Databricks deployment itself or if the azurerm provider is not waiting long enough.

And I can't seem to reproduce the issue any more with this change made. I was able to run 5/5 times after destroying the environment without hitting that issue.

@codecov-io

codecov-io commented May 5, 2020

Codecov Report

Merging #27 into master will decrease coverage by 0.19%.
The diff coverage is 32.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #27      +/-   ##
==========================================
- Coverage   49.05%   48.85%   -0.20%     
==========================================
  Files          44       44              
  Lines        5853     5872      +19     
==========================================
- Hits         2871     2869       -2     
- Misses       2931     2954      +23     
+ Partials       51       49       -2     
Flag Coverage Δ
#client 89.77% <100.00%> (ø)
#provider 37.32% <31.74%> (-0.20%) ⬇️
Impacted Files Coverage Δ
databricks/resource_databricks_scim_user.go 26.08% <0.00%> (-0.44%) ⬇️
databricks/utils.go 0.00% <ø> (-42.17%) ⬇️
databricks/provider.go 60.00% <16.07%> (-7.97%) ⬇️
databricks/azure_auth.go 21.27% <46.15%> (ø)
client/service/client.go 70.00% <100.00%> (ø)
databricks/resource_databricks_secret_scope.go 40.74% <100.00%> (ø)

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

It worked on the first try now :) databricks_cluster.cluster: Creation complete after 3m35s [id=0506-080854-hums438] Thanks! Awesome work with this provider!

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

I don't know if it's related, but just a heads-up, since moving to this branch, I am getting the following errors when I try to create secret scopes:

Error: status 503: err Response from server {"error_code": "TEMPORARILY_UNAVAILABLE", "message": "Authentication is temporarily unavailable. Please try again later."}

I thought it was a temporary issue, but it's been there for a few days now. This does not happen with the code that is currently on master.

@stikkireddy
Contributor Author

@sdebruyn is this only right after you make a workspace or at all times?

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

@stikkireddy at all times, even hours after creating the workspace and the cluster

@stikkireddy
Contributor Author

Hmm, let me look into that and see if I can recreate this scenario. @sdebruyn, what region are you deploying into?

@sdebruyn
Contributor

sdebruyn commented May 6, 2020

@stikkireddy eastus2

@lawrencegripper
Contributor

lawrencegripper commented May 6, 2020

Would it be possible to merge the auth fixes and continue to look at the root cause of #33 in a separate PR? This would be super useful to unblock us, as the temporary fix, while not ideal, does work.

It would be great to get a longer-term fix in at some point in the future.

… auth. Still using adal based auth but still generating databricks personal access token for SP for 1 hour and using that to provision resources. This is because AAD tokens have issues when dealing with secrets apis.
…der deployment much easier for the acceptance testing.
…igure the azure provider to only fetch sp credentials from env. This makes testing much easier.
…les & roles where azure does not have that capability. Also when creating a scim user, there is a weird behavior that makes tests fail where some times the api accidentally creates entitlements of "allow-cluster-create" even though entitlements are empty. Thus the update was added right after to create a put/update request.
…rallel test because parallel tests were causing issues with consistent outcome.
… added terraform-acc-azure and aws targets. Added readme for integration testing
@stikkireddy
Contributor Author

Hey @sdebruyn, please take a look at the new changes. I have reverted from the AAD token to using the PAT token. I realized there are still some issues in using AAD tokens with the secrets, scopes, and scope ACLs APIs, so I reverted to just using the 1 hr PAT tokens (the tokens within Databricks). I also fixed up the acceptance tests and split the AWS and Azure tests. The acceptance tests were failing for users & groups due to roles not existing on Azure (AWS IAM roles). This should all be fixed now. Please give this a try; the retry functionality in the Azure auth and the revert to using PAT tokens should resolve issues #21 and #33 and also avoid the following for secrets:
Error: status 503: err Response from server {"error_code": "TEMPORARILY_UNAVAILABLE", "message": "Authentication is temporarily unavailable. Please try again later."}

TL;DR:
I ran the existing acceptance tests on AWS and Azure and they seem to be working. Integrating them with Travis is still a WIP. For now, please let me know if this branch resolves #21, #33 & TEMPORARILY_UNAVAILABLE for secrets.

@lawrencegripper this should resolve the issues. At least after the changes I cannot seem to be able to reproduce the problem.
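The AAD-to-PAT exchange described above can be sketched as a call to the Databricks token-create endpoint; the path and header names below follow the public Azure Databricks docs for service-principal auth, but treat them as assumptions rather than the exact code in this PR:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newPATRequest builds the request that trades an AAD service-principal
// token for a short-lived Databricks PAT. Endpoint and header names are
// assumptions based on the public Azure Databricks documentation.
func newPATRequest(workspaceURL, aadToken, mgmtToken, resourceID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]interface{}{
		"lifetime_seconds": 3600, // the 1 hr PAT mentioned above
		"comment":          "terraform-provider-databricks",
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", workspaceURL+"/api/2.0/token/create", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+aadToken)
	req.Header.Set("X-Databricks-Azure-SP-Management-Token", mgmtToken)
	req.Header.Set("X-Databricks-Azure-Workspace-Resource-Id", resourceID)
	return req, nil
}

func main() {
	req, err := newPATRequest("https://eastus2.azuredatabricks.net", "aad-token", "mgmt-token", "workspace-resource-id")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The returned PAT is then used as the bearer token for the secrets, scopes, and scope ACLs APIs, side-stepping the AAD-token issues mentioned above.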

@sdebruyn
Contributor

sdebruyn commented May 7, 2020

Now all of the following succeeds:

  • cluster
  • token
  • secret scopes
  • secrets

But then it fails when creating the mounts.

Logs at https://gist.github.com/sdebruyn/e86699686b120b444b2d0be4dd10c922

@sdebruyn
Contributor

sdebruyn commented May 7, 2020

I think this might not be related to this PR, but to my service principal instead. I'll look into it further in the coming hours.

@stikkireddy
Contributor Author

That seems to be an auth-level issue; you may need to make your SP a Storage Blob Data Owner. Hmm, not sure what it is at the moment. But I will merge this, as your mount failure seems to be an auth issue (403), and @lawrencegripper is going to work on better logging for the mounts and help Terraform propagate the error a bit better.

@stikkireddy
Contributor Author

This closes #33 and #21.

@stikkireddy stikkireddy merged commit 6bce373 into master May 7, 2020
@sdebruyn
Contributor

sdebruyn commented May 7, 2020

Correct, one of the permissions was missing :)

@stikkireddy stikkireddy deleted the issue-21 branch May 7, 2020 10:17
@nfx nfx added the azure Occurring on Azure cloud label Feb 23, 2022
@mgyucht mgyucht mentioned this pull request Oct 25, 2023