[ISSUE] UnknownWorkerEnvironmentException when creating a cluster after creating a workspace #33
Comments
@sdebruyn this comment https://github.com/databrickslabs/databricks-terraform/pull/27/files#r419517843 should contain information about how to temporarily address this. This ties back to issue #21.
We're also seeing this in recent tests. I can validate that a retry like the one added in #27 should resolve the issue, as I did something similar in a hacked shell provider doing cluster creation here.
@lawrencegripper so does retrying the API fix the issue? If so, what should the timeout be: a fixed value (like 20-30 minutes) or an infinite retry? I was thinking that extending the retry logic to a longer timeout should work.
Yeah, retrying does fix the issue for me in all the testing I've done. Looks like the fix you put in on the PR should do the trick. I prefer your 30-minute timeout to my infinite loop: if things haven't worked after 30 minutes they're unlikely to ever work! Nice work on the workaround, I'll try and give this a test tomorrow.
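For illustration, here's a minimal sketch of the kind of bounded retry being discussed: a fixed overall timeout rather than an infinite loop. The `createCluster` closure is a placeholder for the real provider call, not the provider's actual API, and the short durations in `main` are only there so the demo finishes quickly (the provider would use something like 30 minutes).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithTimeout keeps calling op until it succeeds or the overall timeout
// elapses. A bounded window avoids an infinite loop: if the workspace still
// isn't serving API calls after e.g. 30 minutes, fail the apply instead.
func retryWithTimeout(timeout, interval time.Duration, op func() error) error {
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("gave up after %s: %w", timeout, lastErr)
}

func main() {
	// Placeholder for the real cluster-create call; it always fails here
	// purely to exercise the retry path.
	createCluster := func() error { return errors.New("UnknownWorkerEnvironmentException") }
	if err := retryWithTimeout(2*time.Second, 500*time.Millisecond, createCluster); err != nil {
		fmt.Println(err)
	}
}
```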
I just now got this issue again; the timeout might not be long enough.
Yeah, I'm also seeing this error today; going to take a look. Looks like the cluster create needs retry logic too.
So my working theory here is that the workspace cluster API can go through two states, both related to this bug but each returning a different error message. At some point the workspace moves from returning the first error to returning the second. The current retry logic handles the first of these but not the second. We can either extend the current retry logic to cover both errors, or add retry logic to the individual calls (such as cluster create) as well.
Just having more of a play now to try to repro this and prove the different error messages, but it's slow going as they're super intermittent for me. Think I'm going to write a script which spins up workspaces in parallel and polls the clusters endpoint straight after creation to log the results, stripping all the rest of the stuff out of the mix.
So here are my experiment results. I created a script which creates a workspace and then polls it (well, it does 5 at the same time, just to compare). Here is an example: when polling I don't see the other error form at all. My working theory is that those errors are returned from a different call type and never come back from the endpoint the script is polling. I'm going to rerun the script with that other call included to check.
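The actual script isn't reproduced above, but as a rough sketch of the approach described (poll the clusters list endpoint right after workspace creation and log every response with a timestamp), something along these lines would do. The environment variables and the poll duration are assumptions for illustration, not values from the original script.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hypothetical inputs: the real script reads these from the freshly
	// created workspace; here they are taken from the environment.
	workspaceURL := os.Getenv("DATABRICKS_HOST") // e.g. the workspace's base URL
	token := os.Getenv("DATABRICKS_TOKEN")

	for i := 0; i < 120; i++ { // poll roughly every 5 seconds for ~10 minutes
		req, err := http.NewRequest("GET", workspaceURL+"/api/2.0/clusters/list", nil)
		if err != nil {
			fmt.Println("bad request:", err)
			return
		}
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Printf("%s request error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			// Log status and body so intermittent failures show up over time.
			fmt.Printf("%s status=%d body=%s\n", time.Now().Format(time.RFC3339), resp.StatusCode, body)
		}
		time.Sleep(5 * time.Second)
	}
}
```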
No difference with that either.
So I've got an answer: shortly after creating a workspace, while it's listed as successfully provisioned, API calls against it can still fail. The period of time is not continuous: at points an error will occur, appear to resolve, then reoccur again. My guess would be that some form of load balancer is sending the requests to various nodes in a web farm/cluster, some of which have an updated view of the new workspace and some of which don't. As well as this, different endpoints will independently error with different messages.
I've updated the script here to better track this; an example output is here, tracking the time and the response to the various calls. As the current workaround only checks one of these calls, it can pass while the others are still failing.
Proposed fix: it's not pretty, but we could keep the current workaround that polls the workspace after creation and extend its timeout. Additionally, to protect against the errors on calls made shortly afterwards (like cluster create), we could add retry logic to those calls too; see the sketch below. What do you think? Happy to look at picking these changes up and testing them out. With the work we have in #51 to automate full environment creation (workspace and all), it's easier to reproduce.
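To make the proposal concrete, here's an illustrative sketch (not the provider's actual code): calls such as cluster create get wrapped so that only the known transient error is retried, while genuine validation errors fail fast. The function names, matched substrings, and wait interval are assumptions based on the errors discussed in this thread.

```go
package provider

import (
	"fmt"
	"strings"
	"time"
)

// isTransientWorkspaceError reports whether an API error looks like the
// intermittent post-provisioning failure rather than a genuine bad request.
// The matched substrings come from the error messages seen in this issue.
func isTransientWorkspaceError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "Unknown worker environment") ||
		strings.Contains(msg, "UnknownWorkerEnvironmentException")
}

// createClusterWithRetry wraps a cluster-create call (passed in as a closure)
// so that transient errors are retried while other errors return immediately.
func createClusterWithRetry(create func() error, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := create()
		if err == nil || !isTransientWorkspaceError(err) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("workspace still not ready after %s: %w", timeout, err)
		}
		time.Sleep(15 * time.Second)
	}
}
```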
Since you suggested that it might somehow be a load balancer that redirects to different instances, maybe #34 might be helpful since then it would immediately go to the right instance?
Interesting, will adapt the test script and re-run to see if using the direct workspace URL makes a difference. Small world moment: the issue in AzureRM about adding an attribute to allow Azure Terraform to output the workspace URL is one I raised and was hoping to go fix, hashicorp/terraform-provider-azurerm#6732. It's currently blocked as it requires an SDK change to come in, but after that it's hopefully good to go; not sure on the timelines though.
@sdebruyn sadly using the direct workspace URL doesn't make a difference. Below you can see the highlighted lines which have succeeded in calling the API, only for subsequent calls to fail again. Given this, what do you think of the proposed fix?
One additional test I ran quickly was to validate what error code is being returned by each error. In both instances the error is a 400 Bad Request. Here is the full response in both scenarios:
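For reference, the error payload quoted later in this thread is a small JSON object (sometimes with extra text such as "Succeeded" before it), so a decoder along these lines could be used to pull out `error_code` and `message` when classifying failures. This is a sketch, not code from the provider; stripping everything before the first `{` is an assumption based on the example quoted below.

```go
package provider

import (
	"bytes"
	"encoding/json"
)

// apiError mirrors the JSON error payload observed in both scenarios, e.g.
// {"error_code":"INVALID_PARAMETER_VALUE","message":"Unknown worker environment ..."}
type apiError struct {
	ErrorCode string `json:"error_code"`
	Message   string `json:"message"`
}

// parseAPIError decodes the body of a failed response, tolerating any text
// that appears before the JSON object itself.
func parseAPIError(body []byte) (apiError, error) {
	if i := bytes.IndexByte(body, '{'); i >= 0 {
		body = body[i:]
	}
	var e apiError
	err := json.Unmarshal(body, &e)
	return e, err
}
```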
Having thought on this more overnight, I think the best option is to move the retry logic into the shared API client so that every call benefits from it. Sound like a good plan?
I agree with you. Meanwhile I'm hoping that someone from the Azure team picks this up and updates the provisioning state accordingly. That would avoid all these workarounds.
As a workaround in my Python API client library, I've implemented this:

Usage with a newly provisioned workspace: when the first user logs in to a new Databricks workspace, workspace provisioning is triggered, and the API is not available until that job has completed (that usually takes under a minute, but could take longer depending on the network configuration). In that case you would get an error such as the following when calling the API:

Succeeded{"error_code":"INVALID_PARAMETER_VALUE","message":"Unknown worker environment WorkerEnvId(workerenv-4312344789891641)"}
@stikkireddy are you ok to re-open this issue?
Hi @algattik - thanks for your comment. In the discussion above, @lawrencegripper has put the results from his testing. Unfortunately, we're seeing that we can get a successful call to the API followed by a failing call :-(
@stuartleeks hey, reopened this issue. @lawrencegripper it might be worth using the retryablehttp client in the provider's client, but this will be more complicated to retry as the API throws a standard 400 Bad Request instead of a server-side 5xx error.
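As a sketch of how go-retryablehttp could be wired up for this case: the default retry policy already covers network errors and 5xx responses, and a custom CheckRetry additionally treats a 400 whose body contains the "Unknown worker environment" message as transient. The retry count and matched substring here are assumptions, not the final SDK implementation.

```go
package dbclient

import (
	"bytes"
	"context"
	"io"
	"net/http"
	"strings"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// newRetryingClient builds a retryablehttp client whose retry policy also
// treats the intermittent post-provisioning 400 as retryable.
func newRetryingClient() *retryablehttp.Client {
	c := retryablehttp.NewClient()
	c.RetryMax = 20
	c.CheckRetry = func(ctx context.Context, resp *http.Response, err error) (bool, error) {
		// Defer to the default policy first (connection errors, 429, 5xx, ...).
		if retry, e := retryablehttp.DefaultRetryPolicy(ctx, resp, err); retry || e != nil {
			return retry, e
		}
		// The transient failure comes back as a plain 400, so inspect the
		// body for the known message and restore it for the caller afterwards.
		if resp != nil && resp.StatusCode == http.StatusBadRequest {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			resp.Body = io.NopCloser(bytes.NewReader(body))
			if readErr == nil && strings.Contains(string(body), "Unknown worker environment") {
				return true, nil
			}
		}
		return false, nil
	}
	return c
}
```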
I think retryablehttp will fit in quite well. Have been looking at that this afternoon. Need to do some tidying up and then try it out 😃
Closing this, as @stuartleeks was kind enough to implement the retryablehttp client as the base client implementation for the Databricks Go SDK. It has a set of transient errors that it retries upon seeing, and any future issue like this with a new error message can be added to that list of errors to retry on.
Terraform Version
Affected Resource(s)
Terraform Configuration Files
Same one as in #21
Debug Output
Expected Behavior
After creating the workspace, we should be able to create the cluster during the same apply run.
Actual Behavior
When you create a workspace and Terraform immediately goes on to create a cluster, you get the mentioned exception. It works when you apply a second time after a few seconds.
Steps to Reproduce
terraform apply
References
This third-party Databricks provider has the same issue.