External Data Source With Multiple Parameters #27

Open
luckerby opened this issue Jan 15, 2019 · 8 comments

@luckerby

Hi there,

I'm running Terraform v0.11.11 and calling an external tool that takes 3 parameters in the following order: a regular expression and 2 strings. Since the regex being passed can be quite complex, I need to have it enclosed in double quotes. The trouble is that when Terraform invokes this tool, the quoting gets mangled, as follows.

Using cmd.exe:
image

The tool is invoked, however the escape characters are sent as-is, while an extra double quote is added at the very end for some unknown reason:
image

Using Powershell:
image

The tool is invoked, however no quotes are placed around the first parameter as I’d want. There’s just one quote, but that is the one the whole command line starts with; since those double quotes enclose the tool’s filename, it’s not a problem. But the regex itself is not surrounded by anything:
image

I believe there are 2 things I need to get right – the string as entered in config.tf needs to be syntactically correct from terraform’s standpoint, then it also needs to be parsed appropriately by the shell I’m invoking. I’m somehow missing one or both.

@apparentlymart
Member

Hi @luckerby,

Unfortunately on Windows systems the handling of command line arguments can get pretty complex and hard to predict. 😖

The root problem is that on Windows the command line of a process is just a single string, rather than an array of strings as it is on Unix. This means that any additional layer the command line passes through will define its own tokenization which can affect what is seen by the layer afterwards.

If I'm understanding correctly, your command line is being processed by the following layers:

  • Terraform passes the given array on to Go's os/exec package, which then passes it on verbatim to os.StartProcess.
  • Inside the Go runtime, the arguments are all quoted using EscapeArg, which in your case I think would produce the string cmd.exe "/c ExcelUpdateTool.exe \"regex\" machine_name available" .
  • The Go runtime passes this to the Windows API CreateProcess, which accepts that whole command line as a single string in the cmdline argument.
  • The whole string is then passed verbatim to cmd.exe, which does its own processing of the command line. Since we don't have access to the source code of cmd.exe we can't see exactly how that is done. Somehow it isolates the part after /c, which I assume must involve trimming off the first layer of quotes.
  • cmd.exe then itself calls CreateProcess. I don't know exactly what it passes; it might be this layer where your quotes are getting "lost", or it might be passing on the string ExcelUpdateTool.exe "regex" machine_name available, with the quotes intact.
  • ExcelUpdateTool.exe must then parse this command line string itself. How it does that is entirely up to the program, and generally depends on how the program is built. If it is a C program or written in a language that uses the MSVC runtime then it will probably use the C runtime's argument parser, which would cause it to see an array like ["ExcelUpdateTool.exe", "regex", "machine_name", "available"], thus losing those inner quotes in the process.
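The joining step in the list above can be sketched in Python. This is only an approximation of the rules that Go's EscapeArg is documented to follow (backslash-escape each double quote, double any backslashes that directly precede a quote, and wrap the argument in quotes only when it contains a space or tab). The helper names escape_arg and make_command_line are made up for this illustration and are not part of Go or Terraform.

```python
def escape_arg(arg: str) -> str:
    """Quote one argument roughly the way Go's EscapeArg is documented to."""
    if arg == "":
        return '""'
    out = []
    backslashes = 0  # length of the backslash run seen just before this char
    for ch in arg:
        if ch == "\\":
            backslashes += 1
        elif ch == '"':
            # Backslashes directly before a quote are doubled, then the
            # quote itself is escaped with one more backslash.
            out.append("\\" * (backslashes * 2))
            out.append('\\"')
            backslashes = 0
        else:
            # Backslashes not followed by a quote are kept as they are.
            out.append("\\" * backslashes)
            out.append(ch)
            backslashes = 0
    if " " in arg or "\t" in arg:
        # Trailing backslashes would otherwise escape the closing quote.
        out.append("\\" * (backslashes * 2))
        return '"' + "".join(out) + '"'
    out.append("\\" * backslashes)
    return "".join(out)


def make_command_line(args: list[str]) -> str:
    """Join an argument list into the single string handed to CreateProcess."""
    return " ".join(escape_arg(a) for a in args)


if __name__ == "__main__":
    # The two-element form used in the original report.
    print(make_command_line(
        ["cmd.exe", '/c ExcelUpdateTool.exe "regex" machine_name available']))
    # The direct form, with every argument as its own list element.
    print(make_command_line(
        ["ExcelUpdateTool.exe", '"regex"', "machine_name", "available"]))
```

Under these assumptions the first call yields cmd.exe "/c ExcelUpdateTool.exe \"regex\" machine_name available", which matches the string described in the second bullet.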

I've not found any way to navigate these sequences of joining and splitting arguments except by trial and error. It might be helpful to eliminate the command interpreter layer altogether to remove one level of processing:

data "external" "NetworkObtainedData" {
  program = ["ExcelUpdateTool.exe", "\"regex\"", "machine_name", "available"]
}

I wish I could give a more specific suggestion here, but command line parsing on Windows is always ultimately at the mercy of the program being run, and so the only way to truly know what is needed is to review the code of the programs in question and reverse-engineer how they process the command line.

I would suggest avoiding PowerShell because its own command line parsing rules are quite complex, which makes it hard to get exact control over the quoting passed to external programs. The Windows command interpreter (cmd.exe) is much simpler, and so it gets away with much more limited processing of the command line. Avoiding both by going directly to the Windows CreateProcess API (which is the effect of my example above) puts the smallest number of processing steps between your Terraform configuration and the program being run.

@luckerby
Author

luckerby commented Jan 18, 2019

Thank you for the detailed reply, Martin. The flow especially is really nicely detailed. I’ve made some more tests on my side, and added the results below.

TL;DR: Invoking the external tool directly, and specifying the parameters it requires as separate elements of program in the Terraform configuration, is the way to go. I’d made the incorrect assumption that only 2 values are supported for the program attribute inside the external data source, since all the examples I had encountered looked that way, although the documentation here clearly states that’s not the case. I’d constrained myself to using cmd.exe when this wasn’t required at all.

Long version: I’ve tested a few ways of passing various parameter values, with and without cmd.exe. In case someone else runs into something similar in the future, maybe what they’re looking for will be here already.
One thing to note – in my case, it’s fortunate that the external tool itself being called is compiled C# code I’ve written, which is using the classic args method, so we can see next in detail what happens inside the tool, when the parameters are actually read.

Picking up where I left off in my original post: back then I didn’t include the command line that cmd.exe is invoked with, so I’ve done so this time. Below you can see the terraform external provider invoking cmd.exe, which in turn invokes the tool:
image

So, following the trail, and expanding Process Explorer's “Command line” fields we have the following.
First the external provider invoking the tool, seen below as it's declared within the .tf file:

data "external" "NetworkObtainedData" {
      program = ["cmd.exe", "/c ExcelUpdateTool.exe \"regex\" terraform-firstMachine available" ]
}

Just as Martin correctly predicted, cmd.exe is started using:
cmd.exe "/c ExcelUpdateTool.exe \"regex\" terraform-firstMachine available"

It looks like the shell itself (cmd.exe) spawns the tool using a malformed command line – note the stray quote at the end:
ExcelUpdateTool.exe \"regex\" terraform-firstMachine available"

The debug output for the tool itself lists the parameters sent across just as we’d have wanted them – note that the backslashes are gone:
image
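To see why the backslashes disappear while the quotes survive, here is a simplified Python model of the C-runtime-style argument splitting Martin described. It is a sketch under assumptions: it ignores the special handling of doubled quotes inside a quoted section, it parses the program name with the same rules as the other arguments, and the .NET rules may differ in small ways. The function name split_msvc_style is invented for this example.

```python
def split_msvc_style(cmdline: str) -> list[str]:
    """Split a Windows command line roughly the way the MSVC runtime does."""
    args = []
    current = []
    in_quotes = False
    have_arg = False
    i, n = 0, len(cmdline)
    while i < n:
        ch = cmdline[i]
        if ch == "\\":
            # Measure the whole run of backslashes.
            j = i
            while j < n and cmdline[j] == "\\":
                j += 1
            count = j - i
            if j < n and cmdline[j] == '"':
                # 2k backslashes + quote -> k backslashes, quote toggles mode.
                # 2k+1 backslashes + quote -> k backslashes + a literal quote.
                current.append("\\" * (count // 2))
                if count % 2 == 1:
                    current.append('"')
                else:
                    in_quotes = not in_quotes
                i = j + 1
            else:
                # Backslashes not followed by a quote are literal.
                current.append("\\" * count)
                i = j
            have_arg = True
        elif ch == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif ch in " \t" and not in_quotes:
            if have_arg:
                args.append("".join(current))
                current = []
                have_arg = False
            i += 1
        else:
            current.append(ch)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args


if __name__ == "__main__":
    # The command line cmd.exe was observed to spawn, stray quote included.
    line = 'ExcelUpdateTool.exe \\"regex\\" terraform-firstMachine available"'
    print(split_msvc_style(line))
```

In this model each \" becomes a literal quote, so the first parameter arrives as "regex" with its quotes, and the stray trailing quote merely opens a quoted section that never closes, leaving the last parameter as available.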

Let’s now take something closer to real life – say a regex identifying all IP addresses starting with 192.168.4, which translates to ^192\.168\.4\..*$. The caret signals the beginning of the line, the backslashes are there to escape the dot (which has special meaning in regex constructs), the dot-star matches any run of characters and the dollar maps to the end of the string. Instead of just supplying a generic value directly in the .tf file, the regex itself is placed inside a file, which is read through yet another external data source, as follows. The original external data source invoking the tool is modified to use the value read from that file.

data "external" "IPregexMatchPattern" {
    program = ["cmd.exe", "/c type c:\\Users\\malbert\\Desktop\\Excel2VM\\IPregexMatchPattern.txt"]
}
data "external" "NetworkObtainedData" {
      program = ["cmd.exe", "/c ExcelUpdateTool.exe \"${data.external.IPregexMatchPattern.result.IPregex}\" terraform-firstMachine available" ]
}

Through trial and error, I’ve discovered that the content of the (JSON) regex file itself has to contain double backslashes as below.
{"IPregex":"^192\\.168\\.4\\..*$"}

When cmd.exe itself is invoked, the string that reaches it is:
cmd.exe "/c ExcelUpdateTool.exe \"^192\.168\.4\..*$\" terraform-firstMachine available"

cmd.exe in turn invokes the tool using:
ExcelUpdateTool.exe \"^192\.168\.4\..*$\" terraform-firstMachine available"

What reaches the tool itself is the quoted regex, as per below. Within the tool, one would need to strip the quotes to get the original regex string.
image

I suspect terraform itself is parsing the value in the JSON file and collapsing each doubled backslash into a single one, since Go’s own EscapeArg doesn’t have a rule for this; by the time the string reaches cmd.exe, it only has single backslashes.
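The doubled backslashes are consistent with ordinary JSON string escaping: inside a JSON string, \\ encodes one literal backslash, while \. is not a valid escape at all. The external data source reads a JSON object from the program's standard output, so any JSON decoder would behave the same way. A small Python check, with decode_program_output as a hypothetical stand-in for that decoding step:

```python
import json


def decode_program_output(raw: str) -> dict:
    """Decode the JSON object an external program printed to stdout."""
    return json.loads(raw)


if __name__ == "__main__":
    # File content with doubled backslashes, as found by trial and error.
    doubled = r'{"IPregex":"^192\\.168\\.4\\..*$"}'
    print(decode_program_output(doubled)["IPregex"])

    # The same content with single backslashes is rejected by the decoder.
    single = r'{"IPregex":"^192\.168\.4\..*$"}'
    try:
        decode_program_output(single)
    except json.JSONDecodeError as err:
        print("invalid JSON:", err)
```

The first call prints the regex with single backslashes, which matches what was observed reaching cmd.exe.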

For the particular IP regex used above, it turns out that the quotes aren’t really necessary after all. So, stripping away the quotes from the external provider’s config:

data "external" "NetworkObtainedData" {
      program = ["cmd.exe", "/c ExcelUpdateTool.exe ${data.external.IPregexMatchPattern.result.IPregex} terraform-firstMachine available" ]
}

Again, through trial and error, it turns out that once the quotes above are gone, the caret in the regex needs to be doubled. Thus the input file containing the regex becomes:
{"IPregex":"^^192\\.168\\.4\\..*$"}

cmd.exe itself is invoked using:
cmd.exe "/c ExcelUpdateTool.exe ^^192\.168\.4\..*$ terraform-firstMachine available"

cmd.exe in turn invokes the tool (note that one caret disappears, probably because of the special meaning the caret traditionally has to the shell):
ExcelUpdateTool.exe ^192\.168\.4\..*$ terraform-firstMachine available"

This time the tool receives the regex just as we wanted it:
image
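The difference between the quoted and unquoted runs fits a commonly described model of cmd.exe: outside double quotes a caret escapes the next character and is itself dropped, while inside double quotes a caret is kept literally. Since the source of cmd.exe isn't available, the Python below is only a rough model of that one behavior, and the function name cmd_unescape_carets is invented for this sketch.

```python
def cmd_unescape_carets(text: str) -> str:
    """Rough model of caret handling in cmd.exe (an assumption, not its code)."""
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # cmd.exe does not treat backslash as special, so every
            # unescaped double quote toggles quoted mode.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "^" and not in_quotes:
            # Drop the caret and keep the following character literally.
            i += 1
            if i < len(text):
                out.append(text[i])
        else:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Unquoted regex: the doubled caret collapses to a single one.
    print(cmd_unescape_carets(
        r'ExcelUpdateTool.exe ^^192\.168\.4\..*$ terraform-firstMachine available'))
    # Quoted regex: the caret sits between two quotes and is left alone.
    print(cmd_unescape_carets(
        r'ExcelUpdateTool.exe \"^192\.168\.4\..*$\" terraform-firstMachine available'))
```

This would explain why a single caret was enough while the regex was wrapped in \"...\" and had to be doubled once the quotes were removed.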

One last run to see how it all comes together when invoking the tool directly, without using cmd.exe. The input regex file is changed back as follows:
{"IPregex":"^192\\.168\\.4\\..*$"}

The external provider invoking the tool is amended so that the tool and the parameters are all specified separately:

data "external" "NetworkObtainedData" {
      program = ["ExcelUpdateTool.exe", "${data.external.IPregexMatchPattern.result.IPregex}", "terraform-firstMachine", "available" ]    
}

The invoke sequence is straightforward this time:
image

The arguments that reach the tool are as expected:
image

Note that the last run only removed – as Martin said – a single layer of the processing, namely cmd.exe. All the other steps in his workflow still very much apply. Also, the results above were obtained with a tool compiled from C#, which most likely (an assumption on my side) follows the C/C++ conventions Martin referenced; if the tool is written in a different language, things will probably differ.

@apparentlymart
Copy link
Member

Thanks for following up with all that extra detail, @luckerby!

I was researching this situation recently for an unrelated project (outside of Terraform) so it was a nice coincidence that I happened to find this issue while these details were still fresh in my mind! All of what you've shown here makes sense and is consistent with what I saw in my own investigations.

I vaguely remember learning that the rules .NET CLR applications use to populate Main(string[] args) are subtly different from the MSVC++ approach, but not so different that it causes significant problems.

I'm going to leave this open for the moment so we can think about whether there's a way to distill this into some general information for the docs. So far I've found it difficult to do this just because of how much variation there is depending on how programs are being launched, but I think at least we could say something about not using cmd.exe or powershell.exe and describe the special way we treat program list in order to turn it into a single string on Windows.

Thanks again for the extra notes here!

@luckerby
Author

Probably the I-only-have-2-fields-available-to-invoke-a-tool mindset comes naturally to people with a Windows background (myself included) due to the tools we're usually working with.

First of all, Powershell scripts will usually get invoked using some scheduled task resembling what's seen below. You have to make sure that all those arguments are squeezed together on that second line.

image

Secondly, psexec.exe can be used to do all sorts of cool things. But there's a catch: running commands with parameters requires a special syntax, whereby cmd.exe has to be used, followed by a long line (string) of parameters (here, section Internal commands).

Now in a page containing documentation, most folks I know (myself included) will usually reach for the examples section, and don't really bother with the rest unless the thing they're trying to achieve won't work. A 2nd example in the external providers page that specifically uses more than 2 parameters will probably do the trick for most.

@apparentlymart
Member

apparentlymart commented Jan 18, 2019

I think this tendency for splitting it into two elements "program" and "arguments" comes from how the Windows API itself works, where CreateProcess treats the first token in the command line in a special way to extract the program to run, but just passes everything else on verbatim to the program. The most intuitive mapping from that into a UI is two separate text fields, one of which often has the "Browse..." button next to it as shown here.

The separate project I was referring to earlier was my shquot Go package, where I had to introduce a special idea of "split" command lines to properly represent how PowerShell and other Windows tools think about command lines.

With that said, I think you're right that a different example could make the point effectively. I think it may still be worth describing exactly how those separate strings get combined into a single string on Windows, since that itself implies some particular assumptions about quoting that may not be initially obvious, but an example will draw attention to the fact that there can be more than two arguments here without any need to refer to the prose docs.

@rquadling

No idea how relevant this is, but I used to work with PHP on Windows at a time when some of you were not even an idea yet (yes, I'm old).

One of the differences is the way to escape " on Windows.

Back then, we used the ^ to escape ".

Had to go hunting for a test on this : https://github.com/php/php-src/blob/ab5edb6a8e71eab4aa6d85953326548eb6d9c484/ext/standard/tests/streams/bug78883.phpt#L16

It MAY still be relevant and allow you to send the double quotes correctly so that they get picked up by cmd.exe appropriately.

YMMV.


@apparentlymart
Member

Thanks for sharing that, @rquadling!

I now have even less context about this in my working memory than I did last time I commented 🙃 but I can see in my go-shquot library that the ^ escape sequence seems to be for cmd.exe's own tokenizer, since I apparently implemented it in the WindowsCmdExe function.

My recollection from researching that is that if you're running a command through cmd.exe -- whether by literally typing it at a prompt or by using this cmd.exe /c switch -- then the command line is subjected first to command line tokenization, which detects things like the I/O redirection operators >/< and pipes | which the command interpreter itself is responsible for handling.

I did include replacing " with ^" in that ruleset, and so that agrees with what you've described in your comments.
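As a rough illustration of that kind of ruleset, the Python sketch below prefixes a handful of cmd.exe metacharacters with a caret. It is not the go-shquot implementation; both the helper name caret_escape and the exact set of characters treated as special are assumptions made for this example.

```python
# Characters assumed to be special to the cmd.exe tokenizer in this sketch.
CMD_METACHARACTERS = set('^"&|<>()')


def caret_escape(text: str) -> str:
    """Prefix each assumed cmd.exe metacharacter with a caret."""
    return "".join("^" + ch if ch in CMD_METACHARACTERS else ch for ch in text)


if __name__ == "__main__":
    # The double quotes around the regex placeholder gain a caret each.
    print(caret_escape('ExcelUpdateTool.exe "regex" machine_name available'))
    # Redirection and pipe operators are neutralized the same way.
    print(caret_escape("tool.exe a|b > out.txt"))
```

A string escaped this way is meant to be the text that cmd.exe itself tokenizes. Any quoting added by a layer in front of cmd.exe is applied on top of it and can change the result.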

An important thing to keep in mind when thinking about the behavior of the external data source is that it doesn't use any shells or command interpreters itself, and instead uses (on Windows) CreateProcess. However, as we saw in earlier discussion, it is possible to use the data source to run cmd.exe /c ..., in which case I expect it would be necessary for the argument after /c to use the ^ escaping for all of the cmd.exe metacharacters.

In that case then, perhaps a suitable recipe when using cmd.exe (which often isn't necessary) is:

  program = ["cmd.exe", "/c", "ExcelUpdateTool.exe ^\"regex^\" machine_name available"]

The external data source would then call CreateProcess with the arguments string set to /c "ExcelUpdateTool.exe ^\"regex^\" machine_name available", and then cmd.exe would tokenize the contents of the argument after /c using its normal rules and thus replace the ^" sequences with literal ".

(I've not actually tested this, so I can't say whether I'm making correct assumptions or not. If someone tries this and finds whether or not it works, it would be helpful to share what you tried and what happened when you tried it!)
