Add a way to match regular expressions #55

Open
cjdb opened this issue Apr 24, 2023 · 12 comments


@cjdb

cjdb commented Apr 24, 2023

I'm working on a project that wants to use jd to check our program output, but there are some fields that contain absolute paths. In order to have a robust test suite, we only hardcode the relative path, starting from the project directory, and fill the rest in with a regular expression. For example:

"filePath": "{{.*}}/path/to/file.cpp"

Our current tool interprets {{.*}} as a regular expression. Sadly, it's not a JSON diff tool, so we need to do some other, unreadable things to get it to work (hence the desire to use jd).

Being able to use regular expressions in jd would be wonderful. Is there any appetite for such an extension?
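To make the request concrete, here is a minimal sketch of how a `{{...}}` template could be turned into a single regular expression. It is not part of jd or of the requester's current tool; the names `template_to_regex` and `matches` are hypothetical, and the sketch assumes an embedded regex never contains `}}` itself.

```python
import re

# Text inside {{...}} is a regex; everything else is literal.
PLACEHOLDER = re.compile(r"\{\{(.*?)\}\}")

def template_to_regex(template: str) -> re.Pattern:
    """Compile a template into one regex, escaping the literal parts."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text
        parts.append("(?:" + m.group(1) + ")")            # embedded regex
        pos = m.end()
    parts.append(re.escape(template[pos:]))               # trailing literal
    return re.compile("".join(parts))

def matches(template: str, value: str) -> bool:
    """True if the whole value matches the template."""
    return template_to_regex(template).fullmatch(value) is not None
```

With this, `matches("{{.*}}/path/to/file.cpp", "/home/me/proj/path/to/file.cpp")` holds, while a value ending in a different file name does not match.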

@josephburnett
Owner

That's an interesting idea! Would you want to apply the regular expression to a single file / folder at a time, or a larger piece of the path? E.g. would your example "{{.*}}/path/to/file.cpp" match @ ["foo", "bar", "path", "to", "file.cpp"] or just @ ["foo", "path", "to", "file.cpp"]?

@josephburnett
Owner

josephburnett commented Apr 27, 2023

If you can provide a simple example of two files and what diff output you want, it would help me understand the specific behavior you're looking for. And to determine how much of an extension it is.

@cjdb
Author

cjdb commented Apr 27, 2023

Thanks for your interest! Here are a few examples. The LHS is controlled, and the RHS is independent.

LHS:

{
  "length":{{[0-9]+}},
  "location":{
    "index":0,
    "uri":"file://{{.*}}clang/test/Frontend/sarif-diagnostics.cpp"
  },
  "mimeType":"text/plain",
  "roles":[
    "resultFile"
  ]
}

RHS (no diff):

{
  "length":100,
  "location":{
    "index":0,
    "uri":"file:///tmp/clang/test/Frontend/sarif-diagnostics.cpp"
  },
  "mimeType":"text/plain",
  "roles":[
    "resultFile"
  ]
}

RHS (length is a string instead of a number):

{
  "length":"100",
  "location":{
    "index":0,
    "uri":"file:///tmp/clang/test/Frontend/sarif-diagnostics.cpp"
  },
  "mimeType":"text/plain",
  "roles":[
    "resultFile"
  ]
}
@ ["length"]
- {{[0-9]+}}
+ "100"

RHS (uri isn't prefixed with "file://"):

{
  "length":100,
  "location":{
    "index":0,
    "uri":"/tmp/clang/test/Frontend/sarif-diagnostics.cpp"
  },
  "mimeType":"text/plain",
  "roles":[
    "resultFile"
  ]
}
@ ["location","uri"]
- "file://{{.*}}clang/test/Frontend/sarif-diagnostics.cpp"
+ "/tmp/clang/test/Frontend/sarif-diagnostics.cpp"

@josephburnett
Owner

josephburnett commented Apr 28, 2023

I had never considered using jd this way! You want to define equality in a very flexible way. This is bordering on defining a schema, e.g. JSON Schema. If it "matches" then it's equal. Otherwise there is a difference.

JSON as a format lacks metadata (data about data): a way of attaching information directly to values. Other, similar data formats have it, such as Ion annotations and EDN tags. So we have to either use flag values (special keys and values, like "$ref") or pass metadata through a side channel (like a separate file or additional input).

Your example isn't valid JSON ("length":{{[0-9]+}}) but we could treat it like a flag value: "length":"{{[0-9]+}}" (add quotes and treat values starting with {{ as "special"). But that limits the scope of the solution and introduces a lot of special syntax. So let's explore metadata in a side channel first.
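The flag-value idea can be sketched as a recursive comparison in which LHS strings containing `{{...}}` are treated as patterns. This is only an illustration under stated assumptions, not jd's behavior: `leaf_equal` and `diff_paths` are hypothetical names, a value that consists of a single placeholder is matched against the RHS value's JSON text (so a number can be matched), and arrays are compared position by position.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{(.*?)\}\}")

def leaf_equal(lhs, rhs) -> bool:
    """Compare two leaves; an LHS string containing {{...}} is a pattern."""
    if not (isinstance(lhs, str) and PLACEHOLDER.search(lhs)):
        return type(lhs) is type(rhs) and lhs == rhs
    first = PLACEHOLDER.match(lhs)
    if first and first.end() == len(lhs):
        # The whole value is one placeholder: match the RHS's JSON text, so
        # {{[0-9]+}} accepts the number 100 but not the string "100".
        return re.fullmatch(first.group(1), json.dumps(rhs)) is not None
    if not isinstance(rhs, str):
        return False
    # Mixed value: literal text is escaped, placeholders are kept as regexes.
    parts, pos = [], 0
    for m in PLACEHOLDER.finditer(lhs):
        parts += [re.escape(lhs[pos:m.start()]), "(?:" + m.group(1) + ")"]
        pos = m.end()
    parts.append(re.escape(lhs[pos:]))
    return re.fullmatch("".join(parts), rhs) is not None

def diff_paths(lhs, rhs, path=()):
    """Yield the paths at which the RHS does not match the LHS."""
    if isinstance(lhs, dict) and isinstance(rhs, dict):
        for key in sorted(lhs.keys() | rhs.keys()):
            if key in lhs and key in rhs:
                yield from diff_paths(lhs[key], rhs[key], path + (key,))
            else:
                yield path + (key,)
    elif isinstance(lhs, list) and isinstance(rhs, list) and len(lhs) == len(rhs):
        for i, (a, b) in enumerate(zip(lhs, rhs)):
            yield from diff_paths(a, b, path + (i,))
    elif not leaf_equal(lhs, rhs):
        yield path
```

The sketch also shows the cost mentioned above: the quoting rules and the "single placeholder" special case are extra syntax that every user of the format would have to learn.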

The jd library and format can accept and encode metadata in the path to a value. It's a way to point to a part of the structure and say "interpret this data in this way". When making a diff, you have to provide the metadata out-of-band, either with the --set or --setkeys flags or, when using jd as a library, as a Metadata parameter. However, the diff encodes the metadata in the path, so you don't need to provide it again when applying the diff.

Example:

@ ["roles", ["set"], {}]
+ "notResultFile"

The ["set"] is read as metadata on the next value in the path.

Or this for short.

@ ["roles", {}]
+ "notResultFile"

The existing metadata is about how to interpret collections (arrays and objects). But you want to provide metadata about how to interpret leaf nodes, i.e. individual values. "Two values are the same if they are both numbers." "Two values are the same if they are strings and have the same ending." This could easily extend to "two values are the same if they are strings representing the same datetime" (but in different timezones).

So we need a powerful way to express a binary function for equality. You are proposing regular expressions, which is a solid option. They are well known, compact and powerful. Good for inlining into another data format. We just need to provide these expressions in a side channel, either a file or command-line parameters. So we could do something like this:

jd lhs.json rhs.json --metadata='["location","uri",{"regex":"file://{{.*}}clang/test/Frontend/sarif-diagnostics.cpp"}]'

Applied to these two files would return no diff:

{
  "location":{
    "uri":"file:///tmp/clang/test/Frontend/sarif-diagnostics.cpp"
  }
}
{
  "location":{
    "uri":"file:///my-temporary-folder/clang/test/Frontend/sarif-diagnostics.cpp"
  }
}

Or the metadata could be provided as a separate file. Like this:

[
  ["location","uri",{"regex":"file://{{.*}}clang/test/Frontend/sarif-diagnostics.cpp"}],
  ["length",{"regex":"\\d+"}]
]

(I have a strong preference for using valid JSON as much as possible)
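A rough sketch of how such side-channel rules could drive a comparison follows. Nothing here exists in jd: `load_rules`, `as_text` and `diff_paths` are hypothetical names, the `regex` field is assumed to hold a plain regular expression (without `{{ }}` delimiters), and two values at a ruled path are assumed equal when both match the expression. Non-string values are matched on their JSON text, and arrays are compared as a whole.

```python
import json
import re

def load_rules(rules):
    """Map each path (tuple of keys) to the compiled regex from its last element."""
    return {tuple(rule[:-1]): re.compile(rule[-1]["regex"]) for rule in rules}

def as_text(value) -> str:
    """Strings are matched as-is, everything else on its JSON text."""
    return value if isinstance(value, str) else json.dumps(value)

def diff_paths(lhs, rhs, rules, path=()):
    """Yield paths that differ; a ruled path is equal if both sides match."""
    rule = rules.get(path)
    if rule is not None:
        if not (rule.fullmatch(as_text(lhs)) and rule.fullmatch(as_text(rhs))):
            yield path
    elif isinstance(lhs, dict) and isinstance(rhs, dict):
        for key in sorted(lhs.keys() | rhs.keys()):
            if key in lhs and key in rhs:
                yield from diff_paths(lhs[key], rhs[key], rules, path + (key,))
            else:
                yield path + (key,)
    elif lhs != rhs:
        yield path
```

One consequence of matching on text is that the number 100 and the string "100" both satisfy `\d+`, so this particular sketch cannot tell them apart.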

The LHS has a concrete value which falls within the "schema". But that's kinda weird because you don't really care what the concrete value is. Just that it matches the metadata. Just that it falls within the "schema". So why not just use an existing schema language to set constraints?

Maybe your LHS should be something like this:

{
  "type": "object",
  "properties": {
    "length": {
      "type": "integer"
    },
    ...
  },
  "required": [ "length", ... ]
}

Maybe you should be using a tool that validates JSON Schema instead of jd. Or maybe we should start thinking of jd as a terse schema validator and add validation inline per your original suggestion.

Let me ask some follow-up questions to understand your use case better. How do you plan to maintain and update the LHS? Will you have a source-of-truth JSON file that you update from time to time? Or will you write it carefully by hand as validation for the RHS? Will you start with a strict match and make parts "fuzzy" selectively (like the path)? Or will it be primarily composed of regular expressions?

@josephburnett
Owner

josephburnett commented Apr 30, 2023

Another odd thing about embedding regular expressions in the LHS would be that you repeat them (denormalize) throughout a list.

Example:

{
  "locations":[
    {
      "index":0,
      "uri":"file://{{.*}}clang/test/Frontend/sarif-diagnostics.cpp"
    },
    {
      "index":9,
      "uri":"file://{{.*}}clang/test/Frontend/another-diagnostics.cpp"
    },
    {
      "index":99,
      "uri":"file://{{.*}}clang/test/Frontend/yet-another-diagnostics.cpp"
    }
  ]
}

This example might not make sense for your use case, but I could easily see a case where we have a list of the same type of object. A schema would give that object a name and define its shape once. But using jd with inline regular expressions, we would have to repeat the "type" definition over and over.

A second odd thing about using inline regular expressions would be telling the difference between types. Your example wants a number for length. Definitely not a string. So you embedded the regular expression {{[0-9]+}} to match a JSON number. But you really should have something like {{[0-9]+(\.[0-9]+)?}} because JSON numbers can have a fractional part. And so on. Before long you've written a JSON parser with regular expressions in order to read and constrain the value of the RHS.

Instead we should just say what type we want. Which again points to out-of-band (not inline) metadata describing the shape.
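To illustrate the point about numbers, the sketch below contrasts the simple pattern with a regex for the full JSON number grammar (RFC 8259), and with a type check that parses the value instead. The names `NAIVE`, `JSON_NUMBER` and `is_json_number` are made up for this example.

```python
import json
import re

# The pattern from the example: digits only.
NAIVE = re.compile(r"[0-9]+")

# JSON number grammar: optional minus, integer part without leading zeros,
# optional fraction, optional exponent.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

def _reject_constant(name):
    # Python's parser accepts NaN and Infinity by default; JSON does not.
    raise ValueError(name)

def is_json_number(text: str) -> bool:
    """Say what type we want by parsing, rather than by pattern matching."""
    try:
        value = json.loads(text, parse_constant=_reject_constant)
    except ValueError:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)
```

The type check is shorter to state and harder to get wrong than the regex, which is the argument for describing the shape out-of-band.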

So your specific use case sounds a lot more like a job for JSON Schema. Schema files can be pretty verbose, so a good way to get started would be to use a schema generator (search for "JSON schema generator"), which will turn your golden LHS into a schema file.

You still need to handle the file prefix problem, a problem for which JSON schema doesn't have native support. But most schema validators allow you to provide a custom validation function and you could have a "file" function that you parameterize with the suffix.
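A parameterized "file" function of that kind could be as small as the following sketch. How it gets registered depends on the schema validator, which is not shown here; `file_uri_with_suffix` is a hypothetical name.

```python
def file_uri_with_suffix(suffix: str):
    """Return a validator that accepts file:// URIs ending with `suffix`."""
    def validate(value) -> bool:
        return (
            isinstance(value, str)
            and value.startswith("file://")
            and value.endswith(suffix)
        )
    return validate

# One validator per expected file, parameterized only by the relative path.
check_uri = file_uri_with_suffix("clang/test/Frontend/sarif-diagnostics.cpp")
```

Here `check_uri` accepts "file:///tmp/clang/test/Frontend/sarif-diagnostics.cpp" and rejects the same path without the "file://" prefix.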

For that matter, we could add some custom function hooks into the jd library so you could provide the same kinds of validation function. You would need to write them in golang since that's the jd library language unless you want to run jd as WASM (which is totally possible--that's how the UI works).

But I don't want to add such custom functions into jd natively (in repo, built into the binary, part of the jd diff format) for two reasons: /1/ it departs from the "do one thing well" principle and gets pretty far into the schema validation space, for which there are better tools, and /2/ I've maintained a pretty strict round-trip invariant that all diffs can also be applied as patches: diff(LHS, RHS) = DIFF and patch(LHS, DIFF) = RHS. Regular expressions would break this property. I hadn't mentioned the second reason before, but I think it's quite important.
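The round-trip concern can be demonstrated with a toy diff and patch over flat objects. This is not jd's algorithm or format; `diff`, `patch` and `fuzzy` are hypothetical, and the invariant is read here as "patching the LHS with the diff yields the RHS", which is an assumption about the intended meaning.

```python
import re

_MISSING = object()  # marks a key that is absent on one side

def diff(lhs: dict, rhs: dict, equal=lambda a, b: a == b) -> list:
    """Return (key, old, new) hunks for keys whose values are not `equal`."""
    hunks = []
    for key in sorted(lhs.keys() | rhs.keys()):
        old, new = lhs.get(key, _MISSING), rhs.get(key, _MISSING)
        if old is _MISSING or new is _MISSING or not equal(old, new):
            hunks.append((key, old, new))
    return hunks

def patch(doc: dict, hunks: list) -> dict:
    """Apply hunks to a copy of doc."""
    out = dict(doc)
    for key, _old, new in hunks:
        if new is _MISSING:
            out.pop(key, None)
        else:
            out[key] = new
    return out

def fuzzy(a, b) -> bool:
    """Regex-style equality: any two file:// URIs count as equal."""
    both_str = isinstance(a, str) and isinstance(b, str)
    return a == b or (both_str and all(re.fullmatch(r"file://.*", v) for v in (a, b)))

lhs = {"length": 1, "uri": "file:///tmp/a.cpp"}
rhs = {"length": 2, "uri": "file:///home/a.cpp"}
exact_round_trip = patch(lhs, diff(lhs, rhs)) == rhs          # holds
fuzzy_round_trip = patch(lhs, diff(lhs, rhs, fuzzy)) == rhs   # does not hold
```

With fuzzy equality the diff omits the "uri" hunk, so the patched document keeps the LHS URI and no longer equals the RHS.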

So in conclusion, you could still go either way: /1/ use JSON Schema to validate the RHS (using a generator on a golden LHS, then writing custom schema functions) or /2/ extend jd to accept custom equality functions, then use the jd library to build a different tool (I can help you with this). The answer depends on the details of your use case (questions above) and your appetite for building new tooling. What do you think?

@josephburnett
Owner

I was thinking about your [0-9]+ regular expression, which pretty much says "I don't care what this number is". And it reminded me of a feature I was playing with some time ago: path masking. It was for diffing Kubernetes objects where some of the fields are autogenerated: https://github.com/josephburnett/jd/blob/b5d115ce58b246ab63782c290386972a3bd28b95/README.md#see-what-changes-in-a-kubernetes-deployment

Do you really care if the length changes type? Would you be okay saying "Ignore length entirely"?

I'm leaning more toward building some of this into jd because it aligns well with the path masking feature. And one of the masks could be a regular expression.
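As a sketch of what "ignore length entirely" could mean, the function below removes masked paths from both documents before they are compared. It is not the jd path-masking feature; `mask` is a hypothetical helper that only handles object keys (no array indices) and expects non-empty paths.

```python
import copy

def mask(doc, paths):
    """Return a deep copy of doc with the value at each path removed."""
    out = copy.deepcopy(doc)
    for path in paths:
        *parents, last = path
        node = out
        for key in parents:
            # Walk down; stop following the path if it leaves the objects.
            node = node.get(key) if isinstance(node, dict) else None
        if isinstance(node, dict):
            node.pop(last, None)
    return out

lhs = {"length": 100, "location": {"index": 0, "uri": "file:///tmp/a.cpp"}}
rhs = {"length": "100", "location": {"index": 0, "uri": "file:///home/a.cpp"}}
masked = [("length",), ("location", "uri")]
same = mask(lhs, masked) == mask(rhs, masked)
```

After masking, both documents reduce to `{"location": {"index": 0}}`, so a plain diff reports nothing, including the change of type in "length".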

Also sorry for the gigantic response. It's just helpful to externalize my thought process. 😉

@cjdb
Author

cjdb commented May 1, 2023

> Maybe you should be using a tool that validates JSON Schema instead of jd.

Possibly! We need a simple diff for the vast majority of our output: it's just paths and lengths that get regexed.

> How do you plan to maintain and update the LHS? Will you have a source-of-truth JSON file that you update from time-to-time?

Our LHS is the SARIF that we expect Clang to output (i.e. it's a test case). The test is embedded in the source file as the source of truth (here's an example of what it would look like*). Although we could technically update "length", that's extremely brittle because it tracks the number of bytes in the file. The "uri" fields are absolute, so we need to regex those.

*The RUN line on line 6 would feed the test input to jd, stripping away the C++ness of the file.

@cjdb
Author

cjdb commented May 1, 2023

> Also sorry for the gigantic response. It's just helpful to externalize my thought process. 😉

No worries! I got distracted on Friday, so I appreciate your patience :)

> Do you really care if the length changes type? Would you be okay saying "Ignore length entirely"?

We don't care about the length at the moment because I only recently learnt that our current (non-JSON-friendly) tool supports regex, so this isn't a tall order.

> /1/ use JSON Schema to validate the RHS (using a generator on a golden LHS, then writing custom schema functions) or /2/ extend jd to accept custom equality functions, then use the jd library to build a different tool (I can help you with this). The answer depends on the details of your use case (questions above) and your appetite for building new tooling. What do you think?

Good question. I'll mull this one over and get back to you in the next day or so (though if path masking gets added, that may suffice). Very much appreciate how much time and thought you've put into this, thank you :)

@cjdb
Author

cjdb commented May 1, 2023

A third option is for jd to output a diff as it does today, and then that diff could undergo some postprocessing using Unix tools like grep, wc, head, and tail. I'll try this out today and let you know if that works well enough (and also share the script for others to use if it does).
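The post-processing idea could look roughly like this. The sketch uses Python rather than grep and friends, assumes the diff text consists of hunks that start with an `@ <JSON path>` line as in the examples earlier in this thread, and uses a hypothetical `drop_hunks` helper; it is not the script mentioned above.

```python
import json

def drop_hunks(diff_text: str, ignored_paths) -> str:
    """Remove every hunk whose '@ [...]' path is listed in ignored_paths."""
    ignored = [list(p) for p in ignored_paths]
    kept, keep = [], True
    for line in diff_text.splitlines():
        if line.startswith("@ "):
            # A new hunk starts: decide once whether to keep its lines.
            keep = json.loads(line[2:]) not in ignored
        if keep:
            kept.append(line)
    return "\n".join(kept)

diff_text = '@ ["length"]\n- 100\n+ "100"\n@ ["location","uri"]\n- "a"\n+ "b"'
filtered = drop_hunks(diff_text, [["length"]])
```

An empty result after filtering would then mean "no relevant difference", which a test script can check directly.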

@cjdb
Author

cjdb commented May 2, 2023

There are apparently some infra reasons we can't use Go, so unfortunately jd won't be used. Since jd is pretty much perfect for our use-case, I'm planning to make a port so we get the functionality.

Thanks for all your assistance on this issue, and for making a really cool utility!

@cjdb cjdb closed this as completed May 2, 2023
@josephburnett
Owner

@cjdb I'm glad the tool is useful and I'm sorry you can't use Go! Keep me posted if you do create a port. I would love to share ideas and keep them compatible. I'm in the process of implementing a 2.0 version of the format. I need to make some backward incompatible changes to add context for producing minimal diffs: #50

@josephburnett
Owner

I'm reopening this issue because the feature is still useful, even if the requester doesn't need it anymore.
