Editing a binary #639

Open
peterwaller-arm opened this issue Apr 12, 2023 · 9 comments

Comments

@peterwaller-arm

peterwaller-arm commented Apr 12, 2023

Hi,

With jq I'm used to being able to edit a json document in-situ:

$ echo '{"foo":0}' | jq '.foo = 42'
{
  "foo": 42
}

I figure you should be able to do the same sort of thing with fq with binaries, but if so, it's not documented how to do it exactly.

For example:

fq '.header.entry=0' a.out

... at the moment this prints out a JSON representation of the ELF. The closest thing I saw in the documentation was the "Add query parameter to URL" example in the screenshot at the beginning of the README, but of course there is no toelf, and tobytes reports: fq: value can't be a binary.

So, is there a way to edit fields and then dump it out in the original format, or is this not possible currently? Thanks!

@wader
Owner

wader commented Apr 12, 2023

It's complicated :) fq at the moment has very limited editing support for "decode values"; tourl/fromurl work on JSON, so that is just normal jq behaviour. But it can do bit and byte slicing and combine things together again into a binary. When you do .header.entry=0, the "decode value" is first converted into JSON and then the update is done.

The reason the support is limited is a mix of me not needing it much myself and it being complex for some of the formats supported by fq. For a lot of formats it's not really clear what should happen on an update. Encode with the same encoding? But what about encodings that are ambiguous, like varints, which can encode a value in many ways with different sizes? Should the size be preserved/truncated? Update checksums? Fields that control the number of entries in an array? Also fq has support for sub-buffers for demuxing/TCP reassembly etc... yeah you see :) But maybe the "clearest" would be to just support updating a specific bit/byte range using some bit-size/endian helpers etc.

And you can kind of do this already using the slicing support, for example to update .header.entry in an ELF:

# this assumes the entry is 64 bits (8 bytes)
$ fq '(.header.entry | tobytesrange) as $e | tobytes | [.[:$e.start],0,1,2,3,4,5,6,7,.[$e.stop:]] | tobytes | elf | .header' some_elf
# or to write it out to a file
$ fq '(.header.entry | tobytesrange) as $e | tobytes | [.[:$e.start],0,1,2,3,4,5,6,7,.[$e.stop:]] | tobytes' some_elf > changed

This uses slicing, just normal [start:stop] jq syntax, on bytes (there is also tobits/tobitsrange to use bit indexes) and "binary arrays" in fq (similar to iolists in Erlang). Any array that includes only the following values can be converted to a binary (via tobytes/tobits):

  • 0-255 will be one byte
  • strings will be UTF-8 bytes
  • nested binary array
  • bits and bytes values

Also the difference between tobytes and tobytesrange is that the range-version "remembers" its source start/stop range.
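
To make the "binary array" idea concrete, here is a minimal Python sketch of the flattening rules listed above and of the cut-and-stitch recipe. It is only a model of the concept, not fq's implementation: flatten_binary_array is a hypothetical name, the 16-byte buffer and the 4..12 range are made up, and bit-level values are left out for brevity.

```python
def flatten_binary_array(value):
    """Flatten a nested "binary array" into bytes (illustrative model, not fq code)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)              # byte values are copied as-is
    if isinstance(value, str):
        return value.encode("utf-8")     # strings become their UTF-8 bytes
    if isinstance(value, bool):
        raise TypeError("booleans are not valid binary array members")
    if isinstance(value, int):
        if not 0 <= value <= 255:
            raise ValueError(f"{value} does not fit in one byte")
        return bytes([value])            # 0-255 becomes a single byte
    if isinstance(value, list):
        # nested arrays are flattened recursively, in order
        return b"".join(flatten_binary_array(v) for v in value)
    raise TypeError(f"unsupported member: {value!r}")


# Made-up 16-byte "file" and a made-up field range, standing in for
# the start/stop that tobytesrange would report.
data = bytes(range(16))
start, stop = 4, 12

# Same shape as [.[:$e.start], 0,1,2,3,4,5,6,7, .[$e.stop:]] in the recipe.
patched = flatten_binary_array([data[:start], 0, 1, 2, 3, 4, 5, 6, 7, data[stop:]])
print(patched.hex())
```

Because the replacement is exactly as long as the range it replaces, everything after the field keeps its original offset, which is why this recipe works without touching the rest of the file.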

That said all of this can probably be improved in many ways, let me know your ideas.

@peterwaller-arm
Author

Yeah, that's really nice that you can use tobytesrange in that way -- definitely a missing recipe in the docs in the interim!

A next small step would be to provide an ergonomic way to inject bytes. overwritebytes($e, newvalue | asuint32) or whatever would be appropriate as syntactic sugar for the recipe you suggested above. I guess it gets fun when you have to consider all possible encodings, endiannesses and the like. overwritebytes could at least check that the length of the bytes being inserted matches the range being inserted into.
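
As a rough illustration of what such a helper might do, here is a Python sketch with the length check described above. Nothing here exists in fq: overwrite_bytes and as_uint are hypothetical names mirroring the proposal, and the buffer, the 24..32 range and the little-endian byte order are assumptions chosen for the example.

```python
def as_uint(value, size, byteorder="little"):
    """Encode an unsigned integer into exactly `size` bytes.

    int.to_bytes raises OverflowError if the value is negative or too large.
    """
    return value.to_bytes(size, byteorder)


def overwrite_bytes(data, start, stop, new):
    """Return a copy of `data` with data[start:stop] replaced by `new`.

    Refuses to change the overall length, so later offsets stay valid.
    """
    if not 0 <= start <= stop <= len(data):
        raise ValueError("range is outside the buffer")
    if len(new) != stop - start:
        raise ValueError(
            f"replacement is {len(new)} bytes but the range is {stop - start} bytes"
        )
    return data[:start] + new + data[stop:]


# Made-up 64-byte buffer standing in for a header; bytes 24..32 play the
# role of a 64-bit entry field (an assumption for this example).
header = bytes(64)
patched = overwrite_bytes(header, 24, 32, as_uint(0x401000, 8))
print(patched[24:32].hex())
```

Keeping the encoding step (as_uint) separate from the splice keeps the size and endianness choice explicit at the call site, which is one way to sidestep the "all possible encodings" problem: the helper only ever sees bytes.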

The ultimate vision would be to be able to update any value in any format and then propagate that change to anything else in the binary that needs to be updated to make it semantically correct. I'm guessing though that this is difficult-to-impossible, in the most extreme case requiring essentially a recompilation of the binary (imagine for my use case(s) for ELF patching, changing the length of a string, which changes the offsets of everything else in a section: suddenly all absolute addresses referring to points after that string may need changing, and those new addresses might not be representable anymore with the same sequences of instructions in the binary, which would need propagating, and so on and so on).

@wader
Owner

wader commented Apr 13, 2023

Yeap, you describe my current plan quite well: some kind of helpers for cut/stitch and encodings. I had some momentum and motivation for a while to work on it, but I think I got stuck on how to make it feel jq:y and how to not "pollute" the namespace too much with lots of small functions etc. One thing I've thought about is that fq already has some machinery to do query rewrites, so it's possible to "extend" the jq language a bit if that would help.

About ultimate visions, I think you summarise the problem quite well also. You more or less end up having to write a transmuxer, linker etc., and one that should handle and preserve lots of strange things, or should it "normalize"? :) I've thought some about different ways, I'll list them:

  • "symmetric"/"bi-directional" approach. Would probably require something declarative. I know Kaitai Struct is working on serialisation, but it seems to be far from trivial and will probably have limitations. Another issue with a declarative approach is that some formats have logic that I think is very complicated to describe purely declaratively. For example, how to describe sample ranges in an mp4 file, and then also how to describe what metadata from a certain box is needed to decode samples for a specific track, and so on.

  • Support format-specific encoders. Similar to how to_yaml etc. currently work, but producing binary instead. Possibly a decoder and encoder could share a common JSON "schema" somehow? This also raises some questions about how it should behave with regard to nested decoding, reassembly, symbolic mappings etc. If lots of details are needed, maybe the JSON representation would become less usable. Maybe how "decode values" currently work could be extended to help? A decode value is a jq value plus some metadata (bit range, source buffer, sym mappings etc.).

Also with any of the approaches it needs to fit well with how jq works.

@ksa-real
Contributor

ksa-real commented Jun 20, 2023

How about adding a "Big thing" TODO about binary modifications that may as well modify length?

@wader
Owner

wader commented Jun 20, 2023

> How about adding a "Big thing" TODO about binary modifications that may as well modify length?

Yeap, that is a good idea, maybe it can link this issue also.

Could you clarify what you mean by "may as well modify length"? Is it about the modification changing the length of the thing being modified?

@ksa-real
Contributor

I don't think I have a use case right now, but the idea is as follows. Let's take the fMP4 container. The "ftyp" box contains a list of "brands", each a 4-character identifier. Assume we add a brand to this list. This operation would change the length of the binary representation of the list. The box containing the list would also grow in size. The boxes that follow the "ftyp" box would change their position (start += 4). Basically, the idea is to allow this sort of manipulation: not just replacing a few bytes but also doing inserts and deletes.
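
The size bookkeeping for the "ftyp" case can be sketched in a few lines of Python. This is only an illustration under assumptions: build_ftyp, parse_ftyp and add_brand are hypothetical helpers (not fq functions), the brand values are arbitrary examples, and only a standalone box with a 32-bit size field is handled.

```python
import struct


def build_ftyp(major_brand, minor_version, compatible_brands):
    """Serialize an ftyp box: 32-bit size, b"ftyp", major brand,
    32-bit minor version, then the 4-byte compatible brands."""
    payload = major_brand + struct.pack(">I", minor_version) + b"".join(compatible_brands)
    return struct.pack(">I", 8 + len(payload)) + b"ftyp" + payload


def parse_ftyp(box):
    """Return (major_brand, minor_version, compatible_brands) for an ftyp box."""
    size, box_type = struct.unpack(">I4s", box[:8])
    if box_type != b"ftyp":
        raise ValueError("not an ftyp box")
    major = box[8:12]
    (minor,) = struct.unpack(">I", box[12:16])
    brands = [box[i:i + 4] for i in range(16, size, 4)]
    return major, minor, brands


def add_brand(box, brand):
    """Re-encode the box with one more compatible brand; the size field grows by 4."""
    major, minor, brands = parse_ftyp(box)
    return build_ftyp(major, minor, brands + [brand])


original = build_ftyp(b"iso5", 0, [b"iso5", b"dash"])
updated = add_brand(original, b"cmfc")
print(len(original), len(updated))
```

Re-encoding the one box is the easy part. The sketch does nothing about the boxes that follow it, so any absolute offsets stored elsewhere in the file would still point at the old positions.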

@wader
Owner

wader commented Jun 20, 2023

OK I see, yeah that would be nice, but I'm not sure how one would do it, and I have thought about it quite a lot. For example, in the mp4 case, if a change to the brands list affects the size of the ftyp box, then all boxes after it will move, which in turn will most likely affect offsets in stco boxes and so on if the file should still be playable. So to support that kind of thing, my guess is that one would have to write an encoder for each format that wants to support it (nearly an mp4 muxer in this case). But encoding also creates other issues and ambiguities: should an encoder try to "preserve" number, string etc. encodings that can encode the same value in multiple ways, or normalize (varint for example)? Would assigning to a field that has a symbolic mapping do the reverse mapping back? Lots of questions :)
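
The varint ambiguity is easy to demonstrate. The Python sketch below uses unsigned LEB128 as one example of such an encoding (the choice of LEB128 is an assumption; the thread does not name a specific varint scheme), and encode_uleb128 with its min_length parameter is a hypothetical helper for producing padded, non-minimal encodings.

```python
def decode_uleb128(data):
    """Decode an unsigned LEB128 value; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if byte & 0x80 == 0:               # high bit clear marks the last byte
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


def encode_uleb128(value, min_length=1):
    """Encode `value`, padding with continuation bytes up to `min_length` bytes."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value or len(out) + 1 < min_length:
            out.append(byte | 0x80)        # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


minimal = encode_uleb128(1)
padded = encode_uleb128(1, min_length=3)
print(minimal.hex(), padded.hex())
print(decode_uleb128(minimal), decode_uleb128(padded))
```

Both byte strings decode to the same number, so a decoded JSON value alone cannot tell an encoder which form the original file used. Preserving the original would need extra metadata such as the source bit range, while normalizing would change the field's size.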

@ksa-real
Contributor

Agree. Likely the whole thing would look like a "manual muxer" specific to each format. Specificity is not an issue per se, as every format is already custom. I was thinking about e.g. the sidx box. In the above case it would become invalid, so one would need either to manually patch its values, or fq would parse the file, maintain an internal representation including references, and write correct sidx values during serialization. The latter would mean the sidx references are computable fields that don't fit well into the whole concept. The former approach may be practical in some cases. I guess the right way is to collect real-world use cases and go from there.

@wader
Owner

wader commented Jun 21, 2023

Yeap, collecting use cases sounds good. I've mostly used the technique I mentioned in a comment above to stitch things together. I wonder if one could come up with some helper function(s) to make that easier.

@wader wader mentioned this issue Jul 23, 2023
27 tasks

3 participants