Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect insertion of align_to_byte() / alignToByte() calls #1070

Open
10 tasks done
generalmimon opened this issue Oct 4, 2023 · 1 comment
Open
10 tasks done

Incorrect insertion of align_to_byte() / alignToByte() calls #1070

generalmimon opened this issue Oct 4, 2023 · 1 comment

Comments

@generalmimon
Copy link
Member

generalmimon commented Oct 4, 2023

I think it's safe to say that the biggest obstacle to using bit-sized integers is the incorrect insertion of align_to_byte() calls by the compiler. Instances of this problem have been reported dozens of times before, but none of the existing issues felt general enough to make it canonical, which is why I'm opening a new one. The following issues with checkboxes should be closed as duplicates of this issue:

  • Alignment options #12

    This one shouldn't be closed because it's in fact a broader issue, but people have also reported instances of the improper align_to_byte() insertion several times there, so I'm including it in this list.

  • un-aligned, bit-based data stream #576

    This is a broader issue as well - although the issue with align_to_byte() insertion is also mentioned there, it even proposes to make operations that are currently byte-aligned by design (u{X} and s{X} integers, floats, byte arrays) non-byte aligned, so technically it's not a pure duplicate.


The problem is that the align_to_byte() calls are inserted by the compiler, and its logic for deciding when to do this is very crude. Mainly:

  • it treats type switches and user-defined types as byte-aligned (which is certainly undesirable - I don't see any reason why you shouldn't be able to use a type switching or cross a type boundary at any position where the format you're describing requires it)
  • doesn't consider ifs, repeats etc.
@generalmimon
Copy link
Member Author

generalmimon commented Oct 13, 2023

On the one hand, you could say that the compiler implementation is very incomplete and you'd be right. On the other hand, I've realized over time that it would in fact be far from easy to make the compiler behave properly in all cases, because there are a lot of situations that you have to handle specifically, for example (#743 (comment)):

seq:
  - id: a
    type:
      switch-on: _index
      cases:
        0: b2
        1: u1
    repeat: expr
    repeat-expr: 2

This is effectively a b2 - u1 sequence at runtime, so you have to insert an align_to_byte() call between the two types. But from the perspective of implementing the alignment in the compiler, this case is quite unusual - usually you need 2 consecutive attributes to require the alignment, but here it's within the same attribute due to repetition. Of course, the above situation can be equivalently implemented via a user type with if-guarded fields, for example:

seq:
  - id: a
    type: item(_index)
    repeat: expr
    repeat-expr: 2
types:
  item:
    params:
      - id: i
        type: s4
    seq:
      - id: byte_val
        type: u1
        if: _index == 1
      - id: bit_val
        type: b2
        if: _index == 0

(I've intentionally defined byte_val before bit_val, because it doesn't matter given that only one is active at a time.)

This has the same effect in practice, but for the compiler this is a very different case, and this situation is just one of many. Somehow the compiler would have to see through all of this, which definitely sounds like an awful lot of work.


Instead, the approach I've already taken in Java and Python (Java: kaitai-io/kaitai_struct_java_runtime@1bc75aa, Python: kaitai-io/kaitai_struct_python_runtime@1cb84b8) is to handle all bit-byte alignment in the runtime library. It turned out that doing this at runtime is simply at least 20x easier than at compile time - and more reliable too. Consequently, it also makes the runtime library more suitable for standalone use, because now a call sequence like read_bits_int_be(3) - read_bytes(1) - read_bits_int_be(3) is not an undefined behavior anymore (due to the missing align_to_byte() call between the first and the second call), but will actually work as expected, issuing an alignment between the first two operations automatically.

The only potential disadvantage of handling this at runtime is, obviously, some runtime performance penalty (because essentially all byte-aligned reads call align_to_byte() now). I haven't done any benchmarks, but I don't think it will be significant in most cases because the align_to_byte() is very simple and I believe it can be well optimized (inlined etc.) by the compilers. And we in fact need it for correctness, so I think it's worth it.

Now, all that's left to do is to port the existing scheme from Java and Python runtimes to all the other languages and all align_to_byte() problems will hopefully be gone for good.

generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Nov 4, 2023
0.11 will definitely include some changes that require an updated
runtime - for example
kaitai-io/kaitai_struct_cpp_stl_runtime#67 or
the fact that (once the `serialization` branch is merged) the compiler
will no longer insert `align_to_byte()` calls and will instead rely on
the runtime libraries to do it:
kaitai-io/kaitai_struct#1070 (comment)).

Besides, we need to require something newer than 0.9 to fix the version
check in Python:
kaitai-io/kaitai_struct#804 (comment)
@generalmimon generalmimon changed the title Incorrect insertion of align_to_byte() calls Incorrect insertion of align_to_byte() / alignToByte() calls Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant