Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Compile-time Target Architecture Tags #143

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Lancern
Copy link

@Lancern Lancern commented Aug 6, 2023

This PR resolves #118, which proposes compile-time target architecture tags. It contains a series of commits that implement the proposal. Problems may arise as I author these commits so I'm opening this PR in an early stage to make sure that we can agree on the specific designs in the end.

Here's a list of unresolved problems:

  • Currently, there is no CapstoneBuilder that builds Capstone instances whose target architectures are "really" unknown at compile-time. When we call functions like x86, we have explicitly tell that the target architecture is x86. We should implement a new builder that sets the target architecture via a runtime Arch value.
  • Currently, for many supported architectures, their RegId and InsnGroupId (some architectures even the InsnId) are simply defined as a bunch of u32 values inside a mod rather than a dedicated enum (e.g. ARM, ARM64, x86, etc.). We’d better re-define them to be dedicated enums so that defining ArchTag::RegId and ArchTag::InsnGroupId is meaningful.
  • After we support compile-time arch tags, many operations need not to return an Option or Result any more as they should always succeed given the exact target architecture. But for DynamicArchTag these operations may still fail so we need to keep their return types as it is now. How do we resolve the conflict?

Here's a TODO list to make this PR fully ready:

  • Update documentation.
  • Update examples, benches and tests.

This commit is the first commit among a series of patches that implement the
compile-time architecture tag proposal in capstone-rust#118. It contains the following major
changes:

- Add the ArchTag trait. This trait indicates a "tag" type that represents a
  specific architecture. Its associated types specify specific data types
  corresponding to the architecture it represents.

- Add ArchTag implementations for supported architectures, and a special
  DynamicArchTag that indicates the target architecture is unknown at compile
  time.

- Update existing general type definitions (e.g. Capstone, Insn, etc.) to
  acquire a general arch tag type parameter.
This commit changes the bindgen settings so that bindgen generates newtype enums
instead of modules of constants for arch insn group types and reg types. Usages
of these binding types are updated as well.
@tmfink
Copy link
Member

tmfink commented Aug 21, 2023

Currently, for many supported architectures, their RegId and InsnGroupId (some architectures even the InsnId) are simply defined as a bunch of u32 values inside a mod rather than a dedicated enum (e.g. ARM, ARM64, x86, etc.). We’d better re-define them to be dedicated enums so that defining ArchTag::RegId and ArchTag::InsnGroupId is meaningful.

I like your change which uses newtype_enum instead of the constified enum module! It's already a big improvement.

Bindgen didn't supported them when I first did this work. In fact, I contributed the "constified enum module" feature to bindgen (rust-lang/rust-bindgen#741) to avoid the less ergonomic constified enum.

However, I think we may be able to do even better (although it might be more annoying). Ideally, we would expose actual Rust enums in the high-level bindings. That way, users could take advantage of the exhaustive pattern matching with match. The Capstone C code makes sure that 0 is always an "invalid" value such as ARM_SYSREG_INVALID. When converting from the C value to the Rust value, we could map any invalid value to an "invalid" variant of the enum OR always return an Option<T> when converting and return None if we get an invalid value.

// Have internal "invalid" variant
enum ArmReg {
    Invalid = 0,
    APSR,
    APSR_NZCV,
    // ...
}
// No invalid variant, so converting gives `Option<ArmReg>`
enum ArmReg {
    APSR = 1,
    APSR_NZCV,
    // ...
}

You don't need to implement this "Rust enum" change as a part of this PR--I just thought I'd let you know what I was thinking.

@Lancern
Copy link
Author

Lancern commented Aug 21, 2023

Ideally, we would expose actual Rust enums in the high-level bindings.

Actually, I tried this at the beginning when writing the second commit. However, I quickly ran into problems.

Take ARM64 registers as an example. The problem is that there are actually 2 underlying C enums that both represent valid ARM64 reg IDs: arm64_reg and arm64_sysreg. On the Rust side, each architecture can only have one Rust enum that represents an architecture register. Unfortunately, bindgen cannot merge two C enums into one big Rust enum. Also converting Arm64Sysreg into Arm64Reg won't work because the underlying value ranges of Arm64Sysreg and Arm64Reg are different. Thus the problem cannot be solved without considerable manual efforts if we generate Rust enums solely.

On the other side, if we generate newtype enums for these two underlying C enums, we can easily convert an Arm64Sysreg into an Arm64Reg by manually implementing a From:

impl From<Arm64Sysreg> for Arm64Reg {
    fn from(sysreg: Arm64Sysreg): Self {
        Self(sysreg.0)
    }
}

This commit contains changes that make existing examples and benches compile
with the latest API design. Unit tests will be covered in later commits.

The primary changes include:
- Implement DetailArchInsn for ArchDetail.
- Add ArchOperandIterator enum that iterates values of ArchOperand inside an
  ArchDetail. The enum is basically a wrapper that holds all available arch-
  specific operand iterators in different variants.

This commit also contains some refactors and minor changes, including:
- Rename ArchDetail to ArchInsnDetail as the old name is a bit confusing.
- Fully add sysz support.
- Update existing examples and benches. In most cases, to migrate these existing
  examples and benches, I just need to add a small number of generic type
  arguments (e.g. X86ArchTag, ArmArchTag, DynamicArchTag, etc.) and the code
  just compiles.
This commit fixes a minor problem that prevents tests from compiling and running
to success.

Arch-specific register IDs (represented by ArchTag::RegId) and instruction group
IDs (represented by ArchTag::InsnGroupId) are generated as newtype structs whose
inner type is c_uint because their C definition are just C enums. However, in a
cs_detail struct, the regs_read field is an array of u16, not an array of
c_uint. So we cannot just type pun on the underlying array to get &[A::RegId]
because the layout is totally different. The similar problem exists for
InsnDetail::regs_write and InsnDetail::groups.

This commit fixes the problem by making these function return an Iterator rather
than a slice. The iterator will map the underlying array elements to actual
arch-specific types when iterated.
@codecov
Copy link

codecov bot commented Sep 17, 2023

Codecov Report

Merging #143 (7d8ab8c) into master (74ccb09) will decrease coverage by 7.49%.
Report is 1 commits behind head on master.
The diff coverage is 91.01%.

❗ Current head 7d8ab8c differs from pull request most recent head 5cd9258. Consider uploading reports for the commit 5cd9258 to get more accurate results

@@            Coverage Diff             @@
##           master     #143      +/-   ##
==========================================
- Coverage   93.99%   86.51%   -7.49%     
==========================================
  Files          22       18       -4     
  Lines        2733     1379    -1354     
  Branches     2687     1377    -1310     
==========================================
- Hits         2569     1193    -1376     
- Misses        164      186      +22     
Files Coverage Δ
capstone-rs/examples/demo.rs 90.38% <100.00%> (ø)
capstone-rs/examples/parallel.rs 95.23% <100.00%> (ø)
capstone-rs/src/arch/evm.rs 100.00% <ø> (+7.69%) ⬆️
capstone-rs/src/arch/m680x.rs 93.38% <100.00%> (-2.55%) ⬇️
capstone-rs/src/arch/ppc.rs 90.32% <100.00%> (-3.13%) ⬇️
capstone-rs/src/arch/tms320c64x.rs 89.91% <100.00%> (-7.52%) ⬇️
capstone-rs/src/instruction.rs 97.93% <100.00%> (+5.43%) ⬆️
capstone-rs/src/lib.rs 100.00% <ø> (ø)
capstone-rs/src/arch/mips.rs 67.74% <88.88%> (-28.26%) ⬇️
capstone-rs/src/arch/mod.rs 89.55% <75.00%> (-3.25%) ⬇️
... and 7 more

... and 4 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@tmfink
Copy link
Member

tmfink commented Oct 16, 2023

On the other side, if we generate newtype enums for these two underlying C enums, we can easily convert an Arm64Sysreg into an Arm64Reg by manually implementing a From:

Makes sense to me. If we want "handcrafted enums" that are more ergonomic, it would certainly be a separate PR.

@Lancern
Copy link
Author

Lancern commented Oct 16, 2023

I've updated some source code documentation to reflect the new architecture tag. However, I don't add a systematic overview or tutorial about the architecture tag as I'm not sure whether that's necessary. @tmfink Please have a look and see if the documentation can be further improved.

The examples and benches are already up-to-date with the latest API design. I'm checking whether the test coverage can be further improved and after that I belive this PR is ready to be formally reviewed and merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: add generic parameters to types to distinguish different data structures on different architectures
2 participants