Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_mm256_loadu_si256 is failed to be inlined for ABI issues #140

Open
usamoi opened this issue Mar 25, 2024 · 4 comments
Open

_mm256_loadu_si256 is failed to be inlined for ABI issues #140

usamoi opened this issue Mar 25, 2024 · 4 comments

Comments

@usamoi
Copy link

usamoi commented Mar 25, 2024

Inspired by rust-lang/rust#121960, I'm looking for SIMD intrinsics that are not inlined in generated code.

https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/vector.rs#L19C1-L27C59:

/// # Safety
///
/// All methods are not safe since they are intended to be implemented using
/// vendor intrinsics, which are also not safe. Callers must ensure that
/// the appropriate target features are enabled in the calling function,
/// and that the current CPU supports them. All implementations should
/// avoid marking the routines with `#[target_feature]` and instead mark
/// them as `#[inline(always)]` to ensure they get appropriately inlined.
/// (`inline(always)` cannot be used with target_feature.)

It's not fully true: if you do not mark the routines with #[target_feature], LLVM will reject to inline them since it does not know if inlining causes ABI issues. So we need to use both #[target_feature] and #[inline(always)].

I find _mm256_loadu_si256 is failed to be inlined in my project and it also applies to the released cargo binary. I think it's another rustc bug at first but finally objdump leads me here.

Step to reproduce it:
Copy & paste the example in readme.

objdump ./target/release/play_rust -D --demangle | grep "core_arch"
    e0f1:       e8 7a fd 06 00          call   7de70 <core::core_arch::x86::xsave::_xgetbv>
0000000000029f70 <core::ptr::drop_in_place<&aho_corasick::packed::teddy::generic::Mask<core::core_arch::x86::__m128i>>>:
000000000002eba0 <core::ptr::drop_in_place<core::core_arch::x86::__m128i>>:
000000000002ebb0 <core::ptr::drop_in_place<core::core_arch::x86::__m256i>>:
000000000002f0d0 <<core::core_arch::x86::__m128i as core::fmt::Debug>::fmt>:
000000000002f120 <<core::core_arch::x86::__m256i as core::fmt::Debug>::fmt>:
0000000000031810 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,1_usize>>>:
0000000000031820 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,2_usize>>>:
0000000000031830 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,3_usize>>>:
0000000000031840 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,4_usize>>>:
   48651:       e8 7a 06 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48678:       e8 53 06 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   486c6:       e8 05 06 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   486ea:       e8 e1 05 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48744:       e8 87 05 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48762:       e8 69 05 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48861:       e8 6a 04 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48888:       e8 43 04 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   488d4:       e8 f7 03 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   488f2:       e8 d9 03 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   489a5:       e8 26 03 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   489c3:       e8 08 03 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48a51:       e8 7a 02 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48a78:       e8 53 02 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48ac6:       e8 05 02 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48aea:       e8 e1 01 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48b44:       e8 87 01 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48b68:       e8 63 01 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48bc2:       e8 09 01 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   48be0:       e8 eb 00 00 00          call   48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
0000000000048cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>:
000000000007de70 <core::core_arch::x86::xsave::_xgetbv>:
@BurntSushi
Copy link
Owner

It's not fully true: if you do not mark the routines with #[target_feature], LLVM will reject to inline them since it does not know if inlining causes ABI issues. So we need to use both #[target_feature] and #[inline(always)].

What part isn't "fully" true? You conclude by saying that both #[target_feature] and #[inline(always)] should be used, but that not only directly contradicts the comment you quoted, rustc will reject it.

LLVM will reject to inline them since it does not know if inlining causes ABI issues

It does though. Because the caller of these functions has to uphold the safety contract that it's calling them from within a context where the appropriate target features are enabled.

Copy & paste the example in readme.

This is not a reproduction. You didn't include the specific steps you went through to produce the executable. When you give repro steps, you should provide every step with the commands you're running. Like this:

$ mkdir -p i140

$ cat Cargo.toml
[package]
publish = false
name = "i140"
version = "0.1.0"
edition = "2021"

[dependencies]
aho-corasick = "1.1.3"
anyhow = "1.0.81"

[[bin]]
name = "i140"
path = "main.rs"

$ cat main.rs
use aho_corasick::{AhoCorasick, PatternID};

fn main() {
    let patterns = &["apple", "maple", "Snapple"];
    let haystack = "Nobody likes maple in their apple flavored Snapple.";

    let ac = AhoCorasick::new(patterns).unwrap();
    let mut matches = vec![];
    for mat in ac.find_iter(haystack) {
        matches.push((mat.pattern(), mat.start(), mat.end()));
    }
    assert_eq!(
        matches,
        vec![
            (PatternID::must(1), 13, 18),
            (PatternID::must(0), 28, 33),
            (PatternID::must(2), 43, 50),
        ]
    );
}

$ cargo build --release
   Compiling anyhow v1.0.81
   Compiling memchr v2.7.1
   Compiling aho-corasick v1.1.3
   Compiling i140 v0.1.0 (/home/andrew/tmp/issues/aho-corasick/i140)
    Finished `release` profile [optimized] target(s) in 1.86s

Okay, and now I can try running your command:

$ objdump ./target/release/i140 -D --demangle | grep "core_arch"
    e0f1:       e8 7a fd 06 00          call   7de70 <core::core_arch::x86::xsave::_xgetbv>
0000000000029f70 <core::ptr::drop_in_place<&aho_corasick::packed::teddy::generic::Mask<core::core_arch::x86::__m128i>>>:
000000000002eba0 <core::ptr::drop_in_place<core::core_arch::x86::__m128i>>:
000000000002ebb0 <core::ptr::drop_in_place<core::core_arch::x86::__m256i>>:
000000000002f0d0 <<core::core_arch::x86::__m128i as core::fmt::Debug>::fmt>:
000000000002f120 <<core::core_arch::x86::__m256i as core::fmt::Debug>::fmt>:
   31541:       e8 7a 06 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31568:       e8 53 06 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   315b4:       e8 07 06 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   315d2:       e8 e9 05 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31685:       e8 36 05 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   316a3:       e8 18 05 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31731:       e8 8a 04 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31758:       e8 63 04 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   317a6:       e8 15 04 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   317ca:       e8 f1 03 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31824:       e8 97 03 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31842:       e8 79 03 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31941:       e8 7a 02 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31968:       e8 53 02 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   319b6:       e8 05 02 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   319da:       e8 e1 01 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31a34:       e8 87 01 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31a58:       e8 63 01 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31ab2:       e8 09 01 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
   31ad0:       e8 eb 00 00 00          call   31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
0000000000031bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>:
0000000000031ef0 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,1_usize>>>:
0000000000031f00 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,2_usize>>>:
0000000000031f10 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,3_usize>>>:
0000000000031f20 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,4_usize>>>:
000000000007de70 <core::core_arch::x86::xsave::_xgetbv>:

I don't really know what to make of this. It does seem to suggest that it isn't getting inline somewhere, but it doesn't say where. And in particular, if this were a systemic problem, something like "load a vector from memory" not being inlined would be very apparent on a profile. Indeed, in profiling, the routine is marked as inlined:

inlined

I got that profile by using perf on Linux after building and running aho-corasick-debug (found in the root of this project's repository):

$ cargo install --path aho-corasick-debug
$ aho-corasick-debug patterns quarter.txt --match-kind leftmost-first

You can use whatever kind of haystack you want for this. Just make it a big one. In my case, quarter.txt is 3GB and is generated from the OpenSubtitles 2018 data set. You can get the full data file here: https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz.

One possibility is that this function (via the generic Vector::load_unaligned) is being used somewhere where the relevant target feature attributes haven't been properly applied. Perhaps when building the Teddy searcher. In that case, LLVM wouldn't inline it (because of the ABI problem). The function will still work correctly AIUI, but it won't be inlined. And since it would only happen once at construction time, its performance overhead would be non-existent and thus hard to see.

So... what made you report this issue? Are you hitting a real problem? Or are you fishing for one? If the former, please share the real problem you're hitting. If the latter, I'm happy to have this fixed, but it isn't something I'm keen on spending time on.

@usamoi
Copy link
Author

usamoi commented Mar 25, 2024

One possibility is that this function (via the generic Vector::load_unaligned) is being used somewhere where the relevant target feature attributes haven't been properly applied. Perhaps when building the Teddy searcher. In that case, LLVM wouldn't inline it (because of the ABI problem). The function will still work correctly AIUI, but it won't be inlined. And since it would only happen once at construction time, its performance overhead would be non-existent and thus hard to see.

Inlining fails on https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/teddy/generic.rs#L1254. Since you think it's not a problem, I'm closing this issue.

So... what made you report this issue? Are you hitting a real problem? Or are you fishing for one?

I think I'm just reporting a potential problem since I'm not a user of this package. If you call it fishing, I'm fishing.

@usamoi usamoi closed this as completed Mar 25, 2024
@BurntSushi
Copy link
Owner

To be clear, fishing is fine. It's just important context to understand what is driving things here.

I'll re-open this. I didn't mean to say it wasn't a problem, just not one that I'll like prioritize myself to fix.

@BurntSushi BurntSushi reopened this Mar 25, 2024
@BurntSushi
Copy link
Owner

Inlining fails on https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/teddy/generic.rs#L1254.

Yeah indeed. It looks like going through array.map(...) fouls things up here. Sigh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants