Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profile runtime and optimize performance #31

Closed
williballenthin opened this issue Aug 3, 2023 · 9 comments
Closed

profile runtime and optimize performance #31

williballenthin opened this issue Aug 3, 2023 · 9 comments

Comments

@williballenthin
Copy link
Contributor

GoReSym is pretty slow and this makes it difficult to deploy at a large scale. Despite being written in Go and compiled to native code, it may take seconds or longer to process input files. For example, to run the entire test suite, it takes many minutes.

Given that Go programs can execute in a split second and their runtime uses all the metadata that GoReSym parses, it should be possible for GoReSym to run in a split second too, at least in the common case. (I understand that in obfuscated binaries that some scans may take a bit longer).

We should profile GoReSym to identify the slow parts of the algorithm and optimize their implementation. Our goal should be for GoReSym to finish within 0.10s for the common case.

@williballenthin
Copy link
Contributor Author

williballenthin commented Aug 3, 2023

to profile using the tests:

go get github.com/pkg/profile

and then add the following to TestAllVersions:

defer profile.Start(profile.ProfilePath(".")).Stop()

via: https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz

@williballenthin
Copy link
Contributor Author

here's cpu.pprof from running tests on #30
cpu.zip

@williballenthin ➜ /workspaces/GoReSym (fix/issue-29) $ pprof -cum -top cpu.pprof 
File: GoReSym.test
Type: cpu
Time: Aug 3, 2023 at 9:29pm (UTC)
Duration: 352.86s, Total samples = 354.07s (100.34%)
Showing nodes accounting for 331.39s, 93.59% of 354.07s total
Dropped 581 nodes (cum <= 1.77s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    335.93s 94.88%  testing.tRunner
         0     0%     0%    335.91s 94.87%  github.com/mandiant/GoReSym.TestAllVersions.func1
         0     0%     0%    335.91s 94.87%  github.com/mandiant/GoReSym.main_impl
         0     0%     0%    295.12s 83.35%  github.com/mandiant/GoReSym/objfile.(*Entry).PCLineTable
         0     0%     0%    295.12s 83.35%  github.com/mandiant/GoReSym/objfile.(*File).PCLineTable (inline)
         0     0%     0%    289.02s 81.63%  github.com/mandiant/GoReSym/objfile.findModuleInitPCHeader
         0     0%     0%    288.37s 81.44%  rsc.io/binaryregexp.(*Regexp).FindAllIndex
         0     0%     0%    288.37s 81.44%  rsc.io/binaryregexp.(*Regexp).allMatches
         0     0%     0%    288.37s 81.44%  rsc.io/binaryregexp.(*Regexp).doExecute
    33.40s  9.43%  9.43%    288.28s 81.42%  rsc.io/binaryregexp.(*machine).match
    79.41s 22.43% 31.86%    183.96s 51.96%  rsc.io/binaryregexp.(*machine).step
         0     0% 31.86%    169.58s 47.89%  github.com/mandiant/GoReSym/objfile.(*elfFile).pcln
         0     0% 31.86%    169.57s 47.89%  github.com/mandiant/GoReSym/objfile.(*elfFile).pcln_scan
   131.32s 37.09% 68.95%    150.59s 42.53%  rsc.io/binaryregexp.(*machine).add
         0     0% 68.95%     95.43s 26.95%  github.com/mandiant/GoReSym/objfile.(*machoFile).pcln
         0     0% 68.95%     95.43s 26.95%  github.com/mandiant/GoReSym/objfile.(*machoFile).pcln_scan
         0     0% 68.95%     29.15s  8.23%  github.com/mandiant/GoReSym/objfile.(*peFile).pcln
         0     0% 68.95%     29.14s  8.23%  github.com/mandiant/GoReSym/objfile.(*peFile).pcln_scan
         0     0% 68.95%     28.55s  8.06%  github.com/mandiant/GoReSym/objfile.(*Entry).ModuleDataTable
         0     0% 68.95%     28.55s  8.06%  github.com/mandiant/GoReSym/objfile.(*File).ModuleDataTable
    27.90s  7.88% 76.83%     27.90s  7.88%  runtime.memmove
         0     0% 76.83%     19.57s  5.53%  github.com/mandiant/GoReSym/debug/elf.(*File).DataAfterSection
     0.04s 0.011% 76.84%     19.01s  5.37%  runtime.systemstack
         0     0% 76.84%     18.39s  5.19%  github.com/mandiant/GoReSym/objfile.(*elfFile).moduledata_scan
     0.06s 0.017% 76.86%     17.52s  4.95%  runtime.growslice
     1.96s  0.55% 77.41%     14.10s  3.98%  rsc.io/binaryregexp/syntax.(*Inst).MatchRune (inline)
     0.75s  0.21% 77.62%     13.27s  3.75%  runtime.mallocgc
    12.13s  3.43% 81.05%     12.14s  3.43%  rsc.io/binaryregexp/syntax.(*Inst).MatchRunePos
     0.01s 0.0028% 81.05%     11.58s  3.27%  github.com/mandiant/GoReSym/objfile.(*Entry).ParseTypeLinks
         0     0% 81.05%     11.58s  3.27%  github.com/mandiant/GoReSym/objfile.(*File).ParseTypeLinks (inline)
     0.02s 0.0056% 81.06%     11.33s  3.20%  github.com/mandiant/GoReSym/objfile.(*Entry).ParseType
     0.01s 0.0028% 81.06%     11.21s  3.17%  runtime.gcBgMarkWorker
     0.14s  0.04% 81.10%     11.18s  3.16%  github.com/mandiant/GoReSym/objfile.(*Entry).ParseType_impl
     0.02s 0.0056% 81.11%     10.63s  3.00%  os.(*File).ReadAt
     0.03s 0.0085% 81.11%     10.61s  3.00%  os.(*File).pread (inline)
     0.04s 0.011% 81.13%     10.58s  2.99%  internal/poll.(*FD).Pread
     0.07s  0.02% 81.14%     10.58s  2.99%  io.(*SectionReader).ReadAt
     0.04s 0.011% 81.16%     10.35s  2.92%  syscall.Syscall6
         0     0% 81.16%     10.34s  2.92%  syscall.Pread (inline)
         0     0% 81.16%     10.34s  2.92%  syscall.pread
     9.87s  2.79% 83.94%      9.87s  2.79%  runtime/internal/syscall.Syscall6
     0.01s 0.0028% 83.95%      9.81s  2.77%  syscall.RawSyscall6
     0.01s 0.0028% 83.95%      9.69s  2.74%  runtime.gcBgMarkWorker.func2
     0.01s 0.0028% 83.95%      9.68s  2.73%  github.com/mandiant/GoReSym/debug/macho.(*File).DataAfterSection
     0.05s 0.014% 83.97%      9.68s  2.73%  runtime.gcDrain
     0.01s 0.0028% 83.97%      9.44s  2.67%  github.com/mandiant/GoReSym/objfile.(*machoFile).moduledata_scan
     3.57s  1.01% 84.98%      8.42s  2.38%  runtime.scanobject
     0.04s 0.011% 84.99%      7.46s  2.11%  runtime.makeslice
     7.06s  1.99% 86.98%      7.07s  2.00%  rsc.io/binaryregexp.(*inputBytes).step
     6.81s  1.92% 88.91%      6.81s  1.92%  runtime.memclrNoHeapPointers
     6.67s  1.88% 90.79%      6.72s  1.90%  rsc.io/binaryregexp.(*machine).alloc (inline)
     0.02s 0.0056% 90.80%      5.74s  1.62%  github.com/mandiant/GoReSym/debug/elf.(*Section).Data
     0.08s 0.023% 90.82%      4.72s  1.33%  github.com/mandiant/GoReSym/objfile.(*Entry).readRTypeName
         0     0% 90.82%      4.70s  1.33%  runtime.markroot
     0.06s 0.017% 90.84%      4.21s  1.19%  runtime.deductAssistCredit
         0     0% 90.84%      4.15s  1.17%  runtime.gcAssistAlloc
         0     0% 90.84%      3.79s  1.07%  runtime.gcAssistAlloc.func1
         0     0% 90.84%      3.79s  1.07%  runtime.gcAssistAlloc1
     0.08s 0.023% 90.86%      3.75s  1.06%  runtime.gcDrainN
         0     0% 90.86%      3.60s  1.02%  runtime.memclrNoHeapPointersChunked
     0.09s 0.025% 90.88%      3.31s  0.93%  github.com/mandiant/GoReSym/objfile.(*peFile).read_memory
     0.01s 0.0028% 90.89%      3.19s   0.9%  github.com/mandiant/GoReSym/debug/macho.(*Section).Data
     0.12s 0.034% 90.92%      3.01s  0.85%  runtime.bgsweep
         0     0% 90.92%         3s  0.85%  runtime.bgscavenge
     0.03s 0.0085% 90.93%      2.99s  0.84%  runtime.(*scavengerState).run
         0     0% 90.93%      2.96s  0.84%  runtime.(*scavengerState).init.func2
     2.95s  0.83% 91.76%      2.95s  0.83%  rsc.io/binaryregexp.newLazyFlag (inline)
     1.84s  0.52% 92.28%      2.95s  0.83%  runtime.findObject
         0     0% 92.28%      2.94s  0.83%  runtime.(*pageAlloc).scavenge
     0.01s 0.0028% 92.28%      2.93s  0.83%  runtime.(*pageAlloc).scavenge.func1
         0     0% 92.28%      2.92s  0.82%  runtime.(*pageAlloc).scavengeOne
         0     0% 92.28%      2.86s  0.81%  runtime.markroot.func1
     0.01s 0.0028% 92.29%      2.80s  0.79%  runtime.sysUnused
     0.02s 0.0056% 92.29%      2.79s  0.79%  runtime.sysUnusedOS
     2.77s  0.78% 93.07%      2.77s  0.78%  runtime.madvise
     0.04s 0.011% 93.09%      2.75s  0.78%  github.com/mandiant/GoReSym/objfile.(*elfFile).read_memory
     0.18s 0.051% 93.14%      2.33s  0.66%  runtime.suspendG
     0.04s 0.011% 93.15%      2.26s  0.64%  io.ReadAtLeast
         0     0% 93.15%      2.26s  0.64%  io.ReadFull (inline)
         0     0% 93.15%      2.18s  0.62%  io.(*SectionReader).Read
     1.33s  0.38% 93.52%      1.92s  0.54%  runtime.scanblock
     0.24s 0.068% 93.59%      1.88s  0.53%  bytes.Index
     0.01s 0.0028% 93.59%      1.85s  0.52%  runtime.gcMarkDone
         0     0% 93.59%      1.79s  0.51%  runtime.markrootBlock

@stevemk14ebr
Copy link
Collaborator

Leaving this open for future optimization work. The kubectl binary is a great test case, it's massive and thoroughly exercises most code paths when run with -t -p -d flags.

@stevemk14ebr
Copy link
Collaborator

stevemk14ebr commented Aug 16, 2023

Easier way to profile use https://www.speedscope.app/ and generate profiles by just importing _ "net/http/pprof" at the top of a package and run with:

go test -run TestBig -cpuprofile cpu.txt

Runtime is still dominated by scanning for moduledata but also the time it takes to parse types via ParseType_impl. Can probably minimize the moduledata scans a bit with some algorithm adjustments to the recovery. We're probably close to scanning through memory as fast as possible. I was able to see it's using bytealg.Index which is an AVX2 hand rolled assembly routine - we hit the fast path as long as our needle is < 64 bytes which we always will be.

@stevemk14ebr
Copy link
Collaborator

Unreasonable memory consumption leading to OOM in sample https://github.com/EdgeGuardP/EdgeGuard-Stealer/releases/tag/EdgeStealer4.0

@virusdefender
Copy link
Contributor

the regex convert from yara is too slow, maybe we can convert it to rangel patterns

https://www.colm.net/open-source/ragel/

@stevemk14ebr
Copy link
Collaborator

@virusdefender is it specifically the conversion routine that is slow or more generally the memory scanning? Can you show some benchmarks that demonstrate the performance issue perhaps?

@virusdefender
Copy link
Contributor

virusdefender commented Sep 17, 2023

@virusdefender is it specifically the conversion routine that is slow or more generally the memory scanning? Can you show some benchmarks that demonstrate the performance issue perhaps?

using the pprof result in #31 (comment)
image

it shows the regex match is slow

@stevemk14ebr
Copy link
Collaborator

@virusdefender that pprof result is out of date, if you can profile the application on the current master you may find it's faster. If not, please provide some data suggesting where else we may be missing optimizations :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants