Skip to content

slothy-optimizer/slothy

Repository files navigation

SLOTHY: Assembly optimization via constraint solving

About SLOTHY

SLOTHY - Super (Lazy) Optimization of Tricky Handwritten assemblY - is an assembly-level superoptimizer for:

  1. Instruction scheduling
  2. Register allocation
  3. Software pipelining (= periodic loop interleaving)

SLOTHY is generic in the target architecture and microarchitecture. This repository provides instantiations for:

  • Armv8.1-M+Helium: Cortex-M55, Cortex-M85
  • AArch64: Cortex-A55, and experimentally Cortex-A72, Cortex-X/Neoverse-V, Apple M1 (Firestorm, Icestorm)

SLOTHY is discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.

Goal

SLOTHY enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific micro-optimizations. This accelerates development, keeps manually written code artifacts maintainable, and allows to split efforts for formal verification into the separate verification of the clean code and the micro-optimizations.

How it works

SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop interleavings. The model is passed to an external constraint solver and, upon success, a satisfying assignment converted back into the final code. Currently, SLOTHY uses Google OR-Tools as its constraint solver backend.

Performance

As a rough rule of thumb, SLOTHY typically optimizes workloads of <50 instructions in seconds to minutes, workloads up to 150 instructions in minutes to hours, while for larger kernels some heuristics are necessary.

Applications

SLOTHY has been used to provide the fastest known implementations of various cryptographic and DSP primitives: For example, the SLOTHY paper discusses the NTTs underlying ML-KEM and ML-DSA for Cortex-{A55, A72, M55, M85}, the FFT for Cortex-{M55,M85}, and the X25519 scalar multiplication for Cortex-A55. You find the clean and optimized source code for those examples in paper/.

Getting started

Have a look at the SLOTHY tutorial for a hands-on and example-based introduction to SLOTHY.

Real world uses

Installation

Requirements

SLOTHY has been successfully used on

  • Ubuntu-21.10 and up (64-bit),
  • macOS Monterey 12.6 and up.

SLOTHY requires Python >= 3.10. See requirements.txt for package requirements, and install via pip install -r requirements.txt.

Note: requirements.txt pins versions for reproducibility. If you already have newer versions of some dependencies installed and don't want them downgraded, consider using a virtual environment:

python3 -m venv venv
./venv/bin/python3 -m pip install -r requirements.txt

Then, enter the virtual environment via source venv/bin/activate prior to running SLOTHY.

Docker

A dockerfile for an Ubuntu-22.04 based Docker image with all dependencies of SLOTHY and the PQMX+PQAX test environments setup can be found in paper/artifact/slothy.dockerfile. See paper/artifact/README.md for instructions.

Quick check

To check that your setup is complete, try the following from the base directory:

% python3 example.py --examples aarch64_simple0_a55

You should see something like the following:

* Example: aarch64_simple0_a55...
INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0721s]: Found 1 solutions so far... objective 19.0, bound 8.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0765s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 509
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18

Examples

The SLOTHY Tutorial and the examples directory contain numerous exemplary assembly snippets. To try them, use python3 example.py --examples={YOUR_EXAMPLE}. See python3 example.py --help for the list of all available examples.

The use of SLOTHY from the command line is illustrated in scripts/ supporting the real-world optimizations for the NTT, FFT and X25519 discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.