Ensure the oxide parser has feature parity with the stable RegEx parser (#11389)

* WIP

* use `parse` instead of `defaultExtractor`

* skip `Vue` describe block

* add a few more dedicated arbitrary values/properties tests

* use parallel parsing

* split up Vue tests

* add some Rust specific tests

* setup parse candidate strings test system

These tests run against both the `Regex`-based and `Rust`-based parsers. We
have groups of classes of various shapes and forms, plus variants,
rendered in various template situations (plain, HTML, Vue, ...)

+ enable all skipped tests

* ensure we also validate the classes with variants

The classes with variants are built in the `templateTable` function, so
we get them out again by using the optional arguments of the `test.each`
callback function.

* cleanup test suite

* add "anti-test" tests

To make sure that we are _not_ parsing out certain values given a
certain input.

* Add ParseAction enum

* Restart parsing following an arbitrary parse failure

* Split variants off before validating the utility part

* Collapse candidate from the end when validation fails

* Support `<` and `>` in variant position

* fix error

* format parser.rs

* Refactor

* Update editorconfig

* wip

* wip

* Refactor

* Refactor

* Simplify

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* run `cargo clippy --fix`

* run `cargo fmt`

* implement `cargo clippy` suggestions

These were not applied using `cargo clippy --fix`

* only allow `.` in the candidate part when surrounded by 0-9

This is only in the candidate part, not the arbitrary part.

* % characters can only appear at the end after digits

* > and < should only be part of variants (start OR end)

These characters can technically appear inside the candidate when we have
stacked variants:
```
dark:<sm:underline
dark:md>:underline
```

* handle parsing utilities within quotes, parens or brackets

* mark `pt-1.5` as an expected value sliced out from `["pt-1.5"]`

* Add cursor abstraction

* wip

* disable the oxideParser if using a custom `prefix` or `separator`

* update tests

* Use cursor abstraction

* Refactor more code toward use of global cursor

* wip

* simplify

* Simplify

* Simplify

* Simplify

* Cleanup

* wip

* Simplify

* wip

* Simplify

* Handle candidates ending with % sign

* Tweak code a bit

* fmt

* Simplify

* Add cursor details to trace

* cargo fmt

* use preferred `zoom-0.5` name instead of `zoom-.5`

* drop over-extracted utilities in oxide parser

The RegEx parser does extract `underline` from

```html
<div class="peer-aria-[labelledby='a_b']:underline"></div>
```
... but that's not needed and is not happening in the oxide parser

This means the expected output differs slightly between the two parsers,
but the checks are explicit and keyed on the feature flag.

* allow extracting variants+utilities inside `{}` for the oxide parser

* characters in candidates such as `group-${id}` should not be allowed

* do not extract any candidate from `w-[foo-bar]w-[bar-baz]`

* ensure we can consume the full candidate and discard it

* Add fast skipping of whitespace

* Use fast skipping whenever possible

* Add fast skipping to benchmark

* Hand-tune to generate more optimized assembly

* Move code around a bit

This makes sure all the fancy SIMD stuff is as early as possible. This results in an extremely minor perf increase.

* Undo tweak

no meaningful perf difference in real world scenarios

* Disable fast skipping for now

It needs to be done in a different spot so it doesn’t affect how things are returned

* Change test names

* Fix normalize config error

* cleanup a bit

* Cleanup

* Extract validation result enum

* Cleanup comments

* Simplify

* Fix formatting

* Run clippy

* wip

* add `md>` under the special characters test set

---------

Co-authored-by: Adam Wathan <4323180+adamwathan@users.noreply.github.com>
Co-authored-by: Jordan Pittman <jordan@cryptica.me>
3 people committed Jun 7, 2023
1 parent e572dc6 commit 55daf8e
Showing 12 changed files with 1,379 additions and 217 deletions.
8 changes: 8 additions & 0 deletions .editorconfig
@@ -7,3 +7,11 @@ end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true

[*.rs]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
12 changes: 12 additions & 0 deletions oxide/crates/core/benches/parse_candidates.rs
@@ -31,6 +31,18 @@ pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("parse_candidate_strings (real world)", |b| {
        b.iter(|| parse(include_bytes!("./fixtures/template-499.html")))
    });

    let mut group = c.benchmark_group("sample-size-example");
    group.sample_size(10);

    group.bench_function("parse_candidate_strings (fast space skipping)", |b| {
        let count = 10_000;
        let crazy1 = format!("{}underline", " ".repeat(count));
        let crazy2 = crazy1.repeat(count);
        let crazy3 = crazy2.as_bytes();

        b.iter(|| parse(black_box(crazy3)))
    });
}

criterion_group!(benches, criterion_benchmark);
159 changes: 159 additions & 0 deletions oxide/crates/core/src/cursor.rs
@@ -0,0 +1,159 @@
use std::{ascii::escape_default, fmt::Display};

#[derive(Debug, Clone)]
pub struct Cursor<'a> {
    // The input we're scanning
    pub input: &'a [u8],

    // The location of the cursor in the input
    pub pos: usize,

    /// Is the cursor at the start of the input
    pub at_start: bool,

    /// Is the cursor at the end of the input
    pub at_end: bool,

    /// The previously consumed character
    /// If `at_start` is true, this will be NUL
    pub prev: u8,

    /// The current character
    pub curr: u8,

    /// The upcoming character (if any)
    /// If `at_end` is true, this will be NUL
    pub next: u8,
}

impl<'a> Cursor<'a> {
    pub fn new(input: &'a [u8]) -> Self {
        let mut cursor = Self {
            input,
            pos: 0,
            at_start: true,
            at_end: false,
            prev: 0x00,
            curr: 0x00,
            next: 0x00,
        };
        cursor.move_to(0);
        cursor
    }

    pub fn rewind_by(&mut self, amount: usize) {
        self.move_to(self.pos.saturating_sub(amount));
    }

    pub fn advance_by(&mut self, amount: usize) {
        self.move_to(self.pos.saturating_add(amount));
    }

    pub fn move_to(&mut self, pos: usize) {
        let len = self.input.len();
        let pos = pos.clamp(0, len);

        self.pos = pos;
        self.at_start = pos == 0;
        self.at_end = pos + 1 >= len;

        self.prev = if pos > 0 { self.input[pos - 1] } else { 0x00 };
        self.curr = if pos < len { self.input[pos] } else { 0x00 };
        self.next = if pos + 1 < len {
            self.input[pos + 1]
        } else {
            0x00
        };
    }
}

impl<'a> Display for Cursor<'a> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let len = self.input.len().to_string();

        let pos = format!("{: >len_count$}", self.pos, len_count = len.len());
        write!(f, "{}/{} ", pos, len)?;

        if self.at_start {
            write!(f, "S ")?;
        } else if self.at_end {
            write!(f, "E ")?;
        } else {
            write!(f, "M ")?;
        }

        fn to_str(c: u8) -> String {
            if c == 0x00 {
                "NUL".into()
            } else {
                format!("{:?}", escape_default(c).to_string())
            }
        }

        write!(
            f,
            "[{} {} {}]",
            to_str(self.prev),
            to_str(self.curr),
            to_str(self.next)
        )
    }
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn test_cursor() {
        let mut cursor = Cursor::new(b"hello world");
        assert_eq!(cursor.pos, 0);
        assert!(cursor.at_start);
        assert!(!cursor.at_end);
        assert_eq!(cursor.prev, 0x00);
        assert_eq!(cursor.curr, b'h');
        assert_eq!(cursor.next, b'e');

        cursor.advance_by(1);
        assert_eq!(cursor.pos, 1);
        assert!(!cursor.at_start);
        assert!(!cursor.at_end);
        assert_eq!(cursor.prev, b'h');
        assert_eq!(cursor.curr, b'e');
        assert_eq!(cursor.next, b'l');

        // Advancing too far should stop at the end
        cursor.advance_by(10);
        assert_eq!(cursor.pos, 11);
        assert!(!cursor.at_start);
        assert!(cursor.at_end);
        assert_eq!(cursor.prev, b'd');
        assert_eq!(cursor.curr, 0x00);
        assert_eq!(cursor.next, 0x00);

        // Can't advance past the end
        cursor.advance_by(1);
        assert_eq!(cursor.pos, 11);
        assert!(!cursor.at_start);
        assert!(cursor.at_end);
        assert_eq!(cursor.prev, b'd');
        assert_eq!(cursor.curr, 0x00);
        assert_eq!(cursor.next, 0x00);

        cursor.rewind_by(1);
        assert_eq!(cursor.pos, 10);
        assert!(!cursor.at_start);
        assert!(cursor.at_end);
        assert_eq!(cursor.prev, b'l');
        assert_eq!(cursor.curr, b'd');
        assert_eq!(cursor.next, 0x00);

        cursor.rewind_by(10);
        assert_eq!(cursor.pos, 0);
        assert!(cursor.at_start);
        assert!(!cursor.at_end);
        assert_eq!(cursor.prev, 0x00);
        assert_eq!(cursor.curr, b'h');
        assert_eq!(cursor.next, b'e');
    }
}
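
The `prev`/`curr`/`next` window lets context-sensitive rules be written as simple byte comparisons. As a minimal sketch (not the parser's actual code), the commit-message rule that `.` is only allowed in the candidate part when surrounded by 0-9 might be checked like this. `Window`, `window_at` and `dot_is_allowed` are hypothetical names introduced for illustration; the boundary handling mirrors `move_to`, with NUL standing in for a missing neighbour.

```rust
/// Hypothetical three-byte window mirroring the `prev`/`curr`/`next`
/// fields that `Cursor::move_to` computes. Not part of the commit.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Window {
    prev: u8,
    curr: u8,
    next: u8,
    at_end: bool,
}

/// Build the window at `pos`, using NUL (0x00) for out-of-range neighbours.
fn window_at(input: &[u8], pos: usize) -> Window {
    let len = input.len();
    let pos = pos.min(len);
    Window {
        prev: if pos > 0 { input[pos - 1] } else { 0x00 },
        curr: if pos < len { input[pos] } else { 0x00 },
        next: if pos + 1 < len { input[pos + 1] } else { 0x00 },
        // Same definition as `move_to`: true on the last byte and beyond.
        at_end: pos + 1 >= len,
    }
}

/// Assumed reading of the rule "only allow `.` when surrounded by 0-9".
fn dot_is_allowed(w: Window) -> bool {
    w.curr == b'.' && w.prev.is_ascii_digit() && w.next.is_ascii_digit()
}
```

Under this reading the `.` in `pt-1.5` passes while the one in `zoom-.5` does not. Note that `at_end` is already true when the cursor sits on the last byte, which is what the `rewind_by(1)` assertions in the test exercise.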
89 changes: 89 additions & 0 deletions oxide/crates/core/src/fast_skip.rs
@@ -0,0 +1,89 @@
use crate::cursor::Cursor;

const STRIDE: usize = 16;
type Mask = [bool; STRIDE];

#[inline(always)]
pub fn fast_skip(cursor: &Cursor) -> Option<usize> {
    // If we don't have enough bytes left to check then bail early
    if cursor.pos + STRIDE >= cursor.input.len() {
        return None;
    }

    if !cursor.curr.is_ascii_whitespace() {
        return None;
    }

    let mut offset = 1;

    // SAFETY: We've already checked (indirectly) that this index is valid
    let remaining = unsafe { cursor.input.get_unchecked(cursor.pos..) };

    // NOTE: This loop uses primitives designed to be auto-vectorized
    // Do not change this loop without benchmarking the results
    // And checking the generated assembly using godbolt.org
    for (i, chunk) in remaining.chunks_exact(STRIDE).enumerate() {
        let value = load(chunk);
        let is_whitespace = is_ascii_whitespace(value);
        let is_all_whitespace = all_true(is_whitespace);

        if is_all_whitespace {
            offset = (i + 1) * STRIDE;
        } else {
            break;
        }
    }

    Some(cursor.pos + offset)
}

#[inline(always)]
fn load(input: &[u8]) -> [u8; STRIDE] {
    let mut value = [0u8; STRIDE];
    value.copy_from_slice(input);
    value
}

#[inline(always)]
fn eq(input: [u8; STRIDE], val: u8) -> Mask {
    let mut res = [false; STRIDE];
    for n in 0..STRIDE {
        res[n] = input[n] == val
    }
    res
}

#[inline(always)]
fn or(a: [bool; STRIDE], b: [bool; STRIDE]) -> [bool; STRIDE] {
    let mut res = [false; STRIDE];
    for n in 0..STRIDE {
        res[n] = a[n] | b[n];
    }
    res
}

#[inline(always)]
fn all_true(a: [bool; STRIDE]) -> bool {
    let mut res = true;
    for item in a.iter().take(STRIDE) {
        res &= item;
    }
    res
}

#[inline(always)]
fn is_ascii_whitespace(value: [u8; STRIDE]) -> [bool; STRIDE] {
    let whitespace_1 = eq(value, b'\t');
    let whitespace_2 = eq(value, b'\n');
    let whitespace_3 = eq(value, b'\x0C');
    let whitespace_4 = eq(value, b'\r');
    let whitespace_5 = eq(value, b' ');

    or(
        or(
            or(or(whitespace_1, whitespace_2), whitespace_3),
            whitespace_4,
        ),
        whitespace_5,
    )
}
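
To see what `fast_skip` returns, it can help to compare it with a plain byte-by-byte loop. The sketch below is a simplified restatement, not the commit's code: `skip_chunked` and `skip_scalar` are hypothetical names, both take a slice and a position instead of a `Cursor`, and they use `u8::is_ascii_whitespace` (which matches the same five bytes: tab, line feed, form feed, carriage return and space) instead of the hand-rolled masks. It makes no claim about auto-vectorization.

```rust
const STRIDE: usize = 16;

/// Chunked skip with the same control flow as `fast_skip`:
/// `None` when too few bytes remain or the current byte is not whitespace,
/// otherwise the position after the last fully-whitespace 16-byte chunk
/// (or `pos + 1` if the first chunk already contains a non-whitespace byte).
fn skip_chunked(input: &[u8], pos: usize) -> Option<usize> {
    if pos + STRIDE >= input.len() || !input[pos].is_ascii_whitespace() {
        return None;
    }

    let mut offset = 1;
    for (i, chunk) in input[pos..].chunks_exact(STRIDE).enumerate() {
        if chunk.iter().all(u8::is_ascii_whitespace) {
            offset = (i + 1) * STRIDE;
        } else {
            break;
        }
    }

    Some(pos + offset)
}

/// Reference loop: position of the first non-whitespace byte at or after `pos`.
fn skip_scalar(input: &[u8], pos: usize) -> usize {
    let run = input[pos..]
        .iter()
        .take_while(|b| b.is_ascii_whitespace())
        .count();
    pos + run
}
```

For 40 spaces followed by `underline`, the chunked version stops at 32 while the scalar loop stops at 40: the chunked skip only advances in whole strides, so a caller would still have to step over the remaining whitespace one byte at a time.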
2 changes: 2 additions & 0 deletions oxide/crates/core/src/lib.rs
@@ -9,6 +9,8 @@ use tracing::event;
use walkdir::WalkDir;

pub mod candidate;
pub mod cursor;
pub mod fast_skip;
pub mod glob;
pub mod location;
pub mod modifier;