[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

lsf37 · 2015-02-15T11:06:08Z

Reported by briansmith on 2006-10-29 06:30 UTC
My suggestion is simple:

Everywhere that the code generate currently generates zzBuffer[xxxx], it should instead generate
zzCharAt(xxxx).
The standard skeletons define zzCharAt to be:

private char zzCharAt(int index) {
  return zzBuffer[index];
}

The benefit is that, once this is done, then user-defined skeletons can totally replace zzBuffer with any data structure they choose. In particular, they can replace it with a CharSequence.

The JetBrains team has already created such a modified version of JFlex which they recommend people to use to implement on-the-fly lexical analyzis for syntax highlighting and other in-editor uses. Please see the "Implementing a Lexer" section of http://www.jetbrains.com/idea/documentation/idea\_5.0.html
for information.

My suggested variation allows for the same functionality as theirs, while remaining compatible with older JDK versions (without CharSequence). I believe, but haven't verified, that modern JVm's should have no problems inlining zzCharAt() to result in minimal performance impact. At least in my application, performance wasn't impacted.

I will attach a patch in diff -u format.

The text was updated successfully, but these errors were encountered:

lsf37 · 2015-02-15T11:15:19Z

Commented by briansmith on 2006-10-29 06:30 UTC
Patch in diff -u format

jvolkman · 2015-10-22T19:06:33Z

It looks like the original suggestion has become more difficult now that Emitter is calling Character methods that assume a char array (i.e., Character.offsetByCodePoints, Character.codePointAt).

An alternative approach is to modify JFlex's Emitter to require a CharSequence, and update the default skeletons to wrap an incoming char array in a lightweight CharSequence implementation backed by the array. A quick benchmark shows minimal difference in performance between array[n] and sequence.charAt(n).

import java.util.Random;

public class Test {
  static Random random = new Random();
  private static char[] createCharacters() {
    char[] characters = new char[100000];
    for (int i = 0; i < characters.length; i++) {
      characters[i] = (char)('a' + (i % 26));
    }
    return characters;
  }
  private static CharSequence createCharSequence(final char[] array) {
    return createCharSequence(array, 0, array.length);
  }
  private static CharSequence createCharSequence(final char[] array, final int min, final int max) {
    return new CharSequence() {
      @Override
      public int length() {
        return max - min;
      }
      @Override
      public char charAt(int index) {
        return array[min + index];
      }
      @Override
      public CharSequence subSequence(int start, int end) {
        return createCharSequence(array, min + start, min + end);
      }
    };
  }
  private static char testCharArray(char[] array) {
    char c = '\0';
    for (int i = 0; i < 10000000; i++) {
      c = array[random.nextInt(array.length)];
    }
    return c;
  }
  private static char testCharSequence(CharSequence sequence) {
    char c = '\0';
    for (int i = 0; i < 10000000; i++) {
      c = sequence.charAt(random.nextInt(sequence.length()));
    }
    return c;
  }
  private static void runTest(char[] array) {
    CharSequence sequence = createCharSequence(array);
    long start = System.currentTimeMillis();
    testCharArray(array);
    System.out.printf("ARRAY took %d ms\n", System.currentTimeMillis() - start);
    start = System.currentTimeMillis();
    testCharSequence(sequence);
    System.out.printf("SEQUENCE took %d ms\n", System.currentTimeMillis() - start);
  }
  public static void main(String[] args) {
    char[] array = createCharacters();
    for (int i = 0; i < 100; i++) {
      runTest(array);
    }
  }
}

Output snippet:

ARRAY took 98 ms
SEQUENCE took 101 ms
ARRAY took 98 ms
SEQUENCE took 100 ms
ARRAY took 99 ms
SEQUENCE took 101 ms
ARRAY took 100 ms
SEQUENCE took 100 ms
ARRAY took 99 ms
SEQUENCE took 100 ms

lsf37 · 2015-10-26T22:53:37Z

The numbers are encouraging. I think we still need to run it in the JFlex context (i.e. a full scanner loop) to make sure, because it might influence cache-locality etc, which can have surprising effects, but it looks much better than I feared it might.

VladRassokhin · 2016-02-23T07:58:53Z

It seems @jvolkman benchmark spends almost all time in Random#nextInt(int).
I've recreated similar benchmark using JMH, but with pre-computed random indices array and intermediate results are:

Benchmark                       (size)  Mode  Cnt        Score       Error  Units
MyBenchmark.testCharArray     10000000  avgt   50  3999587.111 ± 43656.101  ns/op
MyBenchmark.testCharSequence  10000000  avgt   50  4591648.902 ± 66154.991  ns/op

In ms: 4.09 vs 4.59. Will try to investigate more.
Maybe this numbers are ok (don't think so) and JFlex have performance issues in other places (as any other software) ;)

regisd · 2020-04-01T21:47:30Z

I think we should prioritize this, because most of the lexers I see are "generated by JFlex 1.7.0 tweaked for IntelliJ platform" because they are IntellJ plugins.

lsf37 · 2020-04-02T00:39:21Z

Happy to put this at the top of the list, although I don't think makes sense as a standard setting. Going from 4.0 to 4.6 is significant slowdown for one specific application.

We could provide it as option, though, for when flexibility matters more than performance (like here).

I do not think that IntelliJ lexers are anywhere close to the majority of applications, they are just prominent, which is fine. Very happy to support them in any case, it shouldn't be necessary to tweak the generator to use the lexers.

VladRassokhin · 2020-04-13T09:21:11Z

I'd suggest not to rely on my benchmark because since then everything changed a lot, new JDKs released, they may behave differently and moreover it tests only simple completely out-of-the-real-world use case. Exactlyfor one specific application as you said.
My point was to show that another person's benchmark was completely wrong.

lsf37 · 2020-04-13T11:55:59Z

Good point, we should definitely add this to the benchmark suite and see what it does in context. If it's provided as an option, I'm less worried about the performance impact -- developers can then make the trade-off themselves. It's just if it's an interface change for everyone that we should be more careful.

lsf37 changed the title ~~replace zzBuffer[x] with a zzChar(x)~~ [Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] Feb 15, 2015

lsf37 added the enhancement Feature requests label Feb 15, 2015

jvolkman mentioned this issue Feb 11, 2016

JFlex-1.4.x is ancient JetBrains/Grammar-Kit#89

Closed

regisd added the performance label Nov 4, 2017

regisd added this to TODO in JFlex core Sep 10, 2018

matklad mentioned this issue May 14, 2019

Add experimental Str API rust-lang/rust-analyzer#1275

Closed

regisd added this to the 1.9.0 milestone Apr 1, 2020

regisd added this to To do in Performance Jan 2, 2021

regisd removed the performance label Jan 2, 2021

lsf37 modified the milestones: 1.9.0, 1.10.0 Dec 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

lsf37 commented Feb 15, 2015 •

edited by regisd

lsf37 commented Feb 15, 2015

jvolkman commented Oct 22, 2015

lsf37 commented Oct 26, 2015

VladRassokhin commented Feb 23, 2016

regisd commented Apr 1, 2020

lsf37 commented Apr 2, 2020

VladRassokhin commented Apr 13, 2020

lsf37 commented Apr 13, 2020

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

[Feature] replace zzBuffer[x] with a zzChar(x) [sf#18] #153

Comments

lsf37 commented Feb 15, 2015 • edited by regisd

lsf37 commented Feb 15, 2015

jvolkman commented Oct 22, 2015

lsf37 commented Oct 26, 2015

VladRassokhin commented Feb 23, 2016

regisd commented Apr 1, 2020

lsf37 commented Apr 2, 2020

VladRassokhin commented Apr 13, 2020

lsf37 commented Apr 13, 2020

lsf37 commented Feb 15, 2015 •

edited by regisd