
Use PGO and G1GC in the native image #1

Closed
lobaorn opened this issue Feb 12, 2024 · 25 comments

Comments

lobaorn commented Feb 12, 2024

The idea is to do an instrumented build, run Gatling against the instrumented application, then build again using the instrumentation data to produce the optimized version, and publish the optimized version to your Docker Hub for the competition. That's basically it.

Links:


lobaorn commented Feb 12, 2024

Here is the specific point in @alina-yur's talk about the necessary build args: https://youtu.be/8umoZWj6UcU?si=XJBhf9n8N9HN2Drr&t=2130

Which are:

--enable-monitoring=all --pgo=default.iprof --gc=G1

But in this case you would first need to build with --pgo-instrument, run Gatling against the native image, get the default.iprof, and then build again without --pgo-instrument.

I guess that is it.


lobaorn commented Feb 12, 2024

Oops, here is a resource dedicated to exactly this:

https://github.com/alina-yur/native-spring-boot


lobaorn commented Feb 12, 2024

By the way, no virtual threads enabled? I found that curious.

spring.threads.virtual.enabled=true

I suppose it would bring benefits as well, independent of PGO and the rest.

@rodrigorodrigues

Hey @lobaorn, thanks for the tip. I applied it here and it generated a huge file; then I compiled again with the optimized version, but I didn't see much difference in the results. Do you want to try a test on your side?

before
before.tar.gz

[screenshot]

after
after.tar.gz

[screenshot]


lobaorn commented Feb 12, 2024

Man, I'll take a look when I can, but from the looks of it the result actually got much worse, no? I'd say that in theory it's only running the instrumented version, not the optimized one. Maybe it was just an issue with the directory/path of the .iprof file for the optimized compilation, because the result really is a contradiction.

@rodrigorodrigues

I generated a new version with a Dockerfile. I think I'm doing something wrong, so I asked @alina to see if she can help.

new.tar.gz

[screenshot]


lobaorn commented Feb 13, 2024

Just a reminder: it's @alina-yur, not @alina. (I'm on my phone today, so I'll probably only look tomorrow.)


alina-yur commented Feb 13, 2024

hi folks @lobaorn @rodrigorodrigues, glad it helps!

So the steps to build an optimized executable are the following:

  1. Build an instrumented app:
    native-image --pgo-instrument MyApp
  2. Run the instrumented executable and apply workloads to it to generate a profile file:
    ./myapp
  3. Build an optimized executable:
    native-image --pgo=default.iprof MyApp

In step 3 you can either directly specify the profile file with --pgo=default.iprof (--pgo=${project.basedir}/default.iprof), or just pass the --pgo flag, which will pick up the profile file from a default location. But make sure you pass one of those forms of the PGO flag so that Native Image is instructed to build an optimized image. You can verify that it happened by checking for "PGO: user-provided" at the beginning of the Native Image build output.
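Put together as one script, the three steps above might look like this (the app name, output names, and the workload invocation are illustrative placeholders, not this project's actual commands):

```shell
# 1. Build an instrumented executable (MyApp and the -o names are placeholders)
native-image --pgo-instrument -o myapp-instrumented MyApp

# 2. Run it, apply a representative workload, then stop it gracefully;
#    shutdown writes default.iprof into the working directory
./myapp-instrumented &
APP_PID=$!
# ... run the Gatling scenario against the instrumented app here ...
kill -15 "$APP_PID"

# 3. Rebuild using the collected profile to produce the optimized executable
native-image --pgo=default.iprof -o myapp-optimized MyApp
```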


rodrigorodrigues commented Feb 13, 2024

Hi @alina-yur, thanks for your message. I was able to generate an optimized executable, but I was wondering how I can containerise it using mvn -P native spring-boot:build-image. Should default.iprof be inside the container? I couldn't see much difference after running the Gatling performance test with the optimized version, so I guess I did something wrong.

Thanks,
Kind Regards,
Rodrigo

@alina-yur

Replied in alina-yur/native-spring-boot#1 (comment) :)

The thing with PGO is to make sure you apply relevant workloads for a sufficient time and emulate expected application behavior closely enough, so that the app can collect correct and complete profiles. But in general PGO should always give some performance gain even if the profiles aren't complete/great. Also make sure, as I mentioned above, that the profiles are being picked up:

[screenshot of the Native Image build output]

@rodrigorodrigues

Hi @alina-yur ,

I couldn't find this extra "PGO: user-provided" in the logs using mvn clean package -Pnative spring-boot:build-image; maybe I'm still doing something wrong.

[screenshot]

pom.xml

<build>
        <plugins>
            <plugin>
                <groupId>org.graalvm.buildtools</groupId>
                <artifactId>native-maven-plugin</artifactId>
                <configuration>
                    <buildArgs>
                        <buildArg>--pgo=default.iprof</buildArg>
                        <buildArg>--gc=G1</buildArg>
                    </buildArgs>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

Using the other command mvn -Pnative native:compile you suggested in your repo, I can see it in the logs, but then no Docker image is generated, just the native executable in the target folder.

[screenshot]


lobaorn commented Feb 13, 2024

Hey there, I am still away from my computer, but to try and help:

The native: command invokes the native plugin, while spring-boot: invokes the Boot plugin with the native profile. Either it does not work with PGO, or it also needs the same buildArgs passed into its configuration section.

Anyway, did the PGO version get better results? Maybe multiple runs of the test could improve things a little further through the profiling.
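One way to drive both builds from Maven is a pair of profiles on the native-maven-plugin, one passing --pgo-instrument and one passing --pgo — a sketch based on the pom.xml shown earlier in this thread, not something verified against this project:

```xml
<profiles>
    <profile>
        <id>instrumented</id>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.graalvm.buildtools</groupId>
                    <artifactId>native-maven-plugin</artifactId>
                    <configuration>
                        <buildArgs>
                            <buildArg>--pgo-instrument</buildArg>
                        </buildArgs>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </profile>
    <profile>
        <id>optimized</id>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.graalvm.buildtools</groupId>
                    <artifactId>native-maven-plugin</artifactId>
                    <configuration>
                        <buildArgs>
                            <!-- profile file produced by running the instrumented build -->
                            <buildArg>--pgo=${project.basedir}/default.iprof</buildArg>
                            <buildArg>--gc=G1</buildArg>
                        </buildArgs>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </profile>
</profiles>
```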

@alina-yur

@rodrigorodrigues do you know if the profile file gets generated?


rodrigorodrigues commented Feb 13, 2024

Hi @alina-yur, @lobaorn, I've made some good progress: the file is generated after shutting down the app, and it's huge.

rodrigo@rodrigo-Inspiron-5567:~/Downloads/workspace/rinha-de-backend-2024-q1-javaslow-spring$ kill -15 807936
-bash: kill: (807936) - No such process
[1]+  Exit 143                SERVER_PORT=9999 SPRING_CASSANDRA_LOCAL_DATACENTER=datacenter1 SPRING_CASSANDRA_KEYSPACE_NAME=rinha SPRING_THREADS_VIRTUAL_ENABLED=true ./target/rinha-backend-2024q1-javaslow-spring
rodrigo@rodrigo-Inspiron-5567:~/Downloads/workspace/rinha-de-backend-2024-q1-javaslow-spring$ ll default.iprof
-rw------- 1 rodrigo rodrigo 27797707 Feb 13 17:42 default.iprof

I couldn't see a way to make that work using the spring-boot plugin, so I created a simple Dockerfile to copy the native app.

FROM container-registry.oracle.com/os/oraclelinux:8-slim
COPY target/rinha-backend-2024q1-javaslow-spring javaslow-spring
ENTRYPOINT ["/javaslow-spring"]
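For completeness, building and running that image would look something like this (the tag name is made up; port 9999 matches the SERVER_PORT used earlier in the thread):

```shell
# Build the image from the Dockerfile above (tag is illustrative)
docker build -t javaslow-spring:optimised .

# Run it with the same settings used for the bare executable
docker run --rm -p 9999:9999 \
  -e SERVER_PORT=9999 \
  -e SPRING_THREADS_VIRTUAL_ENABLED=true \
  javaslow-spring:optimised
```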

I've changed the Gatling script to run more data, and now I can see better results.

instrumented version -
instrumented.tar.gz

========================================================================================================================
GraalVM Native Image: Generating 'rinha-backend-2024q1-javaslow-spring' (executable)...
========================================================================================================================
[1/8] Initializing...                                                                                    (9.9s @ 0.18GB)
 Java version: 21.0.2+13-LTS, vendor version: Oracle GraalVM 21.0.2+13.1
 Graal compiler: optimization level: 2, target machine: x86-64-v3, PGO: instrument
 C compiler: gcc (linux, x86_64, 9.4.0)

...

Gatling results.

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     189460 (OK=189460 KO=0     )
> min response time                                      2 (OK=2      KO=-     )
> max response time                                   1399 (OK=1399   KO=-     )
> mean response time                                   107 (OK=107    KO=-     )
> std deviation                                        144 (OK=144    KO=-     )
> response time 50th percentile                         31 (OK=31     KO=-     )
> response time 75th percentile                        175 (OK=175    KO=-     )
> response time 95th percentile                        405 (OK=405    KO=-     )
> response time 99th percentile                        595 (OK=595    KO=-     )
> mean requests/sec                                314.718 (OK=314.718 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        189189 (100%)
> 800 ms <= t < 1200 ms                                265 (  0%)
> t >= 1200 ms                                           6 (  0%)
> failed                                                 0 (  0%)
================================================================================

[screenshot]

optimised version -
optimised.tar.gz

========================================================================================================================
GraalVM Native Image: Generating 'rinha-backend-2024q1-javaslow-spring' (executable)...
========================================================================================================================
[1/8] Initializing...                                                                                   (11.9s @ 0.16GB)
 Java version: 21.0.2+13-LTS, vendor version: Oracle GraalVM 21.0.2+13.1
 Graal compiler: optimization level: 3, target machine: x86-64-v3, PGO: user-provided
 C compiler: gcc (linux, x86_64, 9.4.0)

...

Gatling results.

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     189460 (OK=189460 KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    873 (OK=873    KO=-     )
> mean response time                                    20 (OK=20     KO=-     )
> std deviation                                         40 (OK=40     KO=-     )
> response time 50th percentile                          3 (OK=3      KO=-     )
> response time 75th percentile                         23 (OK=23     KO=-     )
> response time 95th percentile                         86 (OK=86     KO=-     )
> response time 99th percentile                        195 (OK=195    KO=-     )
> mean requests/sec                                314.718 (OK=314.718 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        189458 (100%)
> 800 ms <= t < 1200 ms                                  2 (  0%)
> t >= 1200 ms                                           0 (  0%)
> failed                                                 0 (  0%)
================================================================================

[screenshot]

Btw, we're doing this just for fun. There's a competition with devs from Brazil to see who creates the fastest API using any language/framework, so I decided to pick the stack Java 21, GraalVM, Spring Boot and Cassandra.
More details can be found here; sorry, it's written only in Portuguese.

I've pushed the optimised image to Docker Hub and am now going to create a pull request for them to retest with this new image.

I'd like to tag you both in the project if that's not a problem. Thanks for the collaboration, folks, I really appreciate your help. :)


lobaorn commented Feb 13, 2024

Hey @rodrigorodrigues, from what I can see in the images, the response time of the optimized version was worse than even the instrumented one?

In that case, how was the result for the default native image, or for the Oracle GraalVM JIT?

I'm not sure whether G1 in this context is worse than Serial given the vCPU and memory constraints, but the result seems weird from the images... Either the problem is that you were running the instrumented version outside Docker, which made it faster, or the base image of the container is not the same as your host machine, so the optimizations applied by the -O3 of PGO and the arch specifics don't match the Docker image, hurting performance instead of helping.

So I would guess some tweaking of the properties would be beneficial, or trying the Oracle GraalVM JIT as well.

On that note, the fastest entries in the TechEmpower benchmarks always use -XX:+UseNUMA and -XX:+UseParallelGC, so they probably won't hurt in this case for the JIT.
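On the JIT side, those flags would go on a plain java launch of the Spring Boot jar, something like the following (the jar name is a guess based on the executable name shown above):

```shell
# JIT run with the GC/NUMA flags mentioned above; jar path is illustrative
java -XX:+UseNUMA -XX:+UseParallelGC \
     -Dspring.threads.virtual.enabled=true \
     -jar target/rinha-backend-2024q1-javaslow-spring.jar
```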

@rodrigorodrigues

Hi @lobaorn, you're right, sorry, my bad: I ran the instrumented version directly, so I'm going to run it again via the Docker image and update the reports.


lobaorn commented Feb 13, 2024

Also, here we have many GraalVM native image demos, including dockerization with Spring Boot:

https://github.com/graalvm/graalvm-demos/tree/master/spring-native-image

So perhaps you could reuse the same Dockerfile suggestion for the native image, namely:

https://github.com/graalvm/graalvm-demos/blob/master/spring-native-image/Dockerfiles/Dockerfile.native

The image itself might have some tweaks that could be better.


lobaorn commented Feb 13, 2024

Just one adjustment to the suggested Dockerfile: there is already an oraclelinux:9-slim, available on Docker Hub as well, not just the Oracle Container Registry:

https://hub.docker.com/_/oraclelinux

https://hub.docker.com/layers/library/oraclelinux/9-slim/images/sha256-10a56fb3e4b65c7016b9b5c039c82d7c96cb64b05f111cd1535cde601594bc31?context=explore

Size-wise, oraclelinux is bigger than ubuntu-jammy (https://hub.docker.com/layers/library/ubuntu/jammy/images/sha256-bcc511d82482900604524a8e8d64bf4c53b2461868dac55f4d04d660e61983cb?context=explore), but I am not sure whether it can yield better results.

@rodrigorodrigues

Hi @lobaorn, I updated my previous comment with the correct reports using the Docker image version; it seems better.


lobaorn commented Feb 13, 2024

Yes, now it makes more sense. The question is whether the native image without optimization is better, or whether the JIT is better. But I guess you will check whether it makes sense. Also, I am not sure if, by default, passing --pgo also implies -march=native. Since I saw you are on linux-amd64 and the Docker image will be the same, it could possibly benefit from -march=native.
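If it is not implied by --pgo, -march=native can be passed explicitly as an extra build argument — a sketch, assuming a GraalVM version recent enough to support the flag:

```shell
# MyApp is a placeholder for the actual main class / jar being built
native-image --pgo=default.iprof --gc=G1 -march=native MyApp
```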

And I am not sure if you enabled virtual threads, since there is no new commit.

In another direction, just to float ideas: one thing I will try by the end of the month is to use Unix sockets for the load balancing in nginx and make Spring Boot (with Tomcat or another server) listen on one:


rodrigorodrigues commented Feb 13, 2024

Hi @lobaorn, actually virtual threads are enabled in docker-compose.yaml. I just tried both cases and yes, with virtual threads enabled the results are much better.

SPRING_THREADS_VIRTUAL_ENABLED=true

================================================================================
---- Global Information --------------------------------------------------------
> request count                                      61390 (OK=61390  KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                    155 (OK=155    KO=-     )
> mean response time                                     6 (OK=6      KO=-     )
> std deviation                                          8 (OK=8      KO=-     )
> response time 50th percentile                          3 (OK=3      KO=-     )
> response time 75th percentile                          7 (OK=7      KO=-     )
> response time 95th percentile                         17 (OK=17     KO=-     )
> response time 99th percentile                         38 (OK=38     KO=-     )
> mean requests/sec                                253.678 (OK=253.678 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                         61390 (100%)
> 800 ms <= t < 1200 ms                                  0 (  0%)
> t >= 1200 ms                                           0 (  0%)
> failed                                                 0 (  0%)
================================================================================

SPRING_THREADS_VIRTUAL_ENABLED=false

===============================================================================
---- Global Information --------------------------------------------------------
> request count                                      61390 (OK=61389  KO=1     )
> min response time                                      1 (OK=1      KO=8     )
> max response time                                    302 (OK=302    KO=8     )
> mean response time                                     6 (OK=6      KO=8     )
> std deviation                                         12 (OK=12     KO=0     )
> response time 50th percentile                          3 (OK=3      KO=8     )
> response time 75th percentile                          7 (OK=7      KO=8     )
> response time 95th percentile                         19 (OK=19     KO=8     )
> response time 99th percentile                         54 (OK=54     KO=8     )
> mean requests/sec                                 254.73 (OK=254.726 KO=0.004 )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                         61389 (100%)
> 800 ms <= t < 1200 ms                                  0 (  0%)
> t >= 1200 ms                                           0 (  0%)
> failed                                                 1 (  0%)
---- Errors --------------------------------------------------------------------
> jmesPath(saldo.total).find.is(0), but actually found 58793          1 (100.0%)
================================================================================

Tbh I'm not familiar with Gatling and I'm still confused; I'm not sure whether the PGO approach is better than the previous one without it. Look at a few different runs:

no PGO, virtual threads enabled
no-pgo.tar.gz

[screenshot]

with PGO, virtual threads enabled
pgo-with-virtual-threads.tar.gz

[screenshot]

with PGO, virtual threads disabled
pgo-no-virtual-threads.tar.gz

[screenshot]


lobaorn commented Feb 13, 2024

The point is that the requests always arrive at the same cadence, so what needs to be checked is the value in each column, which is the response time. The higher the number, the worse it is.

So from all the images, PGO is working and virtual threads are also giving benefits.

From the other issue that was opened, it would probably be a good idea to remove the actuator, which may be adding a lot of overhead.

You could check your native files (by dump) using GraalVM Dashboard: https://www.graalvm.org/latest/reference-manual/native-image/guides/use-graalvm-dashboard/

https://www.graalvm.org/dashboard/

That way you can understand better, by the size of things, how much can be trimmed by using more or fewer dependencies.

And as I said previously, perhaps the G1 GC is not the best option for this context, and perhaps the Serial GC could do better. But only by trying will we really know.

At least now we know that PGO is working as expected, @alina-yur.

I am also not sure whether giving an even bigger .iprof file to PGO would be better, @rodrigorodrigues. Not by making the Gatling run bigger; I would actually keep it the same as in the original file, but just run it multiple times.


lobaorn commented Feb 13, 2024

There is also a Slack channel for native-image discussions; perhaps people there would be interested in this type of discussion/challenge: https://graalvm.slack.com/archives/CN9KSFB40


lobaorn commented Feb 14, 2024

Hey @rodrigorodrigues, I found a great tutorial on how to make static and mostly-static Spring Boot native images as well. It follows the same guidance as the official docs, but with all the necessary steps, even the Dockerfile.

The official docs are:

The tutorial I found is:

With that configuration, plus -march=native and analysis of the dump with the GraalVM Dashboard, you would probably get the last drop of performance for the application without changes to the code itself.

I guess that if you ran the original Gatling tests multiple times, all the runs after the first would show wrong answers and numbers, but that won't matter, since the important part is to give the best possible profile for the running app without modifications. If you could do 3-4 consecutive runs and build it as a fully static optimized native image, I think it would give the best result possible.

The doubt remains about G1GC vs SerialGC, and which other configurations of heap to apply.

Also, another thing to try is --initialize-at-build-time, so that everything is initialized at build time instead of at runtime. If it builds and runs without any hiccups, even better.
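As a buildArg that would look like the sketch below. Note that a blanket --initialize-at-build-time with no argument is aggressive and can fail the build for classes that are unsafe to initialize ahead of time, so the package-scoped form is usually safer (com.example.myapp is a hypothetical package name):

```xml
<buildArgs>
    <!-- com.example.myapp is a hypothetical package; omitting the argument
         entirely applies build-time initialization to all classes -->
    <buildArg>--initialize-at-build-time=com.example.myapp</buildArg>
</buildArgs>
```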

Another thing that I found on looking for optimizations of SpringBoot with native image, is this wiki: https://github.com/spring-projects/spring-boot/wiki/Spring-Boot-with-GraalVM

Especially this part about Tomcat:

Using tomcat-embed-programmatic
Motivation
tomcat-embed-programmatic is an experimental Tomcat dependency designed to lower the memory footprint. Using it produces smaller native images.

Then there is the necessary POM changes for the proper exclusion and inclusion.

If I find anything else, I will share. I will probably be doing everything I am listing here later this month, my personal machine is unavailable and this one is not proper to development... but the research continues :)

@rodrigorodrigues

Hi @lobaorn, @alina-yur, I've finished this task. In the end I decided to create a new version without a framework to see if it can go faster; the results look good.

[screenshot]

Source code is here; I created 2 different profiles, same as Alina's demo.

Thanks for all your support, closing.
