
Inference without ONNX / usage of WONNX as backend for LLMs #169

Open
philpax opened this issue May 17, 2023 · 21 comments

@philpax
Contributor

philpax commented May 17, 2023

Is your feature request related to a problem? Please describe.
I'm one of the maintainers of the llm project, and we're looking for a robust, cross-platform GPU inferencing solution for our LLM models. We currently have computation graphs for GGML, but are planning on introducing some kind of abstraction for use with other backends.

I'm investigating the use of wonnx as a potential backend, but it is (understandably!) coupled to ONNX. I was wondering if it would be possible to specify a computation graph directly for compilation/inference without going through ONNX.

Describe the solution you'd like
A builder API for computation graphs, or something similar, so that a wonnx::Session could be created without the use of ONNX.

Describe alternatives you've considered
I've considered constructing a wonnx::onnx::ModelProto at runtime, but the ONNX format contains a lot of things we don't need or don't have.

It's designed for self-contained models; however, we are loading weights from arbitrary locations and supplying our own computation graph, making it difficult for us to synthesize a complete ONNX model.

Additional context
There's no particular hurry on this. We'd love to have GPU inference as soon as possible - especially truly cross-platform, non-CUDA (!) inference - but I assume this would be a large body of work.

I'm also not sure what operations would need to be implemented for our use case, but we would file PRs as required to implement any missing operations.

@pixelspark
Collaborator

Hi @philpax, thanks for bringing this up here. I have given the idea of running LLMs through wonnx some thought over the past few days and I think it would actually be a great addition (as you say, it would provide cross-platform GPU inference even for non-NVIDIA hardware).

Adding a builder API would be a good first step. Instead of constructing an ONNX model, it could construct a WONNX IR graph directly (the IR is currently based on nodes that are enums containing mostly ONNX structs, so behind the scenes we would still partially build an ONNX graph, but with a much simpler interface; eventually we can replace the ONNX structs with our own, containing just the bits we need/support).

Ideally, LLM would make similar calls to wonnx as it currently does to ggml.

In short I think we need to implement the following:

  1. Builder API to construct WONNX IR directly without (visible) reliance on ONNX.

  2. A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.

  3. The various ops used in LLM, probably as ‘custom’, non-ONNX-standardized operators (e.g. an operator named “ggml.Rope”).

  4. How to handle caches between inference runs.

  5. Quantization support (not sure if WGSL and ONNX support int4 or whether we need to get creative). For the MVP we could restrict ourselves to fp8 models, of course.
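
To make points 1 and 3 a bit more concrete, here is a very rough sketch (in Rust) of what a builder-style API could look like from the caller's side. Every type and method name below is hypothetical and purely for illustration; nothing like this exists in wonnx today.

use std::collections::HashMap;

/// Hypothetical handle to a node output (not a real wonnx type).
#[derive(Clone, Copy)]
struct TensorRef(usize);

/// Hypothetical builder for the WONNX IR; every method name here is invented
/// purely for illustration.
#[derive(Default)]
struct GraphBuilder {
    nodes: Vec<String>,
}

impl GraphBuilder {
    /// Register a weight tensor (an "initializer" in ONNX parlance) from raw bytes.
    fn initializer(&mut self, name: &str, _data: Vec<u8>, _shape: Vec<usize>) -> TensorRef {
        self.nodes.push(format!("init:{name}"));
        TensorRef(self.nodes.len() - 1)
    }

    /// Add an op by type name; standard ONNX ops ("MatMul") and custom ones
    /// ("ggml.Rope") would go through the same call.
    fn op(&mut self, op_type: &str, _inputs: &[TensorRef], _attrs: HashMap<String, i64>) -> TensorRef {
        self.nodes.push(format!("op:{op_type}"));
        TensorRef(self.nodes.len() - 1)
    }
}

fn main() {
    let mut b = GraphBuilder::default();
    let wq = b.initializer("wq", vec![0u8; 16], vec![2, 2]);
    let x = b.initializer("x", vec![0u8; 16], vec![2, 2]);
    let h = b.op("MatMul", &[x, wq], HashMap::new());
    let _out = b.op("ggml.Rope", &[h], HashMap::from([("n_rot".to_string(), 32_i64)]));
    // A real implementation would then hand the finished graph to something
    // like wonnx::Session for compilation/inference.
}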

I will not be able to put significant amounts of work into this over the next few weeks, but would be highly interested in working on this together later on. Let me know what you think!

@philpax
Contributor Author

philpax commented May 17, 2023

That sounds fantastic! Glad to see you're as interested as I am 🙂

> A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.

Yep, that's reasonable. I'd imagine this would look something like giving wonnx a buffer to the raw data and having it upload it to the GPU.
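
Something like the following on the wgpu side, presumably. This is only a minimal sketch: upload_tensor is a made-up helper, and the usage flags are just a guess at what wonnx would actually want.

use wgpu::util::DeviceExt;

/// Upload a raw weight buffer to the GPU as a storage buffer. The bytes are
/// assumed to already be in whatever layout the shaders expect (e.g. f32 or
/// packed quantized blocks).
fn upload_tensor(device: &wgpu::Device, name: &str, data: &[u8]) -> wgpu::Buffer {
    device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: Some(name),
        contents: data,
        usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_DST,
    })
}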

> How to handle caches between inference runs.

We split up our model and inference, so that a Model can remain resident in memory and inference can be done against that Model using an InferenceSession. A similar change might be worthwhile?
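
Loosely, the split looks like this (heavily simplified, and not the actual llm or wonnx types):

/// Owns the weights; loaded once and kept resident (on the GPU, in wonnx's case).
struct Model {
    weights: Vec<Vec<u8>>, // stand-in for GPU buffers
}

/// Per-run state: KV cache, token history, scratch space.
struct InferenceSession<'a> {
    model: &'a Model,
    kv_cache: Vec<f32>,
}

impl Model {
    fn start_session(&self) -> InferenceSession<'_> {
        InferenceSession { model: self, kv_cache: Vec::new() }
    }
}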

> Quantization support (not sure if WGSL and ONNX support int4 or whether we need to get creative). For the MVP we could restrict ourselves to fp8 models, of course.

Yes, this is a little complicated as GGML defines its own quantization formats. You can see what llama.cpp's CUDA code for unpacking/operations looks like here:

https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu
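
For a rough idea of what a q4_0-style block decodes to, a CPU-side sketch is below. Hedging heavily here: the exact block layout and nibble pairing have changed across GGML versions, so treat this as pseudocode rather than the current format.

/// Rough CPU-side sketch of q4_0-style dequantization, for reference only.
/// Assumes a block of 32 values stored as one f32 scale plus 16 bytes of
/// packed signed 4-bit quants.
fn dequantize_q4_0_block(scale: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &b) in qs.iter().enumerate() {
        let lo = (b & 0x0F) as i32 - 8; // low nibble  -> value in [-8, 7]
        let hi = (b >> 4) as i32 - 8;   // high nibble -> value in [-8, 7]
        out[2 * i] = lo as f32 * scale;
        out[2 * i + 1] = hi as f32 * scale;
    }
    out
}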

> I will not be able to put significant amounts of work into this over the next few weeks, but would be highly interested in working on this together later on. Let me know what you think!

This all sounds great to me. Happy to work with you on it - just let me know what you need!

@pixelspark
Collaborator

> That sounds fantastic! Glad to see you're as interested as I am 🙂

> A way to load tensors (initializers in ONNX parlance) from GGML format. This should be easy (at least to implement a slow version) and can be optimized later.

> Yep, that's reasonable. I'd imagine this would look something like giving wonnx a buffer to the raw data and having it upload it to the GPU.

> How to handle caches between inference runs.

> We split up our model and inference, so that a Model can remain resident in memory and inference can be done against that Model using an InferenceSession. A similar change might be worthwhile?

Wonnx does something similar but cannot (yet) share a model and its constant tensors between sessions. It is a good idea to make this split (maybe not for an MVP but still).

> Quantization support (not sure if WGSL and ONNX support int4 or whether we need to get creative). For the MVP we could restrict ourselves to fp8 models, of course.

> Yes, this is a little complicated as GGML defines its own quantization formats. You can see what llama.cpp's CUDA code for unpacking/operations looks like here:

> https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu

As long as we have a way to quantize/dequantize in WGSL it can be made to work I guess!

> I will not be able to put significant amounts of work into this over the next few weeks, but would be highly interested in working on this together later on. Let me know what you think!

> This all sounds great to me. Happy to work with you on it - just let me know what you need!

Let’s settle on a target/reference model to work with - that allows us to compare CPU output with our output. What do you think would be a good reference? (a specific LLaMA-like, 3B, fp8?)

Also it would be helpful if you could investigate the ops we would need and whether there are equivalents in WONNX/ONNX already. If not, we ideally want a reference implementation somewhere else (e.g. in ggml, but we could also have a look at MLC-LLM’s WebLLM; there should be some WGSL there?)

pixelspark changed the title from "Inference without ONNX" to "Inference without ONNX / usage of WONNX as backend for LLMs" on May 17, 2023
@philpax
Contributor Author

philpax commented May 19, 2023

Sorry about the delay in getting back to you!

> Wonnx does something similar but cannot (yet) share a model and its constant tensors between sessions. It is a good idea to make this split (maybe not for an MVP but still).

Yeah, I noticed that. Nice to have, but not a showstopper.

> As long as we have a way to quantize/dequantize in WGSL it can be made to work I guess!

It should all be possible, but I'm not sure what the best way to handle the changing GGML quantization formats is. Does it make sense to have support for the formats directly in wonnx?

> Let’s settle on a target/reference model to work with - that allows us to compare CPU output with our output. What do you think would be a good reference? (a specific LLaMA-like, 3B, fp8?)

Agreed - there are lots of LLaMA models out there, but it's best to go for something unburdened. I'd suggest something like the RedPajama models, which are based on the GPT-NeoX architecture, have 3B variants, and can easily be quantized to whatever format: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1

There's already readily-available GGML support, so I'm not too worried about it from our end. Ideally, we can compare outputs by sampling the most probable token each time, but I suspect the differences between GPU and CPU computation will lead to inconsistent results anyway. Perhaps we can measure perplexity?
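
For reference, perplexity is just the exponential of the mean negative log-likelihood over the evaluated tokens, so it can be compared between the CPU and GPU paths even when the sampled tokens diverge. A minimal sketch:

/// Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
/// `log_probs` holds the model's natural-log probability of each observed token.
fn perplexity(log_probs: &[f32]) -> f32 {
    let nll = -log_probs.iter().sum::<f32>() / log_probs.len() as f32;
    nll.exp()
}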

> Also it would be helpful if you could investigate the ops we would need and whether there are equivalents in WONNX/ONNX already.

The operations used by our existing suite of models are the following, with the ones used by GPT-NeoX bolded:

  • add: Adds two tensors together.
  • alibi: Attention with LInear BIases
  • cont: Unsure, need to figure this out
  • cpy: Copies the contents of one tensor to another.
  • diag_mask_inf: Creates a tensor where the elements above the diagonal are -inf.
  • gelu: Gaussian Error Linear Units
  • get_rows: Copies the rows from one tensor to another. Not sure how this is different to cpy.
  • mul: Element-wise multiplication of tensors.
  • mul_mat: Matrix multiplication.
  • norm: Normalise each row of the tensor.
  • repeat: Not sure.
  • rms_norm: RMS norm each row of the tensor.
  • rope: ROtary Positional Encoding
  • scale: Scales a tensor by a scalar.
  • silu: SiLU activation function
  • soft_max: Softmax function

Unfortunately, some of these are quite unclear as GGML's documentation on the actual operations is sparse and the implementations are quite dense. I'll have to further investigate.

From these, I can say that add, mul, mul_mat, softmax are listed in the README (and supported). I'm not sure about the others, but some of them should be trivial to compose from existing operations.
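
As one concrete data point, here's a scalar sketch of rms_norm as I understand it, assuming the usual x / sqrt(mean(x²) + eps) formulation with no learned weight; worth double-checking against GGML's implementation:

/// Scalar reference for rms_norm over one row, assuming the usual
/// x / sqrt(mean(x^2) + eps) formulation (check against ggml_rms_norm).
fn rms_norm_row(row: &mut [f32], eps: f32) {
    let mean_sq = row.iter().map(|x| x * x).sum::<f32>() / row.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    for x in row.iter_mut() {
        *x *= scale;
    }
}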

@pixelspark
Collaborator

pixelspark commented May 21, 2023

I took a quick look at the ops in bold and I think most will be rather easy to implement. Some ops may not even be needed:

  • All cont appears to do is make a contiguous copy of some other tensor, which typically is a ‘view’ of some tensor. The existing Gather op may be able to provide similar functionality (and if not, this is very easy to implement).
  • cpy is likely not necessary, as in wonnx intermediate buffers are immutable and only reused between nodes if their lifetimes allow for it. The only reason to use copy would be between iterations (copying output to some input).
  • repeat repeats one tensor as often as necessary to fill a certain shape. Trivial to implement as a new op.
  • scale and gelu look like simple element-wise mappings (we just need to write the WGSL for the specific mappings); a scalar reference for gelu is sketched below.
  • permute just seems to shuffle dimension metadata, which is optimized away in the wonnx IR.
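
For the element-wise ones, a scalar reference of what the WGSL would need to compute. I am assuming gelu uses the usual tanh approximation here (that is my reading of ggml, but it should be verified):

/// Scalar reference for gelu, assuming the tanh approximation
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}

/// scale is just an element-wise multiplication by a scalar.
fn scale(row: &mut [f32], s: f32) {
    for x in row.iter_mut() {
        *x *= s;
    }
}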

I will make a first attempt at the builder API later today (jet lag permitting).

@FL33TW00D

@philpax @pixelspark An example of a quantized GEMM in WGSL:
https://github.com/FL33TW00D/wgpu-mm/blob/feature/kernel-series/shaders/gemv/qgemv_1.wgsl

WGSL features some handy packing/unpacking functions to make quantisation easier; however, these don't extend to INT4. Doing it by hand is quite trivial, though:

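// Unpacks eight signed 4-bit values from a single i32 (lowest nibble first) and rescales each by absmax / 7.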
fn unpackInt4x8(value: i32, absmax: f32) -> array<vec4<f32>, 2> {
    let x = f32((value << 28) >> 28) / 7.0 * absmax;
    let y = f32((value << 24) >> 28) / 7.0 * absmax;
    let z = f32((value << 20) >> 28) / 7.0 * absmax;
    let w = f32((value << 16) >> 28) / 7.0 * absmax;
    let a = f32((value << 12) >> 28) / 7.0 * absmax;
    let b = f32((value << 8) >> 28) / 7.0 * absmax;
    let c = f32((value << 4) >> 28) / 7.0 * absmax;
    let d = f32((value >> 28)) / 7.0 * absmax;
    return array<vec4<f32>, 2>(vec4<f32>(x, y, z, w), vec4<f32>(a, b, c, d));
}

@pixelspark
Collaborator

> Agreed - there are lots of LLaMA models out there, but it's best to go for something unburdened. I'd suggest something like the RedPajama models, which are based on the GPT-NeoX architecture, have 3B variants, and can easily be quantized to whatever format: https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1

@philpax I am trying to get RedPajama up and running with llm just to familiarize myself with it a bit, but I can't get it to work, possibly due to the recent changes in the GGML quantization formats. The link above contains a Torch model (which I am sure can be converted to GGML, but I haven't managed to do so). Other repositories have GGML versions, but with different file versions, it appears.

Could you perhaps point me to a specific .bin file that should work with latest llm?

@LLukas22

LLukas22 commented May 24, 2023

@pixelspark Here are some converted RedPajama models, which should work with the latest main branch. (I haven't created the README yet.)

I can also recommend MPT-based models, which are also openly licensed. (Instructions can be found in the HF repository.)

@pixelspark
Collaborator

> @pixelspark Here are some converted RedPajama models, which should work with the latest main branch. (I haven't created the README yet.)

That link shows a 404 for me?

@LLukas22

Sorry, the repository was still private. But I would still recommend using MPT, as some GPT-NeoX-based models (including RedPajama) have problems with added BOS tokens (see rustformers/llm#270).

@pixelspark
Collaborator

Thanks @LLukas22, will try this later. If it works I can start investigating the different ops it uses (as listed by @philpax) and check if we can implement those.

@philpax
Contributor Author

philpax commented May 25, 2023

Apologies for the confusion there, it's been a bit hectic. We now target GGJT v3/QNT2 exclusively, as of five minutes ago 😅

Yes, RedPajama models are sensitive to BOS, but that shouldn't impact your experimentation too much. I'd suggest sticking with GPT-NeoX as it's a relatively well-established architecture with several models being built on top of it (Pythia, StableLM, RedPajama, etc).

I also realized that my list excludes a few operations that I thought were no-ops in GGML, but I've since realized still create new tensors (with the reshaping happening at the point of tensor creation, and not at the point of graph computation). These operations are:

  • permute: Creates a view tensor with permuted dimensions, such that the dimensions are swapped around in accordance with the axes
  • reshape_2d: Creates a view tensor with dimensions Y*Z from a tensor W*X, where W*X == Y*Z
  • reshape_3d: Same as 2D, but for three dimensions
  • transpose: Creates a view tensor for a 2D tensor in which the two dimensions are swapped
  • view_1d: Creates a 1D view tensor of length ne0 from offset bytes into the tensor
  • view_2d: Creates a 2D view tensor of size ne0 * ne1 from offset bytes into the tensor, with a stride of nb1 bytes
  • view_3d: Creates a 3D view tensor of size ne0 * ne1 * ne2 from offset bytes into the tensor, with a stride of nb1 and nb2 bytes
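
For what it's worth, these are all view operations: the same underlying data reinterpreted through new shape/stride/offset metadata, with no data movement. A simplified sketch of what that amounts to (ggml counts offsets and strides in bytes and tracks up to four dimensions; this sketch uses element counts for brevity):

/// Simplified sketch of a GGML-style "view": the same underlying buffer
/// reinterpreted through new shape/stride/offset metadata.
struct TensorView {
    shape: Vec<usize>,   // ne0, ne1, ...
    strides: Vec<usize>, // per-dimension element strides
    offset: usize,       // start offset into the shared buffer
}

impl TensorView {
    /// transpose of a 2-D view: swap shape and strides, leave the data alone.
    fn transpose(&self) -> TensorView {
        TensorView {
            shape: vec![self.shape[1], self.shape[0]],
            strides: vec![self.strides[1], self.strides[0]],
            offset: self.offset,
        }
    }
}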

Yes, GGML can be annoyingly low-level at times :/

I'd maybe suggest skipping the GGML implementation for now and going straight for reimplementing the original Python implementations. They're less likely to encode details like that.

@pixelspark
Collaborator

> Apologies for the confusion there, it's been a bit hectic. We now target GGJT v3/QNT2 exclusively, as of five minutes ago 😅

So I finally got llm working with MPT and RedPajama yesterday, and now it doesn't work anymore... very confusing indeed!

@LLukas22 could you point me at the model files I should use now? Below are my results with the RedPajama files currently in HF:

  • RedPajama-INCITE-Chat-3B-v1-f16.bin: loads, but produces gibberish:

[screenshot]

  • RedPajama-INCITE-Base-3B-v1-q4_0-ggjt: does not load:

[screenshot]

  • RedPajama-INCITE-Chat-3B-v1-q4_0.bin: does not load:

[screenshot]

  • RedPajama-INCITE-Chat-3B-v1-q4_0-ggjt.bin: same

> Yes, RedPajama models are sensitive to BOS, but that shouldn't impact your experimentation too much. I'd suggest sticking with GPT-NeoX as it's a relatively well-established architecture with several models being built on top of it (Pythia, StableLM, RedPajama, etc).

OK, seems like a good idea

> I also realized that my list excludes a few operations that I thought were no-ops in GGML, but I've since realized still create new tensors (with the reshaping happening at the point of tensor creation, and not at the point of graph computation). These operations are:

>   • permute: Creates a view tensor with permuted dimensions, such that the dimensions are swapped around in accordance with the axes
>   • reshape_2d: Creates a view tensor with dimensions Y*Z from a tensor W*X, where W*X == Y*Z
>   • reshape_3d: Same as 2D, but for three dimensions
>   • transpose: Creates a view tensor for a 2D tensor in which the two dimensions are swapped
>   • view_1d: Creates a 1D view tensor of length ne0 from offset bytes into the tensor
>   • view_2d: Creates a 2D view tensor of size ne0 * ne1 from offset bytes into the tensor, with a stride of nb1 bytes
>   • view_3d: Creates a 3D view tensor of size ne0 * ne1 * ne2 from offset bytes into the tensor, with a stride of nb1 and nb2 bytes

Yes, I noticed these when browsing the llm code that builds the graph. Thanks for digging into these and the above descriptions, very helpful! These ops are probably all trivial to implement (and/or not necessary).

> Yes, GGML can be annoyingly low-level at times :/

> I'd maybe suggest skipping the GGML implementation for now and going straight for reimplementing the original Python implementations. They're less likely to encode details like that.

Yes, but ideally we do load the model weights from the GGML quantized model formats (as that is what llm reads), right?

So let's do the exercise one more time then: what is a good reference Python implementation for GPT-NeoX, what ops does it use, and how do they map to the currently supported ops in wonnx? (In researching this it could be interesting to attempt a conversion from Torch to ONNX, not because we want to use the ONNX model, but to see how it maps to ONNX ops, as that is what we currently follow for the wonnx IR.)

Ideally we would have a picture like this one (linked from here) for GPT-NeoX, with the ops we are going to use/need (preferably existing, already implemented ONNX ops, but I'm open to adding new custom ops to wonnx as well, even if just to speed up inference for LLM, to support quantization, etc.).

@LLukas22

@pixelspark GGML recently updated their quantization format (see ggerganov/llama.cpp#1508). Yesterday these changes were merged into llm. This means all quantized models (marked with qX_Y) need to be reconverted. Currently I'm setting up GitHub Actions to automate these conversions. I will re-upload the models to the Rustformers organization and add instructions on how to use them. This will take some time, as I have to upload around 200 GB of model files. Will let you know when the RedPajama models are ready 👍

@LLukas22

Alright, the models are converted and uploaded. I also added Pythia models, which are smaller GPT-NeoX models we could use for development.
RedPajama: https://huggingface.co/Rustformers/redpajama-ggml
Pythia: https://huggingface.co/Rustformers/pythia-ggml

@pixelspark
Collaborator

pixelspark commented May 25, 2023

@LLukas22 still seeing the following:

[screenshot]

SHA-256 Hash of the file as I see it (PowerShell Get-FileHash):

D0E1BC0DEA48252CCE95552DBCA4E74DE9D49024C4583DEDD497359A89B2F9A2

As for MPT:

[screenshot]

Do I need to use other files?

@LLukas22

LLukas22 commented May 25, 2023

Hm, strange - I'm using the exact same model and the same git revision, and it's working as expected (the first few tokens are garbled for RedPajama because of the BOS issue). Maybe you have to run a cargo update in your clone? @philpax, do you have any idea why this could happen?
[screenshot]

MPT results:
[screenshot]

@pixelspark
Collaborator

pixelspark commented May 25, 2023

@LLukas22 sorry, I am stupid - I forgot to do a git submodule update. Now it all seems to work!

Still getting some weird results though:

[screenshot]

MPT is fine apparently although... very elaborate (?):

[screenshot]

@LLukas22

As previously mentioned, the strange results from RedPajama are expected, as the CLI uses the wrong BOS token at the moment.
To be honest, I don't know exactly what the chat feature in the CLI does, but I'm guessing it was built for LLaMA-based models. MPT-Chat probably expects a different prompt format, which the current chat implementation does not provide. I would recommend sticking to the infer command if you want to play around with the models.

@antimora

Just wanted to add that the Burn team would love to use parts of WONNX as a backend for Burn (without ONNX). We are tracking this work here (tracel-ai/burn#243). CCing @nathanielsimard since he has started doing research in this area.

@pixelspark
Collaborator

@antimora great to hear! If you haven't seen it yet, you might want to have a look at #170. It is a work in progress to offer a non-ONNX builder API from WONNX (it will primarily offer access to the implemented ONNX ops, but possibly others in the future). Contributions are welcome!
