Slow on Llama_cpp_python #4

Open

dnhkng opened this issue Jan 26, 2024 · 10 comments

dnhkng commented Jan 26, 2024

This seems to work nicely for my use case, but with the llama_cpp_python backend, a long prompt and a dozen completions take about a minute. Any tips on improving speed?

kddubey commented Jan 26, 2024

Hi @dnhkng,

Unfortunately cappr.llama_cpp can be slow. This is partly because there currently isn't a way to batch over completions like there is in cappr.huggingface. The CAPPr algorithm also doesn't scale well, as discussed here.

Two different routes to improve performance:

  1. Shorten the prompt. Perhaps if you post the prompt and completions we can figure out ways to make it shorter.

  2. Don't use CAPPr. There are other LLM structuring tools that provide a "just pick" functionality. I'm most familiar with the algorithm for guidance.select, and it's quite efficient.

Apologies for the slowness, and thanks for raising the issue

dnhkng commented Jan 26, 2024

I do want to use CAPPr.

I need the probabilities (yes I know that we can't use these as confidence values), but I think I have found something useful.

I need to run a few thousand inferences over a long prompt with a large number of categories to test this, though.

Any other tips would be welcome.

kddubey commented Jan 26, 2024

  1. Are you using a GPU? If not (and assuming it's OK to use a cloud GPU for security/compliance), can you point to the GGUF model you're using? Maybe the model and task will fit on a free or cheap GPU.

  2. Can you use a smaller version of the model? Also, if there's an AutoGPTQ version of the model, cappr.huggingface may be faster. Let me know which model you're using.

  3. Can you post the prompt and completions (or something similar enough to work with)? In case you're doing in-context learning / few-shot prompting, consider reducing or eliminating the examples, or use a zero-shot prompt. Sometimes, the lift from in-context learning is marginal.

kddubey commented Jan 26, 2024

Another tip: if all of the prompts start with the same long substring, e.g., system instructions and few-shot examples, use cappr.llama_cpp.cache_model
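For concreteness, here's a minimal sketch of that setup. The import path, the cache_model signature, and the predict_proba call are my best guesses at the cappr API (check the docs for the exact names and arguments); the model path, prefix, prompts, and completions are placeholders.

```python
from llama_cpp import Llama
from cappr.llama_cpp.classify import cache_model, predict_proba  # assumed import path

# logits_all=True so that token-level log-probabilities are available (placeholder path)
model = Llama("/path/to/model.gguf", logits_all=True, verbose=False)

# The long substring shared by every prompt: system instructions, few-shot examples, etc.
shared_prefix = "System instructions and few-shot examples go here."

# Assumed usage: process the shared prefix once and reuse its cached state for every prompt
cached_model = cache_model(model, shared_prefix)

prompts = ["first question ...", "second question ..."]
completions = ["category A", "category B", "category C"]

# Probability of each completion, for each (shared_prefix + prompt)
pred_probs = predict_proba(prompts, completions, cached_model)
print(pred_probs)  # one row per prompt, one column per completion
```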

dnhkng commented Jan 26, 2024

Thanks for the tips!

  1. Yes, I'm using a beefy GPU.
  2. A big model is required, unfortunately.
  3. The prompt is necessarily long; no way around that.

I'm currently using Outlines for category selection, but I want to try a new approach that requires probabilities. Initially, I planned on the naive approach: tokenise the categories, collect the probabilities for each token, and do the whole thing in one batch the size of the number of categories.

But I assume that sometimes the same category wording can be produced by multiple token combinations... So I found your library to use instead.
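For what it's worth, here is a rough sketch of that naive scheme (not CAPPr itself), using Hugging Face transformers with gpt2 purely as a stand-in model and a made-up prompt. It scores each category under the single tokenization the tokenizer happens to produce, and for simplicity loops rather than batching over categories:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Review: 'terrible battery life'. The category is:"
categories = [" positive", " negative", " neutral"]

scores = {}
with torch.no_grad():
    # Assumes tokenize(prompt + category) starts with tokenize(prompt), which usually
    # holds when each category begins with a space
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for category in categories:
        input_ids = tokenizer(prompt + category, return_tensors="pt").input_ids
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)  # (1, seq, vocab)
        # The token at position i is predicted by the logits at position i - 1
        category_ids = input_ids[0, n_prompt:]
        token_log_probs = log_probs[0, n_prompt - 1 : -1].gather(
            1, category_ids.unsqueeze(1)
        ).squeeze(1)
        scores[category] = token_log_probs.sum().item()  # log P(category tokens | prompt)

print(scores)
```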

kddubey commented Jan 26, 2024

Makes sense. How long are the completions? Despite the long prompt, it's still surprising to me that 12 completions take a minute. Maybe the prompt can be refactored a bit to work with cappr.llama_cpp.cache_model. And is there an AutoGPTQ version of the big model?

Also, I found this point from your previous comment interesting:

I need the probabilities (yes I know that we can't use these as confidence values)

Can you elaborate on this / point to a reference? I've barely studied this

dnhkng commented Jan 26, 2024

I just mean the fact that LLMs are trained with a cross-entropy loss on next-token prediction. This leads to overconfidence, as the loss does not penalise miscalibration. But maybe that's not a huge issue?

kddubey commented Jan 26, 2024

This leads to overconfidence, as the loss does not penalise miscalibration.

I believe that research is about neural networks leading to overconfidence. Cross entropy (CE) / negative log-likelihood (NLL) is also used to fit logistic regression, for example. A short mathematical argument will make it clear that CE/NLL is great for calibration: the loss is $-\log \hat{y}$, where $\hat{y}$ is the predicted probability of the given class. As $\hat{y}$ gets closer to 0 / further from 1, the loss goes up. In fact, it's unbounded. So CE/NLL penalizes miscalibrated models much more than other losses like mean squared error.
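To make that concrete, a tiny numeric illustration in plain Python: as the predicted probability of the true class shrinks, the NLL grows without bound, while a bounded loss like squared error tops out at 1.

```python
import math

for y_hat in (0.9, 0.5, 0.1, 0.01, 0.001):
    nll = -math.log(y_hat)              # cross entropy / negative log-likelihood
    squared_error = (1.0 - y_hat) ** 2  # bounded above by 1
    print(f"y_hat={y_hat:<6} NLL={nll:7.3f}  squared error={squared_error:.3f}")
```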

That aside, I take your point. LLMs are NNs, and they shouldn't be expected to be calibrated. Though I think it's worth researching how calibrated CAPPr's probabilities are. Hopefully they turn out to be helpful for your task

dnhkng commented Jan 27, 2024

Exactly. The tests I'm planning are to see how it works in practice.

As for how the algorithm works, does it sum the probabilities of all paths?

E.g., assume we are using a character-level LLM and want to calculate the probability of the word "foobar"; it can be tokenised as:
[['f', 'o', 'o', 'b', 'a', 'r'], ['f', 'o', 'o', 'b', 'ar'], ['f', 'o', 'o', 'ba', 'r'], ['f', 'o', 'o', 'bar'], ['f', 'o', 'ob', 'a', 'r'], ['f', 'o', 'ob', 'ar'], ['f', 'o', 'oba', 'r'], ['f', 'o', 'obar'], ['f', 'oo', 'b', 'a', 'r'], ['f', 'oo', 'b', 'ar'], ['f', 'oo', 'ba', 'r'], ['f', 'oo', 'bar'], ['f', 'oob', 'a', 'r'], ['f', 'oob', 'ar'], ['f', 'ooba', 'r'], ['f', 'oobar'], ['fo', 'o', 'b', 'a', 'r'], ['fo', 'o', 'b', 'ar'], ['fo', 'o', 'ba', 'r'], ['fo', 'o', 'bar'], ['fo', 'ob', 'a', 'r'], ['fo', 'ob', 'ar'], ['fo', 'oba', 'r'], ['fo', 'obar'], ['foo', 'b', 'a', 'r'], ['foo', 'b', 'ar'], ['foo', 'ba', 'r'], ['foo', 'bar'], ['foob', 'a', 'r'], ['foob', 'ar'], ['fooba', 'r']]

Of course, we can split the word into chunks and recursively calculate the probability: build the graph, find all the sub-chunks, and get the logits required to fill in the branches. Once this is done, we can use those probabilities to compute the probability of each path and back-calculate the average probability per token. Is this what CAPPr does?
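To illustrate the combinatorics behind that idea, here is a small sketch (purely hypothetical, and not what CAPPr does, per the reply below) that enumerates every way a character-level vocabulary could split the word:

```python
def segmentations(word: str):
    """Yield every way to split `word` into contiguous chunks."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest

paths = list(segmentations("foobar"))
print(len(paths))  # 2 ** 5 = 32 splits for a 6-character word, counting ["foobar"] itself
```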

kddubey commented Jan 27, 2024

As for how the algorithm works, does it sum the probabilities of all paths?

No. It tokenizes end_of_prompt + completion and calculates the log-probability of each of those tokens given the prompt. If end_of_prompt is a whitespace (almost always the case in practice, and always implicitly so for SentencePiece tokenizers), then tokenizer(end_of_prompt + completion) is deterministic for every popular BPE and SentencePiece tokenizer. Even if there are nuances I'm not considering, I'd be surprised to find that they're worrisome. Evidence so far suggests that CAPPr performs fine.
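As a quick spot check of that point (using GPT-2's BPE tokenizer via transformers purely as an example, not the cappr source): tokenizing end_of_prompt + completion gives one fixed token sequence, so there is a single path to score rather than the many splits listed above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

end_of_prompt = " "
for completion in ("foobar", "positive", "not spam"):
    ids = tokenizer(end_of_prompt + completion).input_ids
    print(repr(completion), "->", tokenizer.convert_ids_to_tokens(ids))
```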

See my post about it here and related work for more detail on how it works.
