Slow on Llama_cpp_python #4

Open

dnhkng opened this issue Jan 26, 2024 · 10 comments

dnhkng commented Jan 26, 2024

This seems to work nicely for my use case, but with the llama_cpp_python backend, a long prompt and a dozen completions take about a minute. Any tips on improving speed?

kddubey commented Jan 26, 2024

Hi @dnhkng,

Unfortunately cappr.llama_cpp can be slow. This is partly because there currently isn't a way to batch over completions like there is in cappr.huggingface. The CAPPr algorithm also doesn't scale well, as discussed here.

Two different routes to improve performance:

  1. Shorten the prompt. Perhaps if you post the prompt and completions we can figure out ways to make it shorter.

  2. Don't use CAPPr. There are other LLM structuring tools that provide a "just pick" functionality. I'm most familiar with the algorithm for guidance.select, and it's quite efficient.

Apologies for the slowness, and thanks for raising the issue

dnhkng commented Jan 26, 2024

I do want to use CAPPr.

I need the probabilities (yes I know that we can't use these as confidence values), but I think I have found something useful.

I need to run a few thousand inferences over a long prompt with a large number of categories to test this, though.

Any other tips would be welcome.

kddubey commented Jan 26, 2024

  1. Are you using a GPU? If not (and assuming it's OK to use a cloud GPU for security/compliance), can you point to the GGUF model you're using? Maybe the model and task will fit on a free or cheap GPU.

  2. Can you use a smaller version of the model? Also, if there's an AutoGPTQ version of the model, cappr.huggingface may be faster. Let me know which model you're using.

  3. Can you post the prompt and completions (or something similar enough to work with)? In case you're doing in-context learning / few-shot prompting, consider reducing or eliminating the examples, or use a zero-shot prompt. Sometimes, the lift from in-context learning is marginal.

kddubey commented Jan 26, 2024

Another tip: if all of the prompts start with the same long substring, e.g., system instructions and few-shot examples, use cappr.llama_cpp.cache_model
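For concreteness, here's a minimal sketch of that setup. The import path, the cache_model signature, and the predict_proba call are my best guesses at the cappr API (check the docs for the exact names and arguments); the model path, prefix, prompts, and completions are placeholders.

```python
from llama_cpp import Llama
from cappr.llama_cpp.classify import cache_model, predict_proba  # assumed import path

# logits_all=True so that token-level log-probabilities are available (placeholder path)
model = Llama("/path/to/model.gguf", logits_all=True, verbose=False)

# The long substring shared by every prompt: system instructions, few-shot examples, etc.
shared_prefix = "System instructions and few-shot examples go here."

# Assumed usage: process the shared prefix once and reuse its cached state for every prompt
cached_model = cache_model(model, shared_prefix)

prompts = ["first question ...", "second question ..."]
completions = ["category A", "category B", "category C"]

# Probability of each completion, for each (shared_prefix + prompt)
pred_probs = predict_proba(prompts, completions, cached_model)
print(pred_probs)  # one row per prompt, one column per completion
```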

dnhkng commented Jan 26, 2024

Thanks for the tips!

  1. Yes, I'm using a beefy GPU.
  2. A big model is required, unfortunately.
  3. The prompt is necessarily long; no way around that.

I'm currently using Outlines for category selection, but I want to try a new approach that requires probabilities. Initially, I planned on the naive approach: tokenise the categories, collect the probabilities for each token, and do the whole thing in one batch the size of the number of categories.

But I assume that sometimes the same category wording can be produced by multiple token combinations... So I found your library to use instead.
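For what it's worth, here is a rough sketch of that naive scheme (not CAPPr itself), using Hugging Face transformers with gpt2 purely as a stand-in model and a made-up prompt. It scores each category under the single tokenization the tokenizer happens to produce, and for simplicity loops rather than batching over categories:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Review: 'terrible battery life'. The category is:"
categories = [" positive", " negative", " neutral"]

scores = {}
with torch.no_grad():
    # Assumes tokenize(prompt + category) starts with tokenize(prompt), which usually
    # holds when each category begins with a space
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for category in categories:
        input_ids = tokenizer(prompt + category, return_tensors="pt").input_ids
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)  # (1, seq, vocab)
        # The token at position i is predicted by the logits at position i - 1
        category_ids = input_ids[0, n_prompt:]
        token_log_probs = log_probs[0, n_prompt - 1 : -1].gather(
            1, category_ids.unsqueeze(1)
        ).squeeze(1)
        scores[category] = token_log_probs.sum().item()  # log P(category tokens | prompt)

print(scores)
```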

kddubey commented Jan 26, 2024

Makes sense. How long are the completions? Despite the long prompt, it's still surprising to me that 12 completions take a minute. Maybe the prompt can be refactored a bit to work with cappr.llama_cpp.cache_model. And is there an AutoGPTQ version of the big model?

Also, I found this point from your previous comment interesting:

I need the probabilities (yes I know that we can't use these as confidence values)

Can you elaborate on this / point to a reference? I've barely studied this

dnhkng commented Jan 26, 2024

I just mean the fact that LLMs are trained with a cross-entropy loss on next-token prediction. This leads to overconfidence, as the loss does not penalise miscalibration. But maybe that's not a huge issue?

kddubey commented Jan 26, 2024

This leads to overconfidence, as the loss does not penalise miscalibration.

I believe that research is about neural networks leading to overconfidence. Cross entropy (CE) / negative log-likelihood (NLL) is also used to fit logistic regression, for example. A short mathematical argument will make it clear that CE/NLL is great for calibration: the loss is $-\log \hat{y}$, where $\hat{y}$ is the predicted probability of the given class. As $\hat{y}$ gets closer to 0 / further from 1, the loss goes up. In fact, it's unbounded. So CE/NLL penalizes miscalibrated models much more than other losses like mean squared error.
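To make that concrete, a tiny numeric illustration in plain Python: as the predicted probability of the true class shrinks, the NLL grows without bound, while a bounded loss like squared error tops out at 1.

```python
import math

for y_hat in (0.9, 0.5, 0.1, 0.01, 0.001):
    nll = -math.log(y_hat)              # cross entropy / negative log-likelihood
    squared_error = (1.0 - y_hat) ** 2  # bounded above by 1
    print(f"y_hat={y_hat:<6} NLL={nll:7.3f}  squared error={squared_error:.3f}")
```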

That aside, I take your point. LLMs are NNs, and they shouldn't be expected to be calibrated. Though I think it's worth researching how calibrated CAPPr's probabilities are. Hopefully they turn out to be helpful for your task

dnhkng commented Jan 27, 2024

Exactly. The tests I'm planning are to see how it works in practice.

As for how the algorithm works, does it sum the probabilities of all paths?

E.g., assume we are using a character-level LLM and want to calculate the probability of the word "foobar"; it can be tokenised as:
[['f', 'o', 'o', 'b', 'a', 'r'], ['f', 'o', 'o', 'b', 'ar'], ['f', 'o', 'o', 'ba', 'r'], ['f', 'o', 'o', 'bar'], ['f', 'o', 'ob', 'a', 'r'], ['f', 'o', 'ob', 'ar'], ['f', 'o', 'oba', 'r'], ['f', 'o', 'obar'], ['f', 'oo', 'b', 'a', 'r'], ['f', 'oo', 'b', 'ar'], ['f', 'oo', 'ba', 'r'], ['f', 'oo', 'bar'], ['f', 'oob', 'a', 'r'], ['f', 'oob', 'ar'], ['f', 'ooba', 'r'], ['f', 'oobar'], ['fo', 'o', 'b', 'a', 'r'], ['fo', 'o', 'b', 'ar'], ['fo', 'o', 'ba', 'r'], ['fo', 'o', 'bar'], ['fo', 'ob', 'a', 'r'], ['fo', 'ob', 'ar'], ['fo', 'oba', 'r'], ['fo', 'obar'], ['foo', 'b', 'a', 'r'], ['foo', 'b', 'ar'], ['foo', 'ba', 'r'], ['foo', 'bar'], ['foob', 'a', 'r'], ['foob', 'ar'], ['fooba', 'r']]

Of course, we can split the word into chunks and recursively calculate the probability: build the graph, find all the sub-chunks, and get the logits required to fill in the branches. Once this is done, we can use those probabilities to compute the probability of each path and back-calculate the average probability per token. Is this what CAPPr does?
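To illustrate the combinatorics behind that idea, here is a small sketch (purely hypothetical, and not what CAPPr does, per the reply below) that enumerates every way a character-level vocabulary could split the word:

```python
def segmentations(word: str):
    """Yield every way to split `word` into contiguous chunks."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest

paths = list(segmentations("foobar"))
print(len(paths))  # 2 ** 5 = 32 splits for a 6-character word, counting ["foobar"] itself
```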

kddubey commented Jan 27, 2024

As for how the algorithm works, does it sum the probabilities of all paths?

No. It tokenizes end_of_prompt + completion and calculates the log-probability of each of those tokens given the prompt. If end_of_prompt is a whitespace (almost always the case in practice, and always implicitly so for SentencePiece tokenizers), then tokenizer(end_of_prompt + completion) is deterministic for every popular BPE and SentencePiece tokenizer. Even if there are nuances I'm not considering, I'd be surprised to find that they're worrisome. Evidence so far suggests that CAPPr performs fine.
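As a quick spot check of that point (using GPT-2's BPE tokenizer via transformers purely as an example, not the cappr source): tokenizing end_of_prompt + completion gives one fixed token sequence, so there is a single path to score rather than the many splits listed above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

end_of_prompt = " "
for completion in ("foobar", "positive", "not spam"):
    ids = tokenizer(end_of_prompt + completion).input_ids
    print(repr(completion), "->", tokenizer.convert_ids_to_tokens(ids))
```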

See my post about it here and related work for more detail on how it works.
