
llama-node/llama-cpp uses more memory than standalone llama.cpp with the same parameters #85

Open
fardjad opened this issue May 28, 2023 · 3 comments

Comments

@fardjad
Contributor

fardjad commented May 28, 2023

I'm trying to process a large text file. For the sake of reproducibility, let's use this. The following code:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "node:path";
import fs from "node:fs";

const model = path.resolve(
    process.cwd(),
    "/path/to/model.bin"
);
const llama = new LLM(LLamaCpp);
const prompt = fs.readFileSync("./path/to/file.txt", "utf-8");

await llama.load({
    enableLogging: true,
    modelPath: model,

    nCtx: 4096, // requested context window; this also sizes the KV cache
    nParts: -1,
    seed: 0,
    f16Kv: false, // keep the KV cache entries in f32
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: false,
    nGpuLayers: 0,
});

await llama.createCompletion(
    {
        nThreads: 8,
        nTokPredict: 256,
        topK: 40,
        prompt,
    },
    (response) => {
        process.stdout.write(response.token);
    }
);

The code above crashes the process with a segmentation fault:

ggml_new_tensor_impl: not enough space in the scratch memory
segmentation fault  node index.mjs

When I compile the exact same version of llama.cpp and run it with the following args:

./main -m /path/to/ggml-vic7b-q5_1.bin -t 8 -c 4096 -n 256 -f ./big-input.txt

It runs fine. It does print a warning that the requested context size is larger than what the model supports, but it does not crash with a segfault.

Comparing the logs:

llama-node logs:
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 4096.00 MB
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::context] - AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
ggml_new_tensor_impl: not enough space in the scratch memory
llama.cpp logs:
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 561 (5ea4339)
main: seed  = 1685284790
llama.cpp: loading model from ../my-llmatic/models/ggml-vic7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 2048.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0

Looks like the ggml context size in llama-node is ~4.7 GB (vs ~73 KB in llama.cpp), and the KV self size is twice as large as what llama.cpp allocated (4096 MB vs 2048 MB).
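For what it's worth, the factor of two in kv self size looks consistent with the KV cache element type. Here is a back-of-the-envelope sketch (kvCacheMiB is just an illustrative helper; n_layer = 32 and n_embd = 4096 come from the logs above, and I'm assuming the cache holds a K and a V tensor of n_layer * n_ctx * n_embd elements each):

// 2 tensors (K and V) * n_layer * n_ctx * n_embd * bytes per element, in MiB
const kvCacheMiB = (nCtx, bytesPerElement) =>
    (2 * 32 * nCtx * 4096 * bytesPerElement) / (1024 * 1024);

console.log(kvCacheMiB(4096, 4)); // f32 entries (f16Kv: false) -> 4096, matches the llama-node log
console.log(kvCacheMiB(4096, 2)); // f16 entries                -> 2048, matches the llama.cpp log

If that reading is right, f16Kv: false would account for the doubled kv self size, but not for the ~4.7 GB ggml ctx size.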

I'm not sure whether I'm missing something in my load/invocation config or whether this is an issue in llama-node. Can you please have a look?

@hlhr202
Member

hlhr202 commented May 29, 2023

Sure, I will look into this soon.

@hlhr202
Member

hlhr202 commented May 29, 2023

I guess it was caused by useMmap?
llama.cpp enables mmap by default. From what I found in your llama-node code example, you didn't enable mmap to reuse the file cache in memory; that is probably why you ran out of memory, I think.
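For example, keeping everything else from your snippet and only flipping the mmap flag (untested on my side, just to illustrate what I mean):

await llama.load({
    enableLogging: true,
    modelPath: model,
    nCtx: 4096,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true, // was false in the original example
    nGpuLayers: 0,
});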

@fardjad
Contributor Author

fardjad commented May 29, 2023

I'm afraid that is not the case. Before you updated the version of llama.cpp, I couldn't run my example (with or without setting useMmap). Now it doesn't crash, but it doesn't seem to be doing anything either.

I recorded a video comparing llama-node and llama.cpp:

llama-node-issue.mp4

As you can see, llama-node appears to freeze on the larger input, whereas llama.cpp starts emitting tokens after roughly 30 seconds.
