[Serving] PagedKVCache Quantization#2663

davidpissarra · 2024-07-16T00:53:47Z

KV Cache might be a burden under tight memory constraints, and cache quantization can reduce its memory requirement by roughly 75% (float16 -> int3 KV cache). As a result, this PR intends to introduce initial support for KV Cache quantization, introducing two new quant schemes (q3f16kv and q4f16kv).

On Llama-3 (context_window_size = 8192) kv cache memory usage:

1113 MB (q4f16_1: float16 kv cache)
345 MB (q4f16kv: int4 kv cache)
281 MB (q3f16kv: int3 kv cache)

Depends on: apache/tvm#17159

cc @MasterJH5574 @tqchen

XJY990705 · 2024-09-05T02:51:16Z

I want to use this pr, and I notice the TVM branch it use is f5f048b, but I got some bugs while compiling mlc-llm. Would you please tell me the TVM version you use in this pr? I'm sure the problems I got is concerned with wrong TVM version

davidpissarra · 2024-09-05T13:25:17Z

Hi @XJY990705 , it is still actually on f5f048b. You may be able to run it if you build everything from this branch (including tvm). I will rebase it in the meantime.

XJY990705 · 2024-09-06T02:10:42Z

@davidpissarra thank you for your reply, I will try again

XJY990705 · 2024-09-07T09:15:56Z

I have already solved this problem by this pr tlc-pack/libflash_attn#8
maybe when I swiched to f5f048b branch, this modification is lost. Anyway, thank you for your help!!

XJY990705 · 2024-09-09T08:03:41Z

@davidpissarra I noticed 3rdparty/tvm/src/runtime/relax_vm/paged_kv_cache.cc is not changed, and mismatched with python/mlc_llm/nn/kv_cache.py when calling TVM_REGISTER_GLOBAL("vm.builtin.paged_attention_kv_cache_create_reduced"). Is there any commit you forget?

kazunator · 2024-10-11T04:09:49Z

Is this still being worked on, or is there already a kv cache quantization implemented?

davidpissarra · 2024-10-11T15:47:10Z

Is this still being worked on, or is there already a kv cache quantization implemented?

Hi @kazunator. It is implemented indeed. Since it hasn't been merged yet, you should be able to use it by building this branch and then using the q3f16kv or q4f16kv quantization schemes.

kazunator · 2024-10-11T22:54:26Z

@davidpissarra do you have example usage I can run quickly? Or just where I can specify the quantization scheme?

davidpissarra · 2024-10-12T18:12:37Z

@kazunator you should be able to run it by following the typical MLC model compilation flow (the following steps should be enough). For more details, feel free refer to https://llm.mlc.ai/docs/compilation/compile_models.html

# 1. Convert model parameters mlc_llm convert_weight ./dist/models/Meta-Llama-3-8B/ \ --quantization q4f16kv \ -o dist/Meta-Llama-3-8B-q4f16kv-MLC # 2. gen_config: generate mlc-chat-config.json and process tokenizers mlc_llm gen_config ./dist/models/Meta-Llama-3-8B/ \ --quantization q4f16kv --conv-template llama \ -o dist/Meta-Llama-3-8B-q4f16kv-MLC/ # 3. compile: compile model library with specification in mlc-chat-config.json mlc_llm compile ./dist/Meta-Llama-3-8B-q4f16kv-MLC/mlc-chat-config.json \ --device cuda -o dist/libs/Meta-Llama-3-8B-q4f16kv-cuda.so

kazunator · 2024-10-13T07:15:20Z

@davidpissarra oh that's cool. Is it possible to use it with just python tho? I'm trying to get it running on collab to test out your branch and I don't have a big GPU with me rn. I can test it on company A100 later in the week

XJY990705 · 2024-12-21T13:34:34Z

Sorry to bother you again, I tried your method and I got the exact ppl for the same datasets using int3 quantization, int4 quantization and no quantization. And I want to check the kv cache value to see if it has been quantized, but add print lines in tir functions is useless(maybe removed after compilation). Do you have some options to debug and print the kv cache value? I want to print the value before and after quantization for error calculation.

kv quantize
d2acb93

davidpissarra mentioned this pull request Jul 16, 2024
[KVCache] PagedKVCache Quantization apache/tvm#17159
Open

XJY990705 mentioned this pull request Sep 4, 2024
[Tracking] PagedKVCache Quantization #2663 中的TVM版本 #2880
Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Serving] PagedKVCache Quantization#2663

[Serving] PagedKVCache Quantization #2663

davidpissarra commented Jul 16, 2024•
edited
Loading

XJY990705 commented Sep 5, 2024

davidpissarra commented Sep 5, 2024•
edited
Loading

XJY990705 commented Sep 6, 2024

XJY990705 commented Sep 7, 2024

XJY990705 commented Sep 9, 2024

kazunator commented Oct 11, 2024

davidpissarra commented Oct 11, 2024

kazunator commented Oct 11, 2024

davidpissarra commented Oct 12, 2024

kazunator commented Oct 13, 2024

XJY990705 commented Dec 21, 2024

[Serving] PagedKVCache Quantization#2663

Are you sure you want to change the base?

[Serving] PagedKVCache Quantization #2663

Conversation

davidpissarra commented Jul 16, 2024• edited Loading

XJY990705 commented Sep 5, 2024

davidpissarra commented Sep 5, 2024• edited Loading

XJY990705 commented Sep 6, 2024

XJY990705 commented Sep 7, 2024

XJY990705 commented Sep 9, 2024

kazunator commented Oct 11, 2024

davidpissarra commented Oct 11, 2024

kazunator commented Oct 11, 2024

davidpissarra commented Oct 12, 2024

kazunator commented Oct 13, 2024

XJY990705 commented Dec 21, 2024

davidpissarra commented Jul 16, 2024•
edited
Loading

davidpissarra commented Sep 5, 2024•
edited
Loading