Update on GGML RKNPU2 backend and RKNPU2 1.6.0

Recently Rockchip released a new version of the RKNPU2 SDK. It enables larger matrix multiplications (up to K=10240) and adds INT4 support. My last post briefly described how I built the RKNPU2 backend for GGML. This time, I want to share what I was able to achieve with the new SDK release.

Before I start: someone beat me to adapting the RKNPU2 backend to the new SDK. I am very happy to see someone fork my repo and add support for the new SDK. Damn, Rockchip is bad at keeping backward compatibility. I merged his changes into my repo and developed on top of them. His changes include, but are not limited to:

  • Use FP16 instead of INT8 for matrix multiplication (slow, but we know if LLaMA is actually working)
  • Changes to adapt to the new SDK (see the sketch after this list)
  • Code cleanup
  • Debug print
  • Sync up with the latest LLaMA.cpp
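
For context, here is roughly what driving the new matmul API looks like. This is a minimal sketch from my reading of rknn_matmul_api.h; treat the exact struct fields and enum names as assumptions, since they may differ between SDK versions:

    #include <string.h>
    #include "rknn_matmul_api.h" // RKNPU2 matmul API

    // Hedged sketch: one FP16 x FP16 -> FP32 matmul on the NPU.
    int npu_matmul_fp16(int M, int K, int N) {
        rknn_matmul_info info;
        memset(&info, 0, sizeof(info));
        info.M = M;
        info.K = K; // 1.6.0 raises the K limit to 10240
        info.N = N;
        info.type = RKNN_FLOAT16_MM_FLOAT16_TO_FLOAT32;
        info.B_layout = 0;  // 0 = normal layout, 1 = native (reordered) layout
        info.AC_layout = 0;

        rknn_matmul_ctx ctx;
        rknn_matmul_io_attr io_attr;
        if (rknn_matmul_create(&ctx, &info, &io_attr) != RKNN_SUCC)
            return -1;

        // The SDK reports how big each buffer must be, padding included.
        rknn_tensor_mem *A = rknn_create_mem(ctx, io_attr.A.size);
        rknn_tensor_mem *B = rknn_create_mem(ctx, io_attr.B.size);
        rknn_tensor_mem *C = rknn_create_mem(ctx, io_attr.C.size);

        // ... fill A->virt_addr and B->virt_addr with FP16 data here ...

        rknn_matmul_set_io_mem(ctx, A, &io_attr.A);
        rknn_matmul_set_io_mem(ctx, B, &io_attr.B);
        rknn_matmul_set_io_mem(ctx, C, &io_attr.C);
        rknn_matmul_run(ctx); // result lands in C->virt_addr as FP32

        rknn_destroy_mem(ctx, A);
        rknn_destroy_mem(ctx, B);
        rknn_destroy_mem(ctx, C);
        rknn_matmul_destroy(ctx);
        return 0;
    }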

I made my own changes to his code, including but not limited to:

  • Fix FP16 reordering (a sketch of the layout follows this list)
  • Re-implement INT8 support
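
The reordering packs the weight matrix into the block layout the NPU's native mode expects. A minimal sketch, assuming the native FP16 B layout is [N/16][K/32] blocks of 16x32 and that K and N are already padded to multiples of the block sizes (the block sizes are my reading of the SDK header and may be off):

    #include <stddef.h>
    #include <stdint.h>

    // Hedged sketch: pack a row-major K x N FP16 matrix into an assumed
    // native layout of [N/16][K/32][16][32]. The block sizes (subN = 16,
    // subK = 32) are assumptions; check rknn_matmul_api.h for your SDK.
    static void reorder_b_fp16(const uint16_t *src, uint16_t *dst, int K, int N) {
        const int subK = 32, subN = 16;
        for (int n = 0; n < N; n++) {
            for (int k = 0; k < K; k++) {
                size_t block = (size_t)(n / subN) * (K / subK) + (size_t)(k / subK);
                size_t out   = block * subN * subK + (size_t)(n % subN) * subK + (k % subK);
                dst[out] = src[(size_t)k * N + n]; // src is row-major [K][N]
            }
        }
    }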

So far I have FP16 working perfectly, generating coherent results (but much slower than CPU only).

> bin/main -m llama-2-7b.Q8_0.gguf  -p "The world is a " -n 512  -t 0 -ngl 20 --top-k 1 # Compiled with FP16
...
he world is a 3D place. We live in it, we work in it and we play in it.
The world is also a 2D place. We see it on TV, we read about it in books and we experience it through the eyes of others.
But what if you could experience the world as it really is? What if you could walk around your house or office and see everything that’s happening right now?
What if you could go to a concert and hear every note being played, even if you were sitting in the back row?
What if you could watch a movie and feel like you were actually there?
That’s what virtual reality is all about. It’s not just about gaming anymore; it’s about experiencing life as it really is. And that’s why we’re so excited to be working with Oculus Rift, the world’s leading VR platform.
Oculus Rift is a virtual reality headset that allows you to experience 3D environments in an immersive way. It’s like being transported into another world, and it’s incredibly realistic.

INT8 is interesting. I fixed a few bugs and it is doing much better now, but it can still be incoherent at times. However! Notice that I can enable 20 layers on the NPU and still get somewhat coherent output. This is a huge improvement over the previous version. I am still working on it to see if I can work out the kinks.

> bin/main -m llama-2-7b.Q8_0.gguf  -p "The world is a " -n 512  -t 0 -ngl 20 --top-k 1 # Compiled with INT8
...
The world is a 1950s-like place.
The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place. The world is a 1950s-like place.
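
For reference, the NPU's INT8 mode forces activations through something like symmetric per-tensor quantization. Here is a minimal sketch of the idea (an illustration, not the exact code in the backend): with a single scale shared across the whole matrix, one outlier value crushes the resolution of everything else, which is a plausible source of the incoherence above.

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    // Hedged sketch: symmetric per-tensor INT8 quantization. Returns the
    // scale; dequantize with x ~ q * scale.
    static float quantize_to_int8(const float *src, int8_t *dst, size_t n) {
        float amax = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float v = fabsf(src[i]);
            if (v > amax) amax = v;
        }
        float scale = amax / 127.0f; // one scale for the entire tensor
        for (size_t i = 0; i < n; i++)
            dst[i] = (int8_t)roundf(src[i] / (scale > 0.0f ? scale : 1.0f));
        return scale;
    }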

Both, however, generate output slower than CPU only. We can see that FP16's output closely follows the CPU-only output initially, but diverges as the output gets longer. INT8 just doesn't stand a chance.

> bin/main -m llama-2-7b.Q8_0.gguf  -p "The world is a " -n 512  -t 0 -ngl 0 --top-k 1 # CPU only
...
The world is a 3D place. We live in it, we work in it and we play in it. But how do we capture the essence of this world? How can we share our experiences with others?
The answer lies in Virtual Reality (VR). VR is an immersive experience that allows us to explore new places, learn about different cultures and even feel like we’re right there in the middle of it all. It’s a powerful tool for education, entertainment and communication.
But what exactly is VR? And how does it work? In this blog post, we’ll take a closer look at the technology behind Virtual Reality and explore some of its most exciting applications. We’ll also discuss the future of VR and how it could change our lives forever. So if you want to learn more about this amazing new world, read on!

The following are benchmarks for processing the prompt "In OpenBSD documentation, the number in parentheses after "pledge" represents the number of promises that a process is making to the operating system. It is a security feature that allows a program to limit its privileges and access to resources. In the case of "pledge(2)," it means the process is making two promises to the operating system." and generating 1 token with the LLaMA2-chat-7B model. They show that with Q8 quantization, the NPU can be fast enough to beat the CPU at large batch sizes, but loses at lower quantizations.

Q4_K_S on CPU

llama_print_timings:        load time =   11375.61 ms
llama_print_timings:      sample time =       0.22 ms /     1 runs   (    0.22 ms per token,  4504.50 tokens per second)
llama_print_timings: prompt eval time =   33841.50 ms /    75 tokens (  451.22 ms per token,     2.22 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   33842.20 ms

Q8 on CPU

llama_print_timings:        load time =    1943.71 ms
llama_print_timings:      sample time =       0.21 ms /     1 runs   (    0.21 ms per token,  4716.98 tokens per second)
llama_print_timings: prompt eval time =   48348.28 ms /    75 tokens (  644.64 ms per token,     1.55 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   48350.05 ms

Q8 on RK3588 NPU (all layers, INT8 mode, impractical, error too large)

llama_print_timings:        load time =   24271.59 ms
llama_print_timings:      sample time =       0.77 ms /     1 runs   (    0.77 ms per token,  1297.02 tokens per second)
llama_print_timings: prompt eval time =   34261.36 ms /    75 tokens (  456.82 ms per token,     2.19 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   34263.36 ms

Q8 on RK3588 NPU (20 layers, INT8 mode)

llama_print_timings:        load time =   16006.71 ms
llama_print_timings:      sample time =       0.33 ms /     1 runs   (    0.33 ms per token,  3058.10 tokens per second)
llama_print_timings: prompt eval time =   39399.73 ms /    75 tokens (  525.33 ms per token,     1.90 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)

Q8 on RK3588 NPU (20 layers, FP16 mode)

llama_print_timings:        load time =   19853.35 ms
llama_print_timings:      sample time =       0.22 ms /     1 runs   (    0.22 ms per token,  4608.29 tokens per second)
llama_print_timings: prompt eval time =   40314.11 ms /    75 tokens (  537.52 ms per token,     1.86 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   40315.74 ms
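
Summarizing the prompt eval numbers:

    Configuration                  | ms per token | tokens per second
    -------------------------------|--------------|------------------
    Q4_K_S on CPU                  |       451.22 |              2.22
    Q8 on CPU                      |       644.64 |              1.55
    Q8 on NPU (all layers, INT8)   |       456.82 |              2.19
    Q8 on NPU (20 layers, INT8)    |       525.33 |              1.90
    Q8 on NPU (20 layers, FP16)    |       537.52 |              1.86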


Some issues still remain, but we are getting there:

  • Both FP16 and INT8 are slower than CPU only at low batch sizes
  • INT8 is not coherent

I think we really need Rockchip to support mixed-precision matrix multiplication, like FP16 activations multiplied with INT8 weights, to resolve this. Or better yet, FP16 inputs with INT4 weights. I can only hope.

You can find the code here:
