Rockchip NPUs and deploying scikit-learn models on them
My first experience with my RK3588 board was mildly infuriating. I bought my Orange Pi 5 Plus for its quite capable NPU. However, the low-level matrix multiplication API segfaults every single time. After a long period of head-banging I decided to drop that approach for now and fall back to the barely working rknn-toolkit2 high-level interface. Even that has its own set of ridiculous problems. I thought converting scikit-learn models, from the most basic and widely used ML library, would be a breeze. I was wrong. I ended up writing my own converter. With this experience I'll be able to tackle larger and more useful models in the future.
TL;DR (if you just want to use it)
The converter I wrote, scirknn, is hosted on GitHub. The core of the project is two Python files: sklearn2rknn.py and scirknn.py. The former converts scikit-learn models into rknn-toolkit2's format. The latter is a wrapper around rknn-toolkit2 so it behaves like scikit-learn's MLPClassifier/MLPRegressor.
Let's say you have a trained scikit-learn MLPClassifier. You can convert it to rknn-toolkit2's format in two ways: by calling sklearn2rknn.convert, or by invoking sklearn2rknn as a script (both are sketched after the training code below).
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = MLPClassifier(random_state=1, max_iter=300)
clf.fit(X_train, y_train)
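A minimal sketch of both routes follows. Note that the exact argument layout and script flags here are my own illustration; check the scirknn README for the authoritative usage.

import sklearn2rknn
# Option 1: call the converter from Python. The (model, output path,
# target platform) argument layout is assumed for illustration.
sklearn2rknn.convert(clf, "iris.rknn", target_platform="rk3588")
# Option 2: invoke the converter as a script on a pickled model
# (file name and flags likewise illustrative):
#   python3 sklearn2rknn.py iris.pkl iris.rknn rk3588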
Either way, you should end up with iris.rknn and iris.rknn.json. These are the actual model and the model's metadata; the metadata is generated by sklearn2rknn for the wrapper. To use the model, you can use scirknn as a drop-in replacement for scikit-learn's MLPClassifier. But first, copy iris.rknn and iris.rknn.json to your dev board.
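On the board, the wrapper mirrors the scikit-learn API. A minimal sketch, assuming the wrapper is constructed from the path of the converted model (and picks up iris.rknn.json on its own):

import scirknn
# Load the converted model; the metadata JSON next to it supplies
# class labels and output shape for the wrapper.
clf = scirknn.MLPClassifier("iris.rknn")
# Same predict() interface as scikit-learn; one iris sample.
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))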
The NPU itself on the RK3588 supports operating on many data types: regular 32-bit floating point, int16, int8, float16 and even int4. However, rknn-toolkit2 only supports int8 and float16. By default, without quantizing, the model is converted to float16. In this mode the NPU has a peak performance of 1.5 TOPS (with many asterisks; most operators besides convolution and data movement do not support multi-core cooperation). Currently there's no way to fully utilize the 6 TOPS of compute, as int4 quantization is not supported by the conversion process. The best rknn-toolkit2 can do is 3 TOPS with int8 (at the cost of some accuracy; again, asterisks apply). To do so, call sklearn2rknn.convert with the quantization argument and provide an example dataset. RKNN uses the provided dataset to calibrate the quantization.
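A sketch of a quantized conversion; the parameter names for enabling quantization and passing the calibration set are assumptions on my part, so verify them against the repo:

import sklearn2rknn
# int8 quantization needs representative inputs so RKNN can
# calibrate value ranges. Parameter names below are illustrative.
sklearn2rknn.convert(clf, "iris.rknn", target_platform="rk3588",
                     quantization=True, example_input=X_train)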
The NPU documentation (some of which is in Chinese; good thing I speak it) heavily implies that it is designed as a vision model processor. What I'm doing is really abusing its capabilities. Not that I care; that's the fun, right? However, this also means that models must adhere to some strict requirements, lest stuff starts running on the CPU.
NPU architecture
The best way to describe the NPU on the RK3588 is as a fixed-pipeline dataflow processor, with 3 of them on each RK3588 chip. It looks almost like the processor I worked on in college, but with a much simpler control scheme and more flexible dataflow. That's not to say the control interface is simple, but it's not executing a program. Instead, there's a huge set of registers that controls how data goes in and out of the NPU. Once execution starts, the dataflow is fixed; it does not change until the next execution. If you are familiar with Texas Instruments' C7x DSP, it's similar to the matrix unit plus the streaming engine, but directly exposed to the main CPU instead of being a coprocessor of the DSP.
Image: Block diagram of a RK3588 NPU core
The NPU core runs at 1 GHz and can perform 2048 int4, 1024 int8, or 512 fp16 operations per cycle. The NPU is also multi-core: each RK3588 SoC comes with 3 NPU cores, which adds up to 6 TOPS of compute. However, since rknn-toolkit2 does not support quantization to int4, the best we can do is 3 TOPS with int8. Even that comes with asterisks. As of RKNN 1.5.0, only convolution and some data movement operators support multi-core; all other operations run on a single core. That includes matrix multiplication, LSTM, GRU, etc. Thus, if we were to attempt to run language models on the NPU, we'd be limited to at best 1 TOPS @ int8 or 500 GFLOPS @ fp16.
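The headline figures fall straight out of the arithmetic, as this little sanity check shows:

# Peak throughput from the ops/cycle, clock, and core count given above.
clock_hz = 1e9                                    # 1 GHz NPU clock
cores = 3                                         # 3 NPU cores per RK3588
for dtype, ops in {"int4": 2048, "int8": 1024, "fp16": 512}.items():
    per_core = ops * clock_hz / 1e12              # TOPS on a single core
    print(f"{dtype}: {per_core:.2f} TOPS/core, {per_core * cores:.2f} TOPS total")
# int4: 2.05 TOPS/core, 6.14 TOPS total  -> the advertised 6 TOPS
# int8: 1.02 TOPS/core, 3.07 TOPS total  -> 3 TOPS, the current ceiling
# fp16: 0.51 TOPS/core, 1.54 TOPS total  -> ~500 GFLOPS on a single core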
CPU fallbacks
Because the NPU is very static, it's not possible to run arbitrary layers on it. Strict alignment and size constraints apply. When a layer does not match these strict requirements, RKNN runs the layer on the CPU instead. Of course, this is not ideal, so we ought to avoid fallbacks as much as possible.
The detailed list of operators and their requirements can be found in the compiler operation manual, though it's in Chinese.