Using dynamic input shapes on RKNN/RK3588

Quick documentation for myself.

Today, I dabbled in accelerating TTS using the RK3588's NPU. It works really well! I'm seeing a real-time factor (RTF) of 0.15 during my initial tests, and I believe I can push it even further. One thing I had to do was use dynamic input shapes. RKNN traditionally requires you to specify the input shape of the model at build time. This doesn't work for me, as it is practically impossible to force sentences to be of a certain length. I'm using VITS and trying to accelerate the decoder part of the model. Even with manual chunking, I still need the model to accept variable-length inputs to be efficient.

Say the input to the decoder is of shape [1, 192, 55], where 55 is the number of compressed speech frames. If I compile the model to always accept 55 frames, then I will have to pad the input to 55 frames when I don't have enough. This can be especially slow when synthesizing short sentences. The RKNN user guide mentions that you can use dynamic input shapes, but the documentation is not very clear. I ended up messing around with the RKNN API and found out how to do it.
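To put a rough number on the padding cost, here is a minimal sketch. The helper name `padding_overhead` is made up for illustration, and it assumes the decoder's work grows roughly in proportion to the number of frames, which I have not measured on the NPU.

```python
def padding_overhead(actual_frames: int, compiled_frames: int = 55) -> float:
    """Fraction of the compiled input that is padding rather than real frames."""
    if not 0 < actual_frames <= compiled_frames:
        raise ValueError("actual_frames must be in 1..compiled_frames")
    return (compiled_frames - actual_frames) / compiled_frames


# Print the wasted fraction for a few sentence lengths against a fixed 55-frame model.
for frames in (20, 40, 55):
    print(f"{frames} frames -> {padding_overhead(frames):.0%} padding")
```

A 20-frame input padded to 55 frames is about 64% padding (35 of 55 frames), which is why a single fixed shape hurts short sentences the most.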

Here is what the user guide says:

# For example, when using RKNN-Toolkit2 to convert a Caffe model, the python code example is as follows:
dynamic_input = [
    [[1,3,224,224]], # set the first shape for all inputs
    [[1,3,192,192]], # set the second shape for all inputs
    [[1,3,160,160]], # set the third shape for all inputs
]
# Pre-process config
rknn.config(mean_values=[103.94, 116.78, 123.68],
    std_values=[58.82, 58.82, 58.82], quant_img_RGB2BGR=True,
    dynamic_input=dynamic_input)

What the actual heck is this? I tried to use this code, but whenever I tried to load the source ONNX model, I got an error saying that the model has missing input axes and that I need to fill them in manually.

Turns out, I need to specify the input shape regardless of whether I enable dynamic input shapes or not. Elsewhere in the documents, it is suggested that the RKNN compiler builds against the supplied input shape (using "build" very loosely here) and then switches between the dynamic input shapes at runtime. Thus, "dynamic input shapes" won't work when changing the input shape would change constant data or the graph itself.

Note that the order of the shapes in each entry of dynamic_input must also match the order of the names in inputs. The following is the code I use to build the model:

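from rknn.api import RKNN

rknn = RKNN()
# Every shape set lists the inputs in the same order as `inputs` below.
rknn.config(target_platform='rk3588',
    dynamic_input=[[[1, 192, 55], [1, 1, 55], [1, 512, 1]],
                   [[1, 192, 50], [1, 1, 50], [1, 512, 1]],
                   [[1, 192, 24], [1, 1, 24], [1, 512, 1]],
                   [[1, 192, 20], [1, 1, 20], [1, 512, 1]]])
# 'decoder.onnx' is a placeholder path. input_size_list is still required with dynamic_input.
rknn.load_onnx(model='decoder.onnx',
    input_size_list=[[1, 192, 55], [1, 1, 55], [1, 512, 1]],
    inputs=['z', 'y_mask', 'g'])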
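Since every shape set has to follow the order of inputs, one way to keep them from drifting apart is to generate both lists from a single per-name table. This is only a sketch: `shapes_for` and `frame_buckets` are hypothetical names of my own, not part of RKNN-Toolkit2, and the shapes are the ones from the build code above.

```python
inputs = ['z', 'y_mask', 'g']


def shapes_for(frames, names=inputs):
    """Return one shape per input, in the order given by `names`."""
    by_name = {
        'z': [1, 192, frames],      # latent frames fed to the decoder
        'y_mask': [1, 1, frames],   # mask over the same frames
        'g': [1, 512, 1],           # independent of the frame count
    }
    return [by_name[name] for name in names]


frame_buckets = [55, 50, 24, 20]
dynamic_input = [shapes_for(t) for t in frame_buckets]
input_size_list = shapes_for(frame_buckets[0])

print(dynamic_input)
print(input_size_list)
```

The two resulting lists can then be passed as the dynamic_input and input_size_list arguments, and reordering inputs reorders every shape set with it.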

There's nothing special needed during inference. Just feed the model with the correct input shape and it will work.
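"The correct input shape" here presumably means one of the shapes listed in dynamic_input, so anything in between still has to be padded up to the next compiled length. A minimal sketch of that step, assuming the four frame counts from the build code; `pick_bucket` and `pad_last_axis` are hypothetical helpers, and filling the padded part of y_mask with zeros is my assumption about how the mask is meant to be used.

```python
import bisect

FRAME_BUCKETS = [20, 24, 50, 55]  # sorted frame counts taken from dynamic_input


def pick_bucket(frames, buckets=FRAME_BUCKETS):
    """Return the smallest compiled frame count that can hold `frames`."""
    i = bisect.bisect_left(buckets, frames)
    if i == len(buckets):
        raise ValueError(f"{frames} frames exceeds the largest compiled shape; chunk the input first")
    return buckets[i]


def pad_last_axis(rows, target, fill=0.0):
    """Right-pad each row (a list of per-frame values) to `target` frames."""
    return [row + [fill] * (target - len(row)) for row in rows]


frames = 22
target = pick_bucket(frames)
y_mask = pad_last_axis([[1.0] * frames], target)  # one channel, padded with zeros
print(target, len(y_mask[0]))
```

With these buckets a 22-frame input is padded by only 2 frames instead of 33. The audio produced from the padded frames would still need to be trimmed off afterwards.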

This is more than stupid. It works but... yeah. That's 1 hour of my life I won't get back. Plus 20 minutes of writing this blog post. I hope this helps someone else.

Martin Chang
Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.
