Mixtile Blade 3 (RK3588) OpenCL performance

Mixtile Blade 3[1] is an interesting dev board. It runs on the RK3588 SoC, the successor to the RK3399. A whole lot of other boards use the same SoC, including the QuartzPro64[2], ITX-3588J[3] and Rock Pi 5[4]. The 16GB Blade 3 is priced at $369, much more expensive than the Rock Pi 5 at $189 and the expected price of the QuartzPro64 at ~$300.

The Mixtile Blade 3, however, has a trick up its sleeve. It can network with other Blade 3s directly over PCIe, at up to what I assume to be 32Gbps (4GB/s), and there is a custom cluster case that houses 4 nodes in a single box. The vendor has helpfully set up a demo machine so customers can log in and try the board before purchase. So I took the liberty of running some OpenCL benchmarks. I need some numbers too, to decide if this is a good board.

[1]: Mixtile Blade 3

[2]: Pine64 QuartzPro64

[3]: Firefly ITX-3588J

[4]: Rock Pi 5

root@blade3:~/clpeak/build# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       520Mi        13Gi        33Mi       1.3Gi        14Gi
Swap:             0B          0B          0B

root@blade3:~/clpeak/build# ./clpeak 

Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 24.90
      float2  : 26.84
      float4  : 27.19
      float8  : 13.56
      float16 : 13.11

    Single-precision compute (GFLOPS)
      float   : 248.73
      float2  : 470.16
      float4  : 466.81
      float8  : 435.33
      float16 : 411.15

    Half-precision compute (GFLOPS)
      half   : 441.93
      half2  : 878.47
      half4  : 909.91
      half8  : 886.29
      half16 : 845.66

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 125.12
      int2  : 125.74
      int4  : 125.19
      int8  : 123.79
      int16 : 124.38

    Integer compute Fast 24bit (GIOPS)
      int   : 125.30
      int2  : 125.82
      int4  : 125.16
      int8  : 123.82
      int16 : 124.39

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.15
      enqueueReadBuffer               : 9.28
      enqueueWriteBuffer non-blocking : 8.13
      enqueueReadBuffer non-blocking  : 9.29
      enqueueMapBuffer(for read)      : 64.52
        memcpy from mapped ptr        : 10.29
      enqueueUnmap(after write)       : 65.33
        memcpy to mapped ptr          : 10.20

    Kernel launch latency : 59.75 us

Some interesting things can be seen in the above results:

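24-bit integer FMA gets no fast path: the Fast 24bit test lands on the same ~125 GIOPS as the regular integer test.
Map buffer + memcpy is faster than direct read/write: memcpy through the mapped pointer runs at about 10.2 GB/s, versus 8.15 to 9.28 GB/s for enqueueWriteBuffer/enqueueReadBuffer.
There's roughly a 4-to-1 ratio between the FP and integer pipelines (about 470 GFLOPS vs. 125 GIOPS).
Either that, or integer compute runs on a scalar unit.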
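
The ratios behind these observations can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the clpeak output above, taking the best vector width in each category, and `serial_gbps` is a hypothetical helper that assumes the map/unmap and memcpy stages run back to back with no overlap, which clpeak itself does not confirm.

```python
# Figures copied from the clpeak output above (best vector width per category).
fp32_gflops = 470.16   # float2
fp16_gflops = 909.91   # half4
int32_giops = 125.74   # int2
int24_giops = 125.82   # int2, "Fast 24bit"

print(f"FP32 : INT32  = {fp32_gflops / int32_giops:.2f} : 1")
print(f"FP16 : FP32   = {fp16_gflops / fp32_gflops:.2f} : 1")
print(f"INT24 : INT32 = {int24_giops / int32_giops:.3f} : 1")


def serial_gbps(*stages):
    """Effective GB/s when the stages run one after another (assumption: no overlap)."""
    return 1.0 / sum(1.0 / s for s in stages)


read_direct = 9.28    # enqueueReadBuffer
write_direct = 8.15   # enqueueWriteBuffer
read_mapped = serial_gbps(64.52, 10.29)    # map for read, then memcpy from mapped ptr
write_mapped = serial_gbps(10.20, 65.33)   # memcpy to mapped ptr, then unmap

print(f"read : direct {read_direct:.2f} GB/s, map + memcpy {read_mapped:.2f} GB/s")
print(f"write: direct {write_direct:.2f} GB/s, memcpy + unmap {write_mapped:.2f} GB/s")
```

Under that no-overlap assumption the mapped write path still beats enqueueWriteBuffer, while the mapped read path ends up slightly below enqueueReadBuffer, so the memcpy advantage is clearest for writes.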

Here's the result from mixbench[5], which tells the same story.

[5]: mixbench - benchmark tool for evaluating GPUs on mixed operational intensity kernels

root@blade3:~/mixbench/mixbench-opencl/build# ./mixbench-ocl-ro
mixbench-ocl/read-only (v0.04)
Use "-h" argument to see available options
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
------------------------ Device specifications ------------------------
Platform:            ARM Platform
Device:              Mali-LODX r0p0/ARM
Driver version:      2.1
Address bits:        64
GPU clock rate:      1000 MHz
Total global mem:    15699 MB
Max allowed buffer:  15699 MB
OpenCL version:      OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Total CUs:           4
-----------------------------------------------------------------------
Buffer size:            256MB
Workgroup size:         256
Elements per workitem:  8
Workitem fusion degree: 4
Workitem stride:        NDRange
Buffer allocation:      Device allocated
Timer:                  CL event based
Warning:                Double precision computations are not supported
Loading kernel source file...
Precompilation of kernels... [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]
----------------------------------------------------------------------------- CSV data -----------------------------------------------------------------------------
Experiment ID, Single Precision ops,,,,              Double precision ops,,,,              Half precision ops,,,,                Integer operations,,, 
Compute iters, Flops/byte, ex.time,  GFLOPS, GB/sec, Flops/byte, ex.time,  GFLOPS, GB/sec, Flops/byte, ex.time,  GFLOPS, GB/sec, Iops/byte, ex.time,   GIOPS, GB/sec
            0,      0.250,    7.66,    4.38,  17.53,      0.125,    0.00,     inf,    inf,      0.500,    7.63,    8.79,  17.59,     0.250,    7.64,    4.39,  17.56
            1,      0.750,    7.55,   13.33,  17.77,      0.375,    0.00,     inf,    inf,      1.500,    7.55,   26.68,  17.79,     0.750,   11.85,    8.49,  11.32
            2,      1.250,    7.60,   22.08,  17.66,      0.625,    0.00,     inf,    inf,      2.500,    7.58,   44.25,  17.70,     1.250,    7.51,   22.34,  17.87
            3,      1.750,    7.56,   31.06,  17.75,      0.875,    0.00,     inf,    inf,      3.500,    7.56,   62.14,  17.75,     1.750,    7.74,   30.33,  17.33
            4,      2.250,    7.54,   40.06,  17.81,      1.125,    0.00,     inf,    inf,      4.500,    7.58,   79.65,  17.70,     2.250,    8.84,   34.16,  15.18
            5,      2.750,    7.55,   48.89,  17.78,      1.375,    0.00,     inf,    inf,      5.500,    7.57,   97.47,  17.72,     2.750,   10.28,   35.92,  13.06
            6,      3.250,    7.56,   57.73,  17.76,      1.625,    0.00,     inf,    inf,      6.500,    7.64,  114.24,  17.58,     3.250,   11.79,   37.00,  11.38
            7,      3.750,    7.67,   65.66,  17.51,      1.875,    0.00,     inf,    inf,      7.500,    7.70,  130.74,  17.43,     3.750,    9.21,   54.63,  14.57
            8,      4.250,    5.21,  109.44,  25.75,      2.125,    0.00,     inf,    inf,      8.500,    5.20,  219.21,  25.79,     4.250,    5.17,  110.28,  25.95
            9,      4.750,    5.20,  122.64,  25.82,      2.375,    0.00,     inf,    inf,      9.500,    5.22,  244.17,  25.70,     4.750,    5.51,  115.64,  24.35
           10,      5.250,    5.19,  135.72,  25.85,      2.625,    0.00,     inf,    inf,     10.500,    5.20,  270.86,  25.80,     5.250,    5.99,  117.68,  22.42
           11,      5.750,    5.21,  148.07,  25.75,      2.875,    0.00,     inf,    inf,     11.500,    5.21,  296.32,  25.77,     5.750,    6.47,  119.23,  20.74
           12,      6.250,    5.22,  160.78,  25.72,      3.125,    0.00,     inf,    inf,     12.500,    5.21,  321.74,  25.74,     6.250,    6.99,  120.01,  19.20
           13,      6.750,    5.20,  174.38,  25.83,      3.375,    0.00,     inf,    inf,     13.500,    5.18,  349.77,  25.91,     6.750,    7.49,  120.88,  17.91
           14,      7.250,    5.21,  186.84,  25.77,      3.625,    0.00,     inf,    inf,     14.500,    5.21,  373.61,  25.77,     7.250,    8.02,  121.30,  16.73
           15,      7.750,    5.20,  200.21,  25.83,      3.875,    0.00,     inf,    inf,     15.500,    5.19,  400.58,  25.84,     7.750,    8.63,  120.60,  15.56
           16,      8.250,    5.19,  213.29,  25.85,      4.125,    0.00,     inf,    inf,     16.500,    5.20,  426.15,  25.83,     8.250,    9.15,  121.05,  14.67
           17,      8.750,    5.20,  226.04,  25.83,      4.375,    0.00,     inf,    inf,     17.500,    5.21,  450.53,  25.74,     8.750,    9.66,  121.51,  13.89
           18,      9.250,    5.18,  239.84,  25.93,      4.625,    0.00,     inf,    inf,     18.500,    5.19,  478.80,  25.88,     9.250,   10.19,  121.84,  13.17
           20,     10.250,    5.17,  266.06,  25.96,      5.125,    0.00,     inf,    inf,     20.500,    5.19,  530.25,  25.87,    10.250,   11.24,  122.37,  11.94
           22,     11.250,    5.18,  291.23,  25.89,      5.625,    0.00,     inf,    inf,     22.500,    5.19,  581.48,  25.84,    11.250,   12.29,  122.86,  10.92
           24,     12.250,    5.21,  315.78,  25.78,      6.125,    0.00,     inf,    inf,     24.500,    5.19,  633.22,  25.85,    12.250,   13.33,  123.32,  10.07
           28,     14.250,    5.27,  362.84,  25.46,      7.125,    0.00,     inf,    inf,     28.500,    5.37,  712.89,  25.01,    14.250,   15.44,  123.85,   8.69
           32,     16.250,    5.63,  387.39,  23.84,      8.125,    0.00,     inf,    inf,     32.500,    5.77,  755.59,  23.25,    16.250,   17.54,  124.32,   7.65
           40,     20.250,    6.61,  410.87,  20.29,     10.125,    0.00,     inf,    inf,     40.500,    6.78,  801.87,  19.80,    20.250,   21.76,  124.90,   6.17
           48,     24.250,    7.68,  423.68,  17.47,     12.125,    0.00,     inf,    inf,     48.500,    7.82,  831.97,  17.15,    24.250,   25.99,  125.23,   5.16
           56,     28.250,    8.74,  433.81,  15.36,     14.125,    0.00,     inf,    inf,     56.500,    8.89,  852.87,  15.10,    28.250,   30.23,  125.42,   4.44
           64,     32.250,    9.80,  441.61,  13.69,     16.125,    0.00,     inf,    inf,     64.500,    9.91,  873.93,  13.55,    32.250,   34.48,  125.53,   3.89
           80,     40.250,   11.92,  453.21,  11.26,     20.125,    0.00,     inf,    inf,     80.500,   12.06,  895.56,  11.13,    40.250,   43.00,  125.64,   3.12
           96,     48.250,   14.07,  460.37,   9.54,     24.125,    0.00,     inf,    inf,     96.500,   14.16,  914.98,   9.48,    48.250,   51.59,  125.54,   2.60
          128,     64.250,   18.36,  469.77,   7.31,     32.125,    0.00,     inf,    inf,    128.500,   18.40,  937.59,   7.30,    64.250,   94.20,   91.55,   1.42
          192,     96.250,  108.46,  119.11,   1.24,     48.125,    0.00,     inf,    inf,    192.500,  114.73,  225.20,   1.17,    96.250,  140.42,   92.00,   0.96
          256,    128.250,  144.26,  119.32,   0.93,     64.125,    0.00,     inf,    inf,    256.500,  153.10,  224.86,   0.88,   128.250,  186.91,   92.09,   0.72
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
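
One way to read this table is as a roofline: at low flops/byte the kernel is limited by memory bandwidth, and at high flops/byte by compute. The sketch below takes a subset of rows from the single-precision columns and uses the best measured values as the two roofs. That choice is an assumption, since these are measured figures rather than vendor peaks, and the names `rows`, `ridge` and `roofline` are made up for this illustration.

```python
# (flops_per_byte, measured_gflops, measured_gb_per_s), copied from the
# single-precision columns of the mixbench table above (subset of rows).
rows = [
    (0.250, 4.38, 17.53),
    (4.250, 109.44, 25.75),
    (10.250, 266.06, 25.96),
    (16.250, 387.39, 23.84),
    (32.250, 441.61, 13.69),
    (64.250, 469.77, 7.31),
    (128.250, 119.32, 0.93),
]

# Assumption: the best measured values stand in for the hardware roofs.
peak_bw = max(r[2] for r in rows)        # GB/s
peak_gflops = max(r[1] for r in rows)    # GFLOPS
ridge = peak_gflops / peak_bw            # flops/byte where the two roofs meet


def roofline(intensity):
    """Attainable GFLOPS at a given arithmetic intensity under the simple model."""
    return min(peak_gflops, intensity * peak_bw)


print(f"ridge point: {ridge:.1f} flops/byte")
for intensity, gflops, _ in rows:
    bound = "memory" if intensity < ridge else "compute"
    print(f"{intensity:8.3f} flops/byte: measured {gflops:7.2f} GFLOPS, "
          f"roofline {roofline(intensity):7.2f} GFLOPS ({bound}-bound)")
```

The ridge sits at roughly 18 flops/byte. The model says nothing about the last rows of the table, where throughput falls back to about 119 GFLOPS; whatever causes that drop is outside this sketch.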

I tried to get smallptGPU to work, but it ended up needing X11 to run. I could install a dummy X11 on their demo system, but decided I'm not going that far on someone else's system that is provided for free. From the numbers above, one Mixtile Blade 3 16GB board is, GPU-wise, about 1/3 of an Nvidia Xavier board, at about 1/3 of the price but with much more connectivity.

I am very excited about the future: a next-generation PinePhone Pro with an RK3588, with up to 32GB RAM, 8 cores and a very nice integrated GPU. A future revision of the Mixtile board with DDR5 memory would be amazing. GPU-computing-wise, this board is interesting. Normally we expect integer throughput to be equal to or half of floating-point throughput (a 1:1 or 2:1 FP-to-integer ratio), but the Mali GPU on the RK3588 is about 4:1. This makes the board unsuitable for applications like crypto mining, hash cracking, and some scientific computation. To be fair, this is a mobile SoC, so these are not its intended uses.

My use case for the board

All in all, not too shabby. 470 GFLOPS on an embedded system? That's faster than my laptop's Intel HD 620 integrated GPU at 380 GFLOPS. I'd also assume 26GB/s is the memory bandwidth available to the GPU. This is definitely one of the nicest boards I've ever seen. I guess it's suitable for a hyper-converged ARM cluster. Besides a nice CPU and GPU, each node is equipped with a 6TOPS NPU for AI inference, a SATA port (needs a breakout cable) for storage, 2 2.5Gb Ethernet ports, and like I said above, 2 8G PCIe links that can act as direct network interfaces.
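
To put the interconnect in perspective, the raw line rates can be converted to bytes per second. This is a back-of-the-envelope sketch: it assumes the 32Gbps PCIe figure guessed earlier in this post is right, ignores encoding and protocol overhead, and `gbit_to_gbyte` is a hypothetical helper name.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a raw line rate in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8


pcie_gbyte = gbit_to_gbyte(32)            # assumed PCIe networking figure from this post
ethernet_gbyte = gbit_to_gbyte(2 * 2.5)   # both 2.5Gb Ethernet ports combined
gpu_mem_gbyte = 26                        # GPU memory bandwidth assumed in the text, GB/s

print(f"PCIe link      : {pcie_gbyte:.3f} GB/s")
print(f"2x 2.5Gb Ether : {ethernet_gbyte:.3f} GB/s")
print(f"PCIe vs Ether  : {pcie_gbyte / ethernet_gbyte:.1f}x")
print(f"GPU mem vs PCIe: {gpu_mem_gbyte / pcie_gbyte:.1f}x")
```

Even at the assumed rate, the PCIe link is several times slower than local GPU memory, so anything spread across nodes would still want to keep its working set local.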

I want one or more of these boards to add to my home lab. My current environment is a cluster of my HoneyComb LX2K and a Raspberry Pi. The RPi is more of a monitor in case I need direct access to the HoneyComb's BMC. These Blade 3s can add quite a lot of computing power. With the NPUs I think I can start adding BERT to my Gemini search engine. I could also run SALSA on the GPU, reducing some search time. Still, the upgrade is quite expensive for some gain, all while I'm neither out of storage space nor running into CPU limits.

I do want and need more nodes for my ongoing project using OpenDHT as a Web3 base system (man, I hate that term, I'd call it decentralized applications). With high-speed networking, it'll be easier to spot data races and scalability issues in development. I also need a backup plan in case my current server goes down for good. A Blade 3 with a SATA SSD should be up to the task.

If someone were to donate a pile of Blade 3s, I'd be very happy to run them as a cluster and start a business. IDK, some cheap serverless service that I can rent out, with AI inference via API calls and load balancing. Or some sponsored Web3 research.

Bonus: Mesa Clover OpenCL performance on ARM + AMD Polaris

I also tested the GPU performance on my server with the Mesa Clover driver. It's known to be a bit slow, but I'm surprised by how slow it is. 4GB/s VRAM bandwidth. Wow... I need to upgrade to ROCm instead of using Clover. (Oh, the GPU is an RX 560. It's the same as the RX 570, but with a different core configuration.)

❯ ./clpeak

Platform: Clover
  Device: AMD Radeon RX 570 Series (polaris10, LLVM 14.0.6, DRM 3.40, 5.10.35-00001-g107b6c90afff)
    Driver version  : 22.1.3 (Linux ARM64)
    Compute units   : 32
    Clock frequency : 1244 MHz

    Global memory bandwidth (GBPS)
      float   : 3.90
      float2  : 3.90
      float4  : 3.90
      float8  : 3.74
      float16 : 3.01

    Single-precision compute (GFLOPS)
      float   : 2517.36
      float2  : 2515.34
      float4  : 2511.42
      float8  : 2502.04
      float16 : 2492.74

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 316.87
      double2  : 316.49
      double4  : 316.32
      double8  : 315.44
      double16 : 314.53

    Integer compute (GIOPS)
      int   : 1010.15
      int2  : 1009.38
      int4  : 1007.81
      int8  : 1004.66
      int16 : 1007.57

    Integer compute Fast 24bit (GIOPS)
      int   : 4853.92
      int2  : 4568.43
      int4  : 4517.10
      int8  : 4538.71
      int16 : 4337.04

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 4.12
      enqueueReadBuffer               : 4.01
      enqueueWriteBuffer non-blocking : 4.14
      enqueueReadBuffer non-blocking  : 4.11
      enqueueMapBuffer(for read)      : 10275.04
        memcpy from mapped ptr        : 4.14
      enqueueUnmap(after write)       : 8411.19
        memcpy to mapped ptr          : 4.07

    Kernel launch latency : 183.67 us
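
How lopsided the Clover result is shows up when compute is divided by memory bandwidth. The sketch below uses the best single-precision and global memory figures from the two clpeak runs in this post; the dictionary layout and the `flops_per_byte` helper are made up for this illustration.

```python
# Figures copied from the two clpeak runs in this post.
devices = {
    "Mali (Blade 3)":   {"fp32_gflops": 470.16,  "mem_gbps": 27.19},
    "Polaris (Clover)": {"fp32_gflops": 2517.36, "mem_gbps": 3.90},
}


def flops_per_byte(dev):
    """FP32 operations the device can issue per byte of global memory traffic."""
    return dev["fp32_gflops"] / dev["mem_gbps"]


balance = {name: flops_per_byte(dev) for name, dev in devices.items()}
for name, value in balance.items():
    print(f"{name}: {value:.1f} FP32 flops per byte")
```

A kernel needs hundreds of FP32 operations per byte to keep the Polaris card busy under Clover, against roughly 17 on the Mali. It is also worth noting, without drawing a firm conclusion, that the 3.90 GB/s global memory figure is nearly identical to the ~4.1 GB/s host transfer figures in the same run.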

Martin Chang

Systems software, HPC, GPGPU and AI. I mostly write stupid C++ code. Sometimes does AI research. Chronic VRChat addict

I run TLGS, a major search engine on Gemini. Used by Buran by default.

martin \at clehaxze.tw
Matrix: @clehaxze:matrix.clehaxze.tw
Jami: a72b62ac04a958ca57739247aa1ed4fe0d11d2df