張安邦 / Martin Chang
 
Slides and pictures licensed under CC-BY 4.0 / GPLv3+. Pick one you like
Online view:
https://clehaxze.tw/slides/nsysu-sopc-2024
Offline download:
https://clehaxze.tw/slides/nsysu-sopc-2024.tar.zst
*Many limitations, really a convolution processor
The piper TTS system
A chain of many individual contributions
Someone did it - usefulsensors/useful-transformers
Now used by some for local transcription
10 GFLOPS when batch=1
???
Pain and suffering
from rknn.api import RKNN  # RKNN-Toolkit2: model conversion, on the host PC

rknn = RKNN()
rknn.config(target_platform="rk3588")
rknn.load_onnx("/path/to/model.onnx")
rknn.build(do_quantization=False)        # keep floating point, no INT8 quantization
rknn.export_rknn("/path/to/model.rknn")
from rknnlite.api import RKNNLite  # RKNN-Toolkit-Lite2: inference, on the board

rknn = RKNNLite()
rknn.load_rknn("/path/to/model.rknn")
rknn.init_runtime()                       # must be called before inference
out_data = rknn.inference(inputs=[input_data])
The average ML engineer doesn't know all of this.
Nor do pretrained models care.
void ggml_compute_forward_mul_mat(...) {
#elif defined(GGML_USE_RKNPU2)
    // Run matrix multiplication on NPU if possible
    if (ggml_rknpu2_can_mul_mat(src0, src1, dst)) {
        // fprintf(stderr, "rknpu2\n");
        if (params->ith == 0 && params->type == GGML_TASK_COMPUTE) {
            ggml_rknpu2_mul_mat(src0, src1, dst, params->wdata, params->wsize);
        }
        return;
    }
#endif
void load_all_data(...) {
#elif defined(GGML_USE_RKNPU2)
    case GGML_BACKEND_GPU:
        if (ggml_rknpu2_can_mul_mat_b(cur) == false) {
            break;
        }
        // Copy and reorder data for NPU
        ggml_rknpu2_transform_tensor(cur->data, cur);
        if (!use_mmap) {
            free(cur->data);
        }
        break;
#endif
void ggml_rknpu2_mul_mat(...) {
    struct ggml_rknpu2_matmul_kernel* kernel = ggml_rknpu2_matmul_kernel_find(m, k, n, pack->type);
    // GGML will switch batch size on the fly. So we need to create a new kernel if the batch size is different
    if (kernel == NULL)
        kernel = ggml_rknpu2_matmul_kernel_create(m, k, n, pack->type);
    ...
    int ret = rknn_matmul_set_io_mem(kernel->matmul_ctx, kernel->A, &kernel->matmul_io_attr.A);
    GGML_ASSERT(ret == 0);
    ret = rknn_matmul_set_io_mem(kernel->matmul_ctx, pack->B, &kernel->matmul_io_attr.B);
    GGML_ASSERT(ret == 0);
    ret = rknn_matmul_run(kernel->matmul_ctx);
    GGML_ASSERT(ret == 0);
    ...
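    // copy the NPU's result (matrix C) back into ggml's output tensor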
    memcpy(dst->data, kernel->C->virt_addr, m * n * sizeof(float));
:(
source: 黎明灰烬. CC-BY-SA 4.0
[N, N] attention matrix
source: arXiv:1506.02626
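For a sense of scale (example numbers, not from the slide): a 2048-token context means an attention score matrix of 2048 × 2048 ≈ 4.2 M entries per head, and it grows quadratically with context length.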
Tenstorrent. Apache 2.0
#define QK4_0 32
typedef struct {
    ggml_half d;            // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;
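A rough sketch (mine, not from ggml) of how one block_q4_0 decodes back into 32 floats: low nibbles fill the first half of the block, high nibbles the second, everything scaled by d.
import numpy as np

QK4_0 = 32  # weights per block

def dequantize_block_q4_0(d, qs):
    """d: fp16 scale ("delta"); qs: 16 bytes, each holding two 4-bit quants"""
    q = np.frombuffer(bytes(qs), dtype=np.uint8)
    lo = (q & 0x0F).astype(np.int8) - 8   # first 16 weights
    hi = (q >> 4).astype(np.int8) - 8     # last 16 weights
    return np.concatenate([lo, hi]).astype(np.float32) * np.float32(d)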
Chip companies aren't the only ones interested in AI
Not even first party. Just people who want faster, lower-power LLMs
You can find me in