2026-06-05

Programming the ET-SoC-1

I left Tenstorrent sometime in early January 2026 and moved to AINekko, who interestingly bought Esperanto AI's IP and made the stack (and RTL!) open source. It's a fun chip to work with. No more weird VLIW masquerading as RISC-V (see my previous post about Tenstorrent's real programming model, where you have to control 3 threads at the same time, which in practice makes the threads run in lockstep plus synchronization overhead). The ET-SoC-1 is different, with its own quirks and features. Let's go through them while drawing comparisons with Tenstorrent's Wormhole processor, because that's really the only publicly accessible reference point.

The ET-SoC-1

Like the Tenstorrent Wormhole, the ET-SoC-1 is based on RISC-V, uses custom extensions, uses a NoC, and is arranged into a grid. On the ET-SoC-1, each node on the grid is called a shire. There are different types of shires: compute, memory, PCIe, etc. For simplicity, "shire" without specifying the type is assumed to be a compute shire.

Image: The NoC level view of the ET-SoC-1. Shire = Compute Shire, DRAM = Memory Shire, PCIe = PCIe Shire, IO = IO Shire

In a sort of GPU-like fashion (well, this was designed to be a GPU first, see online Esperanto Technologies history), the ET-SoC-1 is built with a hierarchy of cores. Shires contain 4 neighborhoods. Each neighborhood then contains 8 "Minion" cores. Each Minion core is an RV64IMF core (with custom vector extensions) with 2 hardware threads referred to as "harts", and a matrix accelerator accessible to the 1st hart in a minion. It's a lot to take in:

Image: Each shire contains 4 neighborhoods and a shire-local L2 cache

To recap the Minion itself: each one is a 2-thread, in-order RV64IMF core with custom Esperanto vector extensions (256-bit wide). Yes, it's an in-order CPU with SMT, like the Knights Corner Xeon Phi. Unlike most commercial SMT implementations, the one in the Minion works more like what AMD did in the Bulldozer era: 2 physical register files sharing a single pipeline. Attached is a matrix engine with some matrix addressing capability, but only hart 0 has access to that engine. When the matrix engine is in use, hart 1 is practically idle or acts as a co-processor.

Image: Block diagram of the Minion core

A problematic fact: caches on the ET-SoC-1 are not coherent and whether hart 0 and 1 share the logical L1 partition is configurable. Attempts to parallelize across cores need to take care of cache boundaries. Otherwise threads will have partially updated cache lines and overwrite each other's newly computed data. We will see that in action later.

The custom (I won't call it proprietary, Nekko open sourced the RTL and thus the opcodes) vector extension is in my opinion both a blessing and a curse. Esperanto designed the chip pre-RVV 1.0 (or even pre-RVV 0.7). They had no choice but to make their own vector extension, which they did really well. The vector ISA is very elegant, coherent, and carries little architectural baggage for a GPU-inspired design. However, I have also heard a fair share of criticism that it should just do RVV in 2026, so that at least compiler auto-vectorization would be able to generate code for the processor. Which is fair, but beyond the fact that the chip is older than the standard, I am not sure that RVV is the right choice for a dedicated accelerator. There's the argument that code should just run. But variable-length vectors also have inherent overheads and you need tuning per chip, which defeats the purpose of running universal precompiled code... I am getting ahead of myself.

Comparing to the Tenstorrent Wormhole

Tenstorrent is the only other company publicly selling a RISC-V grid-of-cores processor, so it is natural to compare the ET-SoC-1 to it. If you are not familiar with Tenstorrent's architecture, please refer to my older post about the Tenstorrent Wormhole processor or the official Metalium guide documentation (they are the same if Tenstorrent hasn't replaced what I wrote).

Although the ET-SoC-1 is also a systolic-like processor, programming it is fundamentally different. There is no forced explicit DMA, no separation of data movement and compute kernels, and no magic synchronization between internal threads; there are real L1 (and L2) caches, and the list goes on and on. The ET-SoC-1 programs more like a Xeon Phi with incoherent caches.

The Esperanto SIMD extensions

Esperanto designed a pretty nice SIMD extension that is reminiscent of the SSE instructions. It uses an extended f0~f31 register file for vector storage and a mask register (there are several, but only m0 takes effect; the others are for temporary storage) that actually masks the side effects of vector operations, including the effects of otherwise invalid memory accesses.

Image: The 0th lane of the vector register is also the scalar register

The vector assembly is far easier to read compared to AVX and NEON. Closer to RVV. Take the following example code that implements the ReLU function:

fbc.ps  f10, %[z]          // bc = broadcast the value z (constant zero passed to GAS)
flw.ps  f11, %[x]          // load vector word from memory
fmax.ps f12, f10, f11      // max(0, x)
fsw.ps  f12, %[result]     // store vector word to memory

All vector instructions are prefixed with f (unsure why they didn't choose v instead, maybe they initially did not assume the f RISC-V extension and imagined all operations would be vector operations), followed by the operation name and suffix. ps indicates packed single (aka per-lane single precision). Likewise there's pi for packed integer. Of course there's also a selection of horizontal operations to choose from, otherwise reduction becomes a pain and a large set of neural net operations become very tedious and slow to implement using scalar operations. The following is the standard pattern to reduce a single vector register into a single scalar value using horizontal operations and abusing the fact that the 0th element of a vector register is the scalar value:

// Pairwise sum within each 128-bit half
fswizz.ps f1, f10, 0xB1         // Swaps: e0<->e1 and e2<->e3
fadd.ps   f2, f10, f1

// Complete the sum for each 128-bit half
fswizz.ps f3, f2, 0x4E          // Swaps: e0,e1 <-> e2,e3
fadd.ps   f4, f2, f3

// Sum across the two 128b halves
fmvz.x.ps t0, f4, 4
fbcx.ps   f5, t0


// somewhere in C
// float vout;
fadd.ps   %[vout], f4, f5      // vout is a scalar, but since 0th
                               // element is the scalar value, we can
                               // use it directly
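
To make the swizzle immediates less magical, here is a minimal scalar C model of the reduction sequence above. It is a sketch under assumptions, not the PRM's definition: it assumes the 8-bit fswizz.ps immediate picks a source lane for each of the 4 lanes of a 128-bit half (2 bits per lane, lowest bits first) and applies the same pattern to both halves, which matches the "swaps" comments in the assembly. The names model_fswizz, model_fadd and model_hsum are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* 256-bit vector register = 8 FP32 lanes */

/* Assumed fswizz.ps semantics: for each 128-bit half, lane i of the result
 * takes the source lane selected by bits [2i+1:2i] of the immediate. */
static void model_fswizz(float dst[LANES], const float src[LANES], unsigned imm)
{
    float tmp[LANES];
    for (size_t half = 0; half < 2; half++) {
        for (size_t lane = 0; lane < 4; lane++) {
            size_t sel = (imm >> (2 * lane)) & 0x3u;
            tmp[half * 4 + lane] = src[half * 4 + sel];
        }
    }
    for (size_t i = 0; i < LANES; i++) {
        dst[i] = tmp[i];
    }
}

/* Lane-wise add, standing in for fadd.ps with all lanes enabled. */
static void model_fadd(float dst[LANES], const float a[LANES], const float b[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* Horizontal sum of 8 lanes, following the instruction sequence step by step. */
float model_hsum(const float v[LANES])
{
    float f1[LANES], f2[LANES], f3[LANES], f4[LANES];
    assert(v != NULL);

    model_fswizz(f1, v, 0xB1);   /* swap e0<->e1 and e2<->e3 in each half */
    model_fadd(f2, v, f1);       /* pairwise sums */
    model_fswizz(f3, f2, 0x4E);  /* swap e0,e1 <-> e2,e3 in each half */
    model_fadd(f4, f2, f3);      /* every lane now holds its half's total */

    /* fmvz.x.ps + fbcx.ps move lane 4 across; only lane 0 of the final
     * fadd.ps matters because lane 0 doubles as the scalar register. */
    return f4[0] + f4[4];
}
```

Feeding it the lanes 1 through 8 returns 36, the same value the assembly leaves in the scalar view of the destination register if the assumed immediate encoding is right.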

Please refer to the PRM for more details on the vector instruction set.

Hello world

The host API of the platform looks very much like OpenCL, so much so that I have good reason to believe it was designed to mimic OpenCL. The initialization logic is practically the same, with one important addition: support for an emulated chip. You don't need a physical chip to create programs for the platform, though the emulator is very slow.

// Emulation
#ifdef EMULATION
sysEmuOptions.executablePath = fs::path(SYSEMU_INSTALL_DIR) / "sys_emu";
sysEmuOptions.runDir = current_path;
sysEmuOptions.maxCycles = std::numeric_limits<uint64_t>::max();
sysEmuOptions.minionShiresMask = 0x1FFFFFFFFu;
sysEmuOptions.puUart0Path = current_path / "pu_uart0_tx.log";
sysEmuOptions.puUart1Path = current_path / "pu_uart1_tx.log";
sysEmuOptions.spUart0Path = current_path / "spio_uart0_tx.log";
sysEmuOptions.spUart1Path = current_path / "spio_uart1_tx.log";
sysEmuOptions.startGdb = false;

// Run on simulated device!
std::shared_ptr<dev::IDeviceLayer> deviceLayer =
    dev::IDeviceLayer::createSysEmuDeviceLayer(sysEmuOptions, 1);
#else
// Run on real device
std::shared_ptr<dev::IDeviceLayer> deviceLayer =
    dev::IDeviceLayer::createPcieDeviceLayer();
#endif


auto runtime = rt::IRuntime::create(deviceLayer);
auto devices = runtime->getDevices();
if (devices.empty()) {
    std::cerr << "No devices found.\n";
    return 1;
}
auto device = devices[0];
auto stream = runtime->createStream(device);

After setup, it works approximately like OpenCL. The API doesn't support online compilation; you have to compile the kernels ahead of time.

auto elfData = readFile(kernelPath);
if (elfData.empty()) {
    std::cerr << "Failed to read kernel ELF.\n";
    return 1;
}
auto loadResult = runtime->loadCode(stream, elfData.data(), elfData.size());
runtime->waitForEvent(loadResult.event_);
auto kernel = loadResult.kernel_;

Unlike OpenCL, which abstracts the workload into work groups and work sizes, the ET platform just asks which shires you want to be active during the kernel run, controlled by a single bitmask. All neighborhoods (and by extension, all minions) within an active shire are then active.

rt::KernelLaunchOptions launchOpts;
launchOpts.setShireMask(0x1); // <- ET-SoC-1 contains 32 compute shires. This mask controls which shires are active.

// Enable debug print for `et_printf`. The API allows you to control which shire and minion in the shire
// has printing enabled. For now, this says "I want to enable printing for all threads in shire 0"
launchOpts.setUserTracing(
    reinterpret_cast<uint64_t>(traceDevBuf),
    static_cast<uint32_t>(kTraceBufferSize),
    0,                              // threshold
    0x1,                            // trace shireMask
    0xFFFFFFFFFFFFFFFFULL,          // threadMask - all threads
    0xFFFFFFFFU,                    // eventMask - all events
    0xFFFFFFFFU                     // filterMask - all levels
);

// Launch the kernel
// Unlike OpenCL, again, parameters are a binary blob on ET (though we are not using it right now)
std::vector<std::byte> kernelArgs(64);
auto launchEvent = runtime->kernelLaunch(
    stream, kernel, kernelArgs.data(), kernelArgs.size(), launchOpts);
runtime->waitForStream(stream);

And the kernel that'll be running on the device:

int64_t entry_point(void)
{
    et_printf("Hello World from hart %d\n", get_hart_id());
    return 0;
}

Later in the host code, we grab the trace buffer and print out its string entries:

auto* traceHeader = reinterpret_cast<const trace_buffer_std_header_t*>(hostTraceBuf.data());
const trace_entry_header_t* entry = nullptr;
int count = 0;

while ((entry = Trace_Decode(traceHeader, entry))) {
    if (entry->type != TRACE_TYPE_STRING) {
        continue;
    }
    auto* strEntry = reinterpret_cast<const trace_string_t*>(entry);
    std::cout << "[hart " << entry->hart_id << "] " << strEntry->string << "\n";
    ++count;
}

And that shows:

[hart 0] Hello World from hart 0
[hart 1] Hello World from hart 1
[hart 2] Hello World from hart 2
[hart 3] Hello World from hart 3
[hart 4] Hello World from hart 4
[hart 5] Hello World from hart 5
...

You can find the source code here.

Vector addition

Printing is nice, but how do we do math? Recall from the introduction that the caches are not coherent, so you need to take care of how the write set is partitioned and when writes become visible. For vector addition this is trivial: make sure, on either the host or device side, that each hart's chunk starts and ends on a 64-byte cache line boundary. The simplest kernel looks like the following:

#define CACHELINE_SIZE 64
int64_t entry_point(KernelParameters* params, void* env)
{
    int64_t threadId = get_hart_id();
    int64_t workChunkSize = params->size / params->numThreads;
    int64_t baseIdx = workChunkSize * threadId;

    if(workChunkSize % (CACHELINE_SIZE / sizeof(int)) != 0) {
        return -1; // indicate to the host that execution failed
    }

    int* a = params->a;
    int* b = params->b;
    int* c = params->c;

    for(int64_t i=0; i<workChunkSize;i++) {
        c[i+baseIdx] = a[i+baseIdx] + b[i+baseIdx];
    }
    return 0;
}
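
The kernel above bails out when a hart's chunk is not a whole number of cache lines, so the host has to pick a compatible problem size. A minimal sketch of how that could be done is below. The helper name padded_element_count is hypothetical and not part of the ET runtime; it also assumes the buffers themselves start on a 64-byte boundary and that the host fills the padding with harmless values.

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE_SIZE 64

/* Round n up to the smallest element count for which n / num_threads is a
 * whole number of cache lines, so no two harts ever write the same line. */
int64_t padded_element_count(int64_t n, int64_t num_threads, int64_t elem_size)
{
    assert(n >= 0 && num_threads > 0);
    assert(elem_size > 0 && CACHELINE_SIZE % elem_size == 0);

    int64_t elems_per_line = CACHELINE_SIZE / elem_size;  /* 16 for 4-byte ints */
    int64_t granule = elems_per_line * num_threads;       /* one line per hart */
    return (n + granule - 1) / granule * granule;
}
```

For example, 1000 ints split across 64 harts get padded to 1024 elements, which gives each hart exactly one 64-byte line of output.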

Scalar code is slow. So let's also look at a vectorized version. The same cache coherency requirements apply, but on top of that you need to manually vectorize the loop and invoke SIMD instructions to perform the addition. Note the use of flq2 instead of flw.ps here. It makes no difference here but flq2 is not affected by the m0 masking mechanism while flw.ps is.

int64_t entry_point(KernelParameters* params, void* env)
{
    int64_t threadId = get_hart_id();
    int64_t workChunkSize = params->size / params->numThreads;
    int64_t baseIdx = workChunkSize * threadId;
    int* a = params->a;
    int* b = params->b;
    int* c = params->c;

    if(workChunkSize % (CACHELINE_SIZE / sizeof(int)) != 0) {
        return -1; // indicate to the host that execution failed
    }

    // Set the mask register m0 to 0xFF (all 8 lanes active)
    // MOV.M.X moves an X register ORed with an 8-bit immediate into an M register.
    asm volatile ("mov.m.x m0, zero, 0xFF");

    // The vectorized version of the loop. It only works if the chunk size is a
    // multiple of 8 lanes. The cache line check above already requires a multiple
    // of 16 ints, so that always holds here.
    for (int64_t i=baseIdx; i <= baseIdx+workChunkSize - 8; i += 8) {
        asm volatile (
            "flq2    f0, 0(%0)\n\t"     // Unmasked 256-bit load from array a
            "flq2    f1, 0(%1)\n\t"     // Unmasked 256-bit load from array b
            "fadd.pi f2, f0, f1\n\t"    // Packed Integer Add (under control of m0)
            "fsq2    f2, 0(%2)\n\t"     // Unmasked 256-bit store to array c
            : // No output operands
            : "r"(a + i), "r"(b + i), "r"(c + i)
            : "f0", "f1", "f2", "memory"
        );
    }
    return 0;
}

You can find the source code in the following link:

Interesting hardware features

That's the basic intro to the hardware. Now, what else can the hardware do, beyond being a non-coherent, many-core RISC-V processor with 1024 cores (2048 harts) and a non-standard vector extension? The hardware itself supports quite a few tricks and enables others that are more... questionable. Not to say that the directly supported features are not interesting.

(Virtual) L3 cache

I still remember the shock I had when IBM announced the Telum chip for their z16 mainframe. What do you mean you can use another core's L2 as a chip-wide shared L3? Sure you can do that... in theory... but is that a good idea? Turns out the ET-SoC-1 supports something similar. Each shire comes with 4MB of shared L2 banks and, by default, 2.5MB of that is partitioned as L2SCP (which we will discuss later), 0.5MB as the shire-local L2 cache, and 1MB as the chip-wide shared L3.

Granted, the ET-SoC-1 has a static L3 instead of IBM's crazy dynamic victim cache. Anyway, that L3 is very useful and acts as a shock absorber before things go into DRAM, which is a very nice design choice by Esperanto. But it is also a bit of a source of headaches. A lot of the time, when the PRM says "bypasses the cache hierarchy" it means it bypasses the L1 and L2, but the L3 is still definitely in the path. Also note that having an L3 and the fact that caches are not coherent can mess things up quite badly. In particular, if you intend to do anything that goes directly to DRAM, you must be sure that there's no copy of the same cache line in the L3.

Classic non-coherent cache problems.

The matrix engine

There are actually 2 parts to the matrix engine. One is the part that does the actual matrix multiplication, and the other is the part that moves your data into the former. The PRM calls the latter the tensor load command, but I usually call it the 2D addressing and tile load.

One of the problems Tenstorrent has, despite some really great progress I heard from {sources}, is that a lot of times, for these ASICs to work well, the data must be laid out just right in memory. Tenstorrent has their 32x32 grid of values in memory and demands almost all data be laid out in such a grid before use. That is a problem. Most inference frameworks, like GGML, ONNXRuntime, TinyGrad, you name it, assume you are using a GPU and will give you row-major order data and expect performance consequences from that fact - views are free, reshape can be free, etc., which is almost always false when you tile your layout.

ET-SoC-1 provides an elegant solution for that. Instead of bending the software contract or adding massive hardware that absorbs the streamed data for reuse, just make the DMA respect the 2D tiles. When loading a tile, instead of "here's the pointer, load 2KiB of data in", you tell the hardware the pointer, the stride between rows, and how many rows to load from (as the tile width is usually fixed, for compute efficiency reasons). This way, you can set the stride to the matrix width and the number of rows to load to the tile height, issue the DMA, boom, you have your tile loaded.

Image: How the matrix engine loads tiles
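
The pointer + stride + row count idea can be modeled in a few lines of plain C. This is only a software sketch of the addressing scheme described above, not the hardware interface: model_tile_load and its parameter names are hypothetical, and the real tensor load takes its arguments in a different form and fills L1 rather than an arbitrary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Gather `rows` rows of `row_bytes` bytes each from a row-major matrix into
 * a densely packed tile. `src` points at the tile's top-left element and
 * `stride_bytes` is the distance between consecutive matrix rows. */
void model_tile_load(void *tile, const void *src,
                     size_t stride_bytes, size_t rows, size_t row_bytes)
{
    assert(tile != NULL && src != NULL);
    assert(stride_bytes >= row_bytes);  /* rows must not overlap */

    uint8_t *dst = tile;
    const uint8_t *p = src;
    for (size_t r = 0; r < rows; r++) {
        memcpy(dst + r * row_bytes, p + r * stride_bytes, row_bytes);
    }
}
```

To pull a 2x3 tile out of a 4x6 float matrix starting at row 1, column 2, you would pass &m[1][2] as the pointer, 6 * sizeof(float) as the stride, 2 rows, and 3 * sizeof(float) per row. The matrix stays in plain row-major order the whole time, which is the point.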

Matrix load supports transformations inline with the load: interleaving (swapping odd and even elements), transposition, or none. There is also a TensorQuant subsystem that gives you a semi-programmable matrix dequantization from int8 to FP32. But that works more like how traditional CNN quantization works instead of the more recent blocked floating point or OCP MXFP* formats.

The matrix engine itself is decent. Proper FP32 support with weird but usable FP16 support. The funny and interesting thing is where the matrix compute stores everything. For what I can only assume to be space efficiency reasons, there isn't really a matrix register to hold your input/output matrices. Instead, part of your L1 cache is taken to hold the input matrices, and the output is stored in your regular f0~f31 vector registers.

This leaves me feeling pulled both ways. Downsides, immediately:

  • Reduced L1 cache capacity
  • You cannot carry any floating-point state across matrix math, without compiler support
  • You either clobber the entire FP register set (slow) OR break the C abstract machine (breaks in weird ways)
  • Storing and resuming partial results is slow (32 vector store + 32 vector load)
  • Tiling (beyond hardware tiles) is hard to achieve because output and input do not live in the same address space

Some upsides, with caveats:

  • You can easily manipulate the matrix multiplication result... just use vector operations
  • Caveat: every register is used, so you need stack space to free some up for temporary values
  • Caveat: you have to code all FP operations, even scalar ones, in assembly, because you need to know exactly which registers are used and save + restore them

L2SCP

The scratchpad area is huge by default: 2MB per shire. L2SCP acts more or less like a GPU's local memory, which is shared by a group of cores (here, all of the harts in a shire). There is also an optional global mode that maps the local L2SCP from all shires to the same addresses across all shires, which is just a fucking good idea.

Hart 0 and 1

As said above, the Minion core packs 2 threads into a single physical core. The 2 threads are referred to as harts: hart 0 and hart 1. The Minion is a single-issue, in-order core with 2 physical register files (one for each hart). The harts are mostly symmetrical with the exception that only hart 0 has access to the matrix engine. As such, most kernels you write will use both hart 0 and 1 to achieve maximum performance. Only when the matrix engine is needed will you mostly use hart 0 exclusively, with hart 1 either idle or doing on-the-fly repacking of matrix data (say, the data layout is not what the matrix engine supports). Obviously making hart 1 idle leaves more issuing opportunities for hart 0. This is sometimes a balancing act.

Confusingly, global or neighborhood-local hart IDs are also used during kernel programming. Hart 0 may mean the 0th hart in the neighborhood, or the 0th hart across the chip... you can usually figure that out by context. But warning where warning is due.
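
One way to keep the two meanings apart is to always decode a global hart ID into its hierarchy explicitly. The sketch below assumes, and this is an assumption rather than something taken from the PRM, that global IDs are assigned contiguously with the hart as the fastest-varying index, then minion, then neighborhood, then shire. The type and function names are made up for illustration, so check the real numbering before relying on it.

```c
#include <assert.h>
#include <stdint.h>

enum {
    HARTS_PER_MINION         = 2,
    MINIONS_PER_NEIGHBORHOOD = 8,
    NEIGHBORHOODS_PER_SHIRE  = 4
};

typedef struct {
    int64_t shire;         /* compute shire index */
    int64_t neighborhood;  /* 0..3 within the shire */
    int64_t minion;        /* 0..7 within the neighborhood */
    int64_t hart;          /* 0 or 1 within the minion */
} hart_location_t;

/* Decode a global hart ID under the ASSUMED contiguous numbering. */
hart_location_t decode_global_hart_id(int64_t id)
{
    hart_location_t loc;
    assert(id >= 0);

    loc.hart = id % HARTS_PER_MINION;
    id /= HARTS_PER_MINION;
    loc.minion = id % MINIONS_PER_NEIGHBORHOOD;
    id /= MINIONS_PER_NEIGHBORHOOD;
    loc.neighborhood = id % NEIGHBORHOODS_PER_SHIRE;
    id /= NEIGHBORHOODS_PER_SHIRE;
    loc.shire = id;
    return loc;
}
```

Under that assumption each shire owns 64 consecutive IDs, and an odd ID always lands on hart 1 of a minion, the one without matrix engine access.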

The FlashAttention kernel

Fast FlashAttention on the ET-SoC-1 is hard, but it perfectly demonstrates how the chip can be used and how elegantly Esperanto designed the thing to handle unforeseen needs. Bear with me through the complexity and you'll see how far the chip can go. The keyword being "can": as optimizations go, doing no work is better than doing some work fast. But GGML simply asks for something that the hardware can't do directly. In practice, for LLM inference, GGML wants FP16 operations, with FLASH_ATTN_EXT accepting and producing the following input and output shapes:

Tensor   ne0 (contig)   ne1       ne2         ne3    Comment
Q        n_embd_k       n_batch   n_head      ne3
K        n_embd_k       n_kv      n_head_kv   ne3
V        n_embd_v       n_kv      n_head_kv   ne3    NOT transposed, unlike Nvidia's
mask     n_kv           n_batch   ne32        ne33
result   n_embd_v       n_head    n_batch     ne3    Permuted vs Q

To recap how FlashAttention works, and because the algorithm is so complicated that even I have trouble keeping it in my head, the following pseudo-code demonstrates a very high level overview of the algorithm.

for (int h  = 0; h  < n_head;  h++) {                 // each head is independent
  for (int q0 = 0; q0 < n_batch; q0 += BR) {          // tile of BR query rows

    tensor_load(Q_tile, Q[:, q0:q0+BR, h]);           // load once, reuse below

    m = -inf;                                         // running row-max  [BR]
    l = 0;                                            // running row-sum  [BR]
    O = 0;                                            // running output   [BR, D_v]

    // stream KV in tiles
    for (int k0 = 0; k0 < n_kv; k0 += BC) {

      tensor_load(K_tile, K[:, k0:k0+BC, h]);
      tensor_load(V_tile, V[:, k0:k0+BC, h]);

      // 1st reduction: S = Q · K^T
      tensor_fma(S, Q_tile, K_tile);                  // S[BR, BC]
      S *= scale;

      // online softmax update (per query row)
      m_new = max(m, rowmax(S));
      alpha = exp(m - m_new);                         // rescale old state
      P     = exp(S - m_new);                         // probs for this tile
      l     = alpha * l + rowsum(P);

      // 2nd reduction: O += P · V
      O = alpha * O;
      tensor_fma(O, P, V_tile);                       // O[BR, D_v] += P · V

      m = m_new;
    }

    // finalize and write
    O = O / l;
    tensor_store(result[:, h, q0:q0+BR], O);          // permuted layout
  }
}

It looks easy in concept. In reality, it is a different problem.

First, even though the matrix engine supports FP32, FP16, and INT8 matrix operations, GGML wants FP16 because that's what Nvidia's Tensor Cores use. As a hardware quirk, the FP16 matrix multiplication in the ET-SoC-1 needs the B matrix to be interleaved. Looking at the input shapes, Q @ K works in the same fashion as GGML's MUL_MAT operation and K is pre-transposed. However, the matrix engine has no load operation that both interleaves and transposes.

Image: The table in PRM showing supported tensor load transformations

Next, in the pseudo-code S is an intermediate matrix produced by the matrix engine. On chip it lives in the f0~f31 registers and is used later for the 2nd tensor_fma. Yet there is a slew of online softmax calculations that also need the same floating-point registers for vector operations (unless you are willing to use softfp, but that's way too slow for running an LLM). And finally, ideally we would tensor store O back to memory, but the rescaling and normalization of O are element-wise, and the matrix engine only does matrix multiplication, without any element-wise capabilities.

For all the readers who understand how HPC usually goes, the solution is both obvious and insane at the same time. Given that hart 1 does not have access to the matrix engine and would be idle during FlashAttention anyway, the solution to needing transpose + interleave is to run hart 1, create a semaphore for signaling, use hart 1's vector operations to load and interleave the sub-matrix, write to L2SCP, and signal hart 0 to load and transpose into the matrix engine. That function looks like the following:

// interleave a 16x16 sub-matrix in K and store to an output
// pointer (on the L2SCP) so hart 0 can load and transpose it
void pack_k_for_transpose16(et_fp16_t * out,
                       const char * k_base,
                       int64_t kv_start,
                       int64_t dk_start,
                       int64_t kv_count,
                       int64_t nb1_k)
{
    unsigned long old_mask;
    __asm__ volatile(
        "mova.x.m  %[ms]            \n\t"
        "mov.m.x   m0, x0, 0xFF     \n\t"
        : [ms] "=&r"(old_mask) ::);

    for (int j = 0; j < (int)kv_count; ++j) {
        const et_fp16_t * k_row =
            (const et_fp16_t *)(k_base + (kv_start + j) * nb1_k) + dk_start;
        et_fp16_t * even_row = out + (j * 2)     * 32;
        et_fp16_t * odd_row  = out + (j * 2 + 1) * 32;
        __asm__ volatile(
            "flw.ps    f2, 0(%[src0])  \n\t"   // load row[0..15]
            "flw.ps    f3, 0(%[src1])  \n\t"   // load row[16..31]
            "fpackreph.pi f4, f2       \n\t"   // even_lo from src0
            "fpackreph.pi f6, f3       \n\t"   // even_lo from src1 (interleaved)
            "fsrli.pi  f5, f2, 16      \n\t"   // shift src0 for odd
            "fsrli.pi  f7, f3, 16      \n\t"   // shift src1 for odd (interleaved)
            "fpackreph.pi f5, f5       \n\t"   // odd from src0
            "fpackreph.pi f7, f7       \n\t"   // odd from src1
            "mov.m.x   m0, x0, 0x0F   \n\t"
            "fcmovm.ps f4, f4, f6      \n\t"   // merge even halves
            "fcmovm.ps f5, f5, f7      \n\t"   // merge odd halves
            "mov.m.x   m0, x0, 0xFF   \n\t"
            "fsw.ps    f4, 0(%[even])  \n\t"
            "fsw.ps    f5, 0(%[odd])   \n\t"
            :
            : [src0] "r"(k_row),
              [src1] "r"(k_row + 16),
              [even] "r"(even_row),
              [odd] "r"(odd_row)
            : "f2", "f3", "f4", "f5", "f6", "f7", "memory"
        );
    }

    __asm__ volatile(
        "mova.m.x  %[ms]            \n\t"
        :: [ms] "r"(old_mask)
    );

    for (int j = (int)kv_count; j < TILE_KV; ++j) {
        et_fp16_t * even_row = out + (j * 2)     * 32;
        et_fp16_t * odd_row  = out + (j * 2 + 1) * 32;
        for (int l = 0; l < TILE_K / 2; ++l) {
            even_row[l] = 0;
            odd_row[l]  = 0;
        }
    }
}


// To invoke it in hart 1
for (int64_t dk_chunk = 0; dk_chunk < dk; dk_chunk += TILE_K) {
    int buf = chunk_id & 1;

    // Back-pressure: before overwriting buf[buf] on chunk N
    // (which will displace chunk N-2), wait for hart 0 to
    // post that it's done with chunk N-2. Gates both
    // directions of double-buffering.
    if (chunk_id >= 2) {
        et_sem_wait(ET_BARRIER_MINION);
    }

    // Prefetch K data for this chunk
    prefetch_kv_to_l2(k_head, kv_base, dk_chunk, kv_count, k->nb[1]);

    pack_k_for_transpose16(scp_kp[buf], k_head, kv_base, dk_chunk,
                           kv_count, k->nb[1]);

    FENCE;
    // Flush all writes to L2SCP so hart 0 can see the packed K data
    // when it issues the load
    flush_to_l2(scp_kp[buf], 16, 64);
    flush_to_l2((et_fp16_t *)((char *)scp_kp[buf] + 1024), 16, 64);
    WAIT_CACHEOPS;

    // Signal: this buf is ready for hart 0 to consume.
    et_sem_post(ET_BARRIER_MINION);

    chunk_id++;
}

Then hart 0 can use what hart 1 produces.

... // setup so we can pipeline the load
for (int64_t i = 1; i < n_dk_chunks; i++) {
    int buf            = chunk_id & 1;
    int k_slot_prev    = (int)((i - 1) & 1);
    int k_slot         = (int)(i & 1);

    et_sem_wait(ET_BARRIER_MINION);
    tensor_load(
        false, false, K_BUFS[k_slot], TENSOR_LOAD_TRANSPOSE16, 0,
        (uint64_t)scp_kp[buf], 0, 15, 64, 1);

    tensor_fma(
        (kv_count < TILE_KV), 3, 0, 15, 0,
        false, false, false, false,
        K_BUFS[k_slot_prev], (uint64_t)(i - 1),
        TENSOR_FMA_OP_FP16, (i == 1));

    tensor_wait(TENSOR_LOAD_WAIT_1);   // K[i] in L1
    et_sem_post(ET_BARRIER_MINION);    // release scp_kp[buf] EARLY
    tensor_wait(TENSOR_FMA_WAIT);      // then wait FMA[i-1]
    chunk_id++;
}
... // and some tail handling

Since tensor_fma writes to the entire floating-point register file, we must manually clobber the registers to stop the compiler from carrying floating-point state across tensor_fma calls, even though it looks like a regular function call. For decode, we use f0 and f1, as those hold the output (first row, since batch=1). We then extract the working data so we can run the softmax statistics update against it.

__asm__ volatile("" ::: "f0", "f1");

// Extract QK^T scores from vector register file
unsigned long _ms;
__asm__ volatile(
    "mova.x.m  %[ms]                \n\t"
    "mov.m.x   m0, x0, 0xFF         \n\t"
    "fbc.ps    f2, 0(%[p_scale])    \n\t"
    "fmul.ps   f0, f0, f2           \n\t"
    "fmul.ps   f1, f1, f2           \n\t"
    "fsw.ps    f0, 0(%[dst])        \n\t"
    "fsw.ps    f1, 32(%[dst])       \n\t"
    "mova.m.x  %[ms]                \n\t"
    : [ms] "=&r"(_ms)
    : [dst] "r"(scores), [p_scale] "r"(&scale)
    : "f0", "f1", "f2", "memory"
);

Once the raw QK^T scores are extracted, we need to compute the online softmax. You'd think we could just run exp() and sum them up, but the reduction in scalar would be slow. Instead, we pipeline the exponentiation in vector assembly, interleaving calculations for both halves of the row across independent registers (like f2 and f3) while keeping track of the running maximum M and denominator S:

const float log2e = 1.4426950408889634f;
float S_tile;
unsigned long _ms;
__asm__ volatile(
    "mova.x.m  %[ms]              \n\t"
    "mov.m.x   m0, x0, 0xFF       \n\t"
    "flw.ps    f2, 0(%[sc])       \n\t" // Load first 8 scores
    "fbc.ps    f4, 0(%[pM])       \n\t" // Broadcast max M
    "flw.ps    f3, 32(%[sc])      \n\t" // Load next 8 scores
    "fbc.ps    f5, 0(%[pL])       \n\t" // Broadcast log2e
    "fsub.ps   f2, f2, f4         \n\t" // score - M
    "fsub.ps   f3, f3, f4         \n\t"
    "fmul.ps   f2, f2, f5         \n\t" // (score - M) * log2e
    "fmul.ps   f3, f3, f5         \n\t"
    "fexp.ps   f2, f2             \n\t" // exp2 (pipelined!)
    "fexp.ps   f3, f3             \n\t"
    "fsw.ps    f2, 0(%[wt])       \n\t" // Save weights
    "fsw.ps    f3, 32(%[wt])      \n\t"
    "fadd.ps   f2, f2, f3, rne    \n\t"
    "fswizz.ps f3, f2, 0xB1       \n\t"
    "fadd.ps   f2, f2, f3, rne    \n\t"
    "fswizz.ps f3, f2, 0x4E       \n\t"
    "fadd.ps   f2, f2, f3, rne    \n\t"
    "fmvz.x.ps t0, f2, 4          \n\t"
    "fbcx.ps   f3, t0             \n\t"
    "fadd.ps   %[st], f2, f3, rne \n\t" // Sum across 16 elements
    "mova.m.x  %[ms]              \n\t"
    : [ms] "=&r"(_ms), [st] "=f"(S_tile)
    : [pM] "r"(&M), [pL] "r"(&log2e),
      [sc] "r"(scores), [wt] "r"(weights)
    : "f2", "f3", "f4", "f5", "t0", "memory"
);

After updating our softmax stats, we convert these weights back to FP16 and multiply them with V. Unlike QK^T, V is not transposed in memory (row-major, size n_kv x n_embd_v). Since V is not transposed, we can load full tiles directly from DRAM using TENSOR_LOAD_INTERLEAVE16 to match the matrix engine's FMA requirements. We double-buffer these loads to hide the DRAM latency behind the math. But wait - what if we hit a partial tile near the end of the sequence? The hardware interleaving load will read past the tensor bounds and fetch garbage. So we have to write a software fallback (pack_v_interleaved) to manually format partial tiles on hart 0.

Just when you think you have everything working, you hit the decode utilization headache. During LLM generation, the batch size is usually 1. This means the total number of active rows is just the number of attention heads (usually 32). If you distribute one row per minion core round-robin, only 32 of the 1024 minions on the chip are running. The other 992 minions are sitting cold, which drops your utilization to a painful 3%. To keep the chip hot, we implement Split-KV, grouping minions in the same shire into teams of size k_splits that cooperate on a single row by dividing the KV cache dimension. Each minion computes a partial running max M_p, running sum S_p, and local accumulator vector acc_p in L2SCP.

But this is where you hit incoherent caches. Since the L1D caches are not coherent, if peer minions write their partial stats and accumulators to L2SCP, how does the team reducer (minion 0) read them? If it just accesses the memory, it will read stale garbage from its own L1D cache. We make the minions write and flush their data to L2SCP using cache-bypass or explicit flushes, and then hit a shire-local execution barrier (et_barrier(ET_BARRIER_SHIRE)). Before the reducer can read a peer's stats, it must explicitly evict its own L1D copy of the peer's addresses (evict_to_l2), forcing the next read to fetch the fresh values from the shared L2SCP. It then rescales its own accumulator and adds the peer's accumulator using a custom vector assembly merge loop:

static inline void __attribute__((always_inline))
merge_rescale_add_asm(float * acc,
                      const float * peer_acc,
                      int64_t dv,
                      float alpha_own,
                      float alpha_peer) {
    unsigned long old_mask;
    __asm__ volatile(
        "mova.x.m  %[ms]              \n\t"
        "mov.m.x   m0, x0, 0xFF       \n\t"
        "fbc.ps    f4, 0(%[ao])       \n\t" // Broadcast own rescale factor
        "fbc.ps    f5, 0(%[ap])       \n\t" // Broadcast peer rescale factor
        : [ms] "=&r"(old_mask)
        : [ao] "r"(&alpha_own), [ap] "r"(&alpha_peer)
        : "f4", "f5"
    );

    for (int64_t d = 0; d < dv; d += 8) {
        __asm__ volatile(
            "flw.ps    f2, 0(%[a])      \n\t" // Load own accumulator
            "flw.ps    f3, 0(%[p])      \n\t" // Load peer accumulator
            "fmul.ps   f2, f2, f4       \n\t" // Own *= alpha_own
            "fmul.ps   f3, f3, f5       \n\t" // Peer *= alpha_peer
            "fadd.ps   f2, f2, f3       \n\t" // Add
            "fsw.ps    f2, 0(%[a])      \n\t" // Save back to L2SCP
            :
            : [a] "r"(acc + d), [p] "r"(peer_acc + d)
            : "f2", "f3", "memory"
        );
    }
    __asm__ volatile("mova.m.x %0" :: "r"(old_mask));
}

A second shire barrier ensures the other minions don't overwrite their L2SCP workspaces while the reducer is still reading. On completion, the reducer normalizes the accumulator by multiplying by the final inverted sum (1/S) and writes the finished row back to DRAM. The high level flow looks as follows:

Image: High level flow chart of FlashAttention on the ET-SoC-1

And that is only half of the story of getting FlashAttention working and running as fast as the chip allows. It is a major challenge, but hopefully there is no challenge too difficult for the chip to solve. It can be a pain to implement an algorithm that the designers never anticipated. But it is always viable and the results are often reasonably efficient. Maybe I should make a separate post all about implementing FlashAttention on the ET-SoC-1.


Either way, hopefully this post is an interesting read for people who want to see how Esperanto's chip works, or who are just dissatisfied with GPUs' domination and wish we could do better. Feel free to hop into AIFoundry, Nekko's open source community, and join our mission of open source AI.