Inquiries on the Axon hardware

Question

Hi,

We've been investigating the use of the nRF54LM20B with the integrated Axon NPU.
In order to gauge the Axon's effectiveness, we've constructed a benchmark that profiles the Axon's inference times on 64 differently sized basic CNN models from ~4KB to ~1.2MB (INT8 quantized).

Surprisingly, the results from our benchmark show that the Axon's inference time scales practically linearly with model size.
We have not tested Axon models larger than ~1.2MB yet, but we expected to hit a memory wall far before that point.
Since we've been hitting the Axon with models where the largest layer is >1MB, it seems that either the Axon's tightly coupled memory is very generously sized, or our understanding of how the Axon operates is faulty in some way.
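
To make "practically linearly" concrete, here is a minimal sketch of how such a check could be done: an ordinary least-squares fit of inference time against model size, with R² as the measure of linearity. The function name `linear_fit_r2` and the sample values in the usage example are hypothetical placeholders, not our benchmark data.

```python
def linear_fit_r2(sizes_kb, times_ms):
    """Least-squares fit of times_ms = slope * sizes_kb + intercept.

    Returns (slope, intercept, r2). An r2 close to 1.0 means the
    measurements are well described by a straight line.
    """
    n = len(sizes_kb)
    mean_x = sum(sizes_kb) / n
    mean_y = sum(times_ms) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in sizes_kb)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_kb, times_ms))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination: 1 - residual variance / total variance.
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(sizes_kb, times_ms))
    ss_tot = sum((y - mean_y) ** 2 for y in times_ms)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2


if __name__ == "__main__":
    # Placeholder values only (NOT measured data): model size in KB, time in ms.
    sizes = [4, 64, 256, 1200]
    times = [0.54, 1.14, 3.06, 12.5]
    print(linear_fit_r2(sizes, times))
```

A memory wall would show up as a knee in the curve, so fitting only the smaller models and comparing the slope against the larger ones would be a more sensitive test than a single R² value.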

Our understanding of the Axon's internals is as follows, guided in part by the Axon NPU's block diagram shown above:
- CPU writes input values to interlayer buffer
- Axon reads first-layer weights from RRAM and input values from the interlayer buffer into interconnected memory
- resulting activation values (i.e. intermediate output values) are stored in Axon's tightly coupled memory or written to interlayer buffer if necessary
- on to next layer...
- repeat until final output is written to interlayer buffer

From this understanding, it would follow that once an entire layer plus the previous layer's activation values no longer fit in the Axon's tightly coupled memory, performance would crash, as the Axon would be constantly swapping layer weights in and out.
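
To illustrate the behaviour we expected, here is a toy cost model. It is only a sketch of our assumption, not of how the Axon actually works: `tcm_bytes`, `cycles_per_byte` and `spill_penalty` are hypothetical parameters, and we do not know the real size of the tightly coupled memory.

```python
def expected_cycles(layers, tcm_bytes, cycles_per_byte=1.0, spill_penalty=4.0):
    """Toy estimate of inference cost under our assumed memory-wall model.

    layers: list of (weight_bytes, input_activation_bytes), one per layer.
    tcm_bytes: assumed capacity of the tightly coupled memory (hypothetical).
    cycles_per_byte: assumed cost per weight byte when the layer fits.
    spill_penalty: assumed slowdown factor when the layer does not fit.
    """
    total = 0.0
    for weight_bytes, act_bytes in layers:
        cost = weight_bytes * cycles_per_byte
        # Working set = this layer's weights + the previous layer's activations.
        if weight_bytes + act_bytes > tcm_bytes:
            # Assumed penalty for repeatedly swapping weights in and out.
            cost *= spill_penalty
        total += cost
    return total
```

Under this model, cost grows linearly with model size only while every layer fits, and then jumps by `spill_penalty` for each layer that does not. Our measurements show no such jump, even with layers above 1MB.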
Knowing roughly how Axon infers on an architectural level would be highly useful in letting us build ML models that are optimal for this NPU.
Thank you for any help you may be able to provide in clearing this up.