We are evaluating the nRF54LM20B for an embedded audio classification application and are trying to determine whether the Axon NPU can meet our latency and energy targets compared to other edge-AI accelerators.
From the datasheet (https://docs.nordicsemi.com/bundle/ps_nrf54LM20A/page/keyfeatures_html5.html) we noticed that the Axon NPU running-current entries (IAXONS0–IAXONS3) show approximately 2.7–3.5 mA while the NPU is active, depending on the benchmark model. However, we could not find enough information to estimate throughput or energy per MAC.
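To illustrate why the running current alone is not enough for us: estimating energy per MAC also requires the inference latency, which is exactly the figure we are missing. Here is a minimal sketch of the calculation we would like to perform; every value below is a placeholder we picked for illustration, not a datasheet figure:

```python
# Energy-per-MAC estimate from running current: needs latency, the missing number.
# All values below are placeholders, NOT datasheet figures.

I_NPU_A = 3.0e-3     # assumed NPU running current, mid-range of IAXONS0-IAXONS3 (A)
V_SUPPLY_V = 1.8     # assumed supply voltage (placeholder)
T_INF_S = 10e-3      # hypothetical inference latency (s) -- the unknown we are asking about
N_MAC = 50e6         # MACs per inference for our model

energy_per_inference_j = I_NPU_A * V_SUPPLY_V * T_INF_S
energy_per_mac_pj = energy_per_inference_j / N_MAC * 1e12
print(f"{energy_per_inference_j * 1e6:.1f} uJ/inference, {energy_per_mac_pj:.2f} pJ/MAC")
```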
For context, one of our CNN models currently deployed on a dedicated CNN accelerator has the following approximate characteristics (a sketch of the MAC arithmetic follows the list):
- ~46–50 million MACs per inference
- ~57k parameters (~58 KB INT8 weights)
- Input: 128 × 128 spectrogram
- 6 convolution layers (3×3) + 1 fully connected layer
- INT8 quantized
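For reference, this is how we arrive at a MAC count in that range. The layer shapes below are illustrative placeholders rather than our exact network; they are only meant to show the arithmetic:

```python
# Per-layer MAC arithmetic behind a ~46-50M figure.
# Layer shapes are illustrative placeholders, not our exact network.

def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs for a dense 2D convolution: one k*k*c_in dot product per output element."""
    return h_out * w_out * c_out * (k * k * c_in)

def fc_macs(n_in: int, n_out: int) -> int:
    """MACs for a fully connected layer: one multiply-accumulate per weight."""
    return n_in * n_out

# Hypothetical 6-conv stack on a 128x128 input, followed by one FC layer:
layers = [
    conv2d_macs(128, 128, 1, 16),
    conv2d_macs(64, 64, 16, 32),
    conv2d_macs(32, 32, 32, 48),
    conv2d_macs(16, 16, 48, 64),
    conv2d_macs(8, 8, 64, 64),
    conv2d_macs(4, 4, 64, 96),
    fc_macs(4 * 4 * 96, 10),
]
print(f"total: {sum(layers) / 1e6:.1f} M MACs per inference")  # ~45.7M with these shapes
```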
On our current platform we achieve roughly 15–16 ms inference latency. We are evaluating whether the Axon NPU could meet similar requirements in a single-chip wireless solution.
To help us evaluate the architecture more accurately, could you provide additional information on the Axon NPU such as:
1. Effective compute throughput
   - Sustained INT8 MAC throughput (MAC/s) for convolution workloads
   - Alternatively, MAC/cycle for the NPU data path
2. Benchmark latency
   - Inference latency for any of the reference models listed in the datasheet:
     - DS-CNN keyword spotting
     - MobileNetV1 (VWW)
     - ResNet (IC)
   - These models appear similar in scale to our CNN (~50M MACs).
3. Energy efficiency metrics
   - Typical energy per MAC (pJ/MAC) for INT8 inference
   - Alternatively, energy per inference for the reference networks
4. Memory architecture
   - Does the Axon NPU use dedicated local SRAM for weights and activations, or does it operate primarily from system RAM via DMA?
5. Operator support
   - Could you confirm which neural network operators are executed directly on the Axon NPU versus requiring CPU fallback? Specifically, we are interested in whether the following are natively supported on the NPU datapath:
     - 3×3 convolution
     - strided convolution (e.g., stride = 2)
     - max pooling / average pooling
     - activation functions (ReLU / ReLU6)
     - fully connected layers
     - SAME padding behavior for convolution
   - If any of these operations are not executed on the NPU, it would be helpful to understand whether they run on the CPU or another accelerator in the system.
6. Convolution data path width
   - How many INT8 MAC operations can the Axon NPU perform per cycle?
7. Audio inference examples
   - Any reference pipelines for PDM microphone → spectrogram → Axon NPU inference
The combination of MAC throughput, latency, and energy per MAC would allow us to estimate expected performance for our ~50M MAC CNN and determine whether the nRF54LM20B fits our latency and power envelope for this edge-AI application.
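To make the ask concrete, this is the estimate we plan to run once the figures from items 1, 3, and 6 are available. The parameter values in the example call are made up purely for illustration and would be replaced with Nordic-provided numbers:

```python
# Sketch of the latency/energy estimate we want to make for a ~50M MAC model.
# mac_per_cycle, clock_hz, utilization, and pj_per_mac are placeholders to be
# filled in with the NPU figures requested above.

def estimate(n_mac: float, mac_per_cycle: float, clock_hz: float,
             utilization: float, pj_per_mac: float) -> tuple[float, float]:
    """Return (latency in ms, energy in uJ) for one inference."""
    sustained_mac_per_s = mac_per_cycle * clock_hz * utilization
    latency_s = n_mac / sustained_mac_per_s
    energy_j = n_mac * pj_per_mac * 1e-12
    return latency_s * 1e3, energy_j * 1e6

# Example with made-up numbers: 64 MAC/cycle at 128 MHz, 50% utilization, 1 pJ/MAC.
latency_ms, energy_uj = estimate(50e6, 64, 128e6, 0.5, 1.0)
print(f"~{latency_ms:.1f} ms, ~{energy_uj:.0f} uJ per inference")
```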
Thank you in advance for your help!