We are evaluating the nRF54LM20B for an embedded audio classification application and are trying to determine whether the Axon NPU can meet our latency and energy targets compared to other edge-AI accelerators.
From the datasheet (https://docs.nordicsemi.com/bundle/ps_nrf54LM20A/page/keyfeatures_html5.html) we noticed that the Axon NPU running-current entries (IAXONS0–IAXONS3) show approximately 2.7–3.5 mA while the NPU is active, depending on the benchmark model. However, we could not find enough information to estimate throughput or energy per MAC.
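To illustrate why the running current alone is not enough for us: estimating energy per MAC also requires the inference latency, which is exactly the figure we are missing. Here is a minimal sketch of the calculation we would like to perform; every value below is a placeholder we picked for illustration, not a datasheet figure:

```python
# Energy-per-MAC estimate from running current: needs latency, the missing number.
# All values below are placeholders, NOT datasheet figures.

I_NPU_A = 3.0e-3     # assumed NPU running current, mid-range of IAXONS0-IAXONS3 (A)
V_SUPPLY_V = 1.8     # assumed supply voltage (placeholder)
T_INF_S = 10e-3      # hypothetical inference latency (s) -- the unknown we are asking about
N_MAC = 50e6         # MACs per inference for our model

energy_per_inference_j = I_NPU_A * V_SUPPLY_V * T_INF_S
energy_per_mac_pj = energy_per_inference_j / N_MAC * 1e12
print(f"{energy_per_inference_j * 1e6:.1f} uJ/inference, {energy_per_mac_pj:.2f} pJ/MAC")
```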
For context, one of our CNN models currently deployed on a dedicated CNN accelerator has the following approximate characteristics (a sketch of the MAC arithmetic follows the list):
- ~46–50 million MACs per inference
- ~57k parameters (~58 KB INT8 weights)
- Input: 128 × 128 spectrogram
- 6 convolution layers (3×3) + 1 fully connected layer
- INT8 quantized
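For reference, this is how we arrive at a MAC count in that range. The layer shapes below are illustrative placeholders rather than our exact network; they are only meant to show the arithmetic:

```python
# Per-layer MAC arithmetic behind a ~46-50M figure.
# Layer shapes are illustrative placeholders, not our exact network.

def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs for a dense 2D convolution: one k*k*c_in dot product per output element."""
    return h_out * w_out * c_out * (k * k * c_in)

def fc_macs(n_in: int, n_out: int) -> int:
    """MACs for a fully connected layer: one multiply-accumulate per weight."""
    return n_in * n_out

# Hypothetical 6-conv stack on a 128x128 input, followed by one FC layer:
layers = [
    conv2d_macs(128, 128, 1, 16),
    conv2d_macs(64, 64, 16, 32),
    conv2d_macs(32, 32, 32, 48),
    conv2d_macs(16, 16, 48, 64),
    conv2d_macs(8, 8, 64, 64),
    conv2d_macs(4, 4, 64, 96),
    fc_macs(4 * 4 * 96, 10),
]
print(f"total: {sum(layers) / 1e6:.1f} M MACs per inference")  # ~45.7M with these shapes
```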
On our current platform we achieve roughly 15–16 ms inference latency. We are evaluating whether the Axon NPU could meet similar requirements in a single-chip wireless solution.
To help us evaluate the architecture more accurately, could you provide additional information on the Axon NPU such as:
1. Effective compute throughput
   - Sustained INT8 MAC throughput (MAC/s) for convolution workloads
   - Alternatively, MAC/cycle for the NPU data path
2. Benchmark latency
   - Inference latency for any of the reference models listed in the datasheet:
     - DS-CNN keyword spotting
     - MobileNetV1 (VWW)
     - ResNet (IC)
   - These models appear similar in scale to our CNN (~50M MACs).
3. Energy efficiency metrics
   - Typical energy per MAC (pJ/MAC) for INT8 inference
   - Alternatively, energy per inference for the reference networks
4. Memory architecture
   - Does the Axon NPU use dedicated local SRAM for weights and activations, or does it operate primarily from system RAM via DMA?
5. Operator support
   - Could you confirm which neural network operators are executed directly on the Axon NPU versus requiring CPU fallback? Specifically, we are interested in whether the following are natively supported on the NPU datapath:
     - 3×3 convolution
     - strided convolution (e.g., stride = 2)
     - max pooling / average pooling
     - activation functions (ReLU / ReLU6)
     - fully connected layers
     - SAME padding behavior for convolution
   - If any of these operations are not executed on the NPU, it would be helpful to understand whether they run on the CPU or another accelerator in the system.
6. Convolution data path width
   - How many INT8 MAC operations can the Axon NPU perform per cycle?
7. Audio inference examples
   - Any reference pipelines for PDM microphone → spectrogram → Axon NPU inference
The combination of MAC throughput, latency, and energy per MAC would allow us to estimate expected performance for our ~50M MAC CNN and determine whether the nRF54LM20B fits our latency and power envelope for this edge-AI application.
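To make the ask concrete, this is the estimate we plan to run once the figures from items 1, 3, and 6 are available. The parameter values in the example call are made up purely for illustration and would be replaced with Nordic-provided numbers:

```python
# Sketch of the latency/energy estimate we want to make for a ~50M MAC model.
# mac_per_cycle, clock_hz, utilization, and pj_per_mac are placeholders to be
# filled in with the NPU figures requested above.

def estimate(n_mac: float, mac_per_cycle: float, clock_hz: float,
             utilization: float, pj_per_mac: float) -> tuple[float, float]:
    """Return (latency in ms, energy in uJ) for one inference."""
    sustained_mac_per_s = mac_per_cycle * clock_hz * utilization
    latency_s = n_mac / sustained_mac_per_s
    energy_j = n_mac * pj_per_mac * 1e-12
    return latency_s * 1e3, energy_j * 1e6

# Example with made-up numbers: 64 MAC/cycle at 128 MHz, 50% utilization, 1 pJ/MAC.
latency_ms, energy_uj = estimate(50e6, 64, 128e6, 0.5, 1.0)
print(f"~{latency_ms:.1f} ms, ~{energy_uj:.0f} uJ per inference")
```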
Thank you in advance for your help!