nRF7002 drops downlink in power save / RPU powers down RX while still advertising PM=0 (no PM=1 Null), so the AP delivers directly and the frame is lost

Hello all!

Summary

While in power save, the nRF7002 RPU powers down its receiver without first sending a PM=1 (Null) frame, so from the AP's point of view the station is still in active mode (PM=0). The AP therefore delivers downlink frames (for example a TCP SYN-ACK) directly instead of buffering them. Because the receiver is off, the frame is never ACKed, and the AP retransmits it at the MAC layer (retry=1) up to around 40 times into a dead receiver. The result is intermittent failure of connection setup (TCP handshake, DHCP-ACK, DNS / gateway ARP), ending in the usual 3 s socket timeout and recovered only by a retry or reconnect. First-attempt connections are unreliable even on perfectly healthy APs.

Environment

  • nRF7002 (companion radio) + nRF5340 host (MDBT53-P1M)
  • nRF Connect SDK v3.2.4 (Zephyr 4.2.99), nrf_wifi driver
  • CONFIG_NRF_WIFI_LOW_POWER=y, legacy PS (WIFI_PS_MODE_LEGACY), WIFI_PS_PARAM_STATE=enabled
  • WPA2-PSK, normal infrastructure AP, 2.4 GHz ch 6, beacon interval 100 TU, DTIM period 1, STA AID = 1

Evidence

1. TCP handshake, decrypted (server 20.107.35.42:443, STA 192.167.14.97)

13.342  STA → server   52809 → 443 [SYN]
13.364  server → STA   443 → 52809 [SYN, ACK]                       +22 ms, server is healthy
13.367  server → STA   [TCP Retransmission] [SYN, ACK]   ┐
13.375  server → STA   [TCP Retransmission] [SYN, ACK]   │  ~17 copies of the SAME
  ...                                                    │  SYN-ACK in ~66 ms
13.430  server → STA   [TCP Retransmission] [SYN, ACK]   ┘
14.022  STA → server   [TCP Retransmission] 52809 → 443 [SYN]  ×7
   SYN-ACK retransmit count per connection (4 cycles): 39, 3, 3, 5

2. Same window at the raw 802.11 layer (no decryption); this is the key part

time     dir         frame      retry  PM
13.342   STA → AP    QoS Data   0      0     the SYN (STA awake to transmit)
13.364   AP  → STA   QoS Data   0      -     SYN-ACK, 1st delivery
13.367   AP  → STA   QoS Data   1      -     ┐
13.375   AP  → STA   QoS Data   1      -     │ ~17× the SAME frame, wlan.fc.retry=1
  ...                                        │ = 802.11 MAC retransmission (no ACK from STA)
13.430   AP  → STA   QoS Data   1      -     ┘
   STA "Null PM=1" frames seen at: 12.460 … 15.020   → NONE in between (the whole handshake)
  • Every SYN-ACK copy has wlan.fc.retry = 1, so the AP is re-sending one un-ACKed frame. This is not TCP RTO.
  • The STA sends no PM=1 Null during the handshake. It advertises active mode the whole time, yet does not ACK, so its receiver must be off.
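
The retry-flag argument above can be checked mechanically on an exported frame list. This is a minimal sketch, not the tooling used for the capture: the `Frame` record and the `retry_bursts` helper are hypothetical names, and grouping is done on the retry bit alone (a stricter check would also compare the 802.11 sequence number, `wlan.seq`). Only the four rows printed in the table are used; the elided rows are not filled in.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    time_s: float  # capture timestamp in seconds
    retry: int     # 802.11 Frame Control retry bit (wlan.fc.retry)


def retry_bursts(frames):
    """Group AP-to-STA frames into bursts: one first delivery (retry=0)
    followed by the MAC retransmissions (retry=1) that come after it."""
    bursts = []
    for frame in frames:
        if frame.retry == 0 or not bursts:
            bursts.append([frame])      # a new first delivery starts a burst
        else:
            bursts[-1].append(frame)    # retry=1 belongs to the current burst
    return bursts


# Rows taken from the table above; the "..." rows are deliberately omitted.
downlink = [
    Frame(13.364, 0),
    Frame(13.367, 1),
    Frame(13.375, 1),
    Frame(13.430, 1),
]

for burst in retry_bursts(downlink):
    retries = sum(f.retry for f in burst)
    span_ms = (burst[-1].time_s - burst[0].time_s) * 1000
    print(f"first delivery at {burst[0].time_s:.3f} s, "
          f"{retries} MAC retries, span {span_ms:.0f} ms")
```

A TCP-level retransmission would show up as a new first delivery (retry=0) after the sender's RTO, so a single burst made only of retry=1 copies points at the MAC layer.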

3. Beacon TIM: the AP never buffers it (so it is not a "missed DTIM" problem)

STA AID = 1
beacon during the storm:  dtim_count=0  dtim_period=1  multicast=0  bitmap=0x78bb07
   0x78 = 0b01111000 → AIDs 3,4,5,6 have buffered data;  AID 1 (our STA) is NOT set

PS buffering and TIM signalling on the AP side work fine: the AP flags other AIDs. Our STA's reply is simply never queued, because the STA never told the AP it was asleep.
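
The bitmap reading above can be reproduced with a small decoder. This is a minimal sketch, assuming the hex string lists the partial virtual bitmap octets in transmission order and that the bitmap offset is 0; `buffered_aids` and its `first_octet` parameter are illustrative names, not part of any library.

```python
def buffered_aids(bitmap_hex: str, first_octet: int = 0) -> list[int]:
    """Return the AIDs whose bit is set in a TIM partial virtual bitmap.

    bitmap_hex  -- bitmap octets as a hex string, in transmission order
    first_octet -- index of the first octet within the full virtual bitmap
                   (0 when the bitmap offset is 0)
    Bit b of octet n maps to AID n * 8 + b. Bit 0 of octet 0 is AID 0,
    which is not a station AID.
    """
    aids = []
    for index, octet in enumerate(bytes.fromhex(bitmap_hex)):
        for bit in range(8):
            if octet & (1 << bit):
                aids.append((first_octet + index) * 8 + bit)
    return aids


STA_AID = 1  # our station's AID, from the capture

aids = buffered_aids("78bb07")
print("AIDs with buffered data:", aids)
print("our STA buffered:", STA_AID in aids)
```

With the beacon value from the storm, the first octet 0x78 yields AIDs 3 to 6 and leaves AID 1 clear, which is the basis for ruling out a missed-DTIM explanation.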

4. Same failure on DHCP and DNS (device console, separate runs)

DHCP:  Received dhcp [op=0x2 flags=0x80]        OFFER received (broadcast)
       send request dst=255.255.255.255 ...
       DHCP timeout after 3 retries             broadcast ACK never arrived → DHCP failed -7

DNS:   submitting DNS query (server = gateway 192.167.1.1)
       Query timeout ; DNS attempt 1/5 failed (ret=-101)   -ENETUNREACH: gateway ARP reply
       ... ×5 → DNS resolution failed                      missed, so the query is unsendable

What we tried (none of it fixes it)

  • WIFI_PS_PARAM_TIMEOUT (post-TX wake window): applied successfully, no observable change.
  • CONFIG_NRF70_RPU_PS_IDLE_TIMEOUT_MS: tried 10 / 25 / 100 ms, identical residual.
  • WIFI_PS_PARAM_EXIT_STRATEGY = EVERY_TIM: improves the buffered legs (DHCP/DNS) but not the TCP SYN-ACK, because nothing is buffered there: the AP delivers it directly.
  • Only WIFI_PS_PARAM_STATE = disabled (Constantly Awake Mode) eliminates it: the receiver stays on, the STA ACKs normally, retransmits drop to single digits or zero.

Questions

  1. Is it expected that, with CONFIG_NRF_WIFI_LOW_POWER=y, the RPU powers the receiver down without sending a PM=1 Null first? Per 802.11, a STA in active mode (PM=0) must keep its receiver on; to sleep, it must first announce PM=1 so that the AP buffers its traffic.
  2. If this is by design: is there a way to make the RPU announce PM=1 before sleeping (so the AP buffers and TIM-signals the reply), or to keep the RX up while a recently active flow may still receive an immediate reply (for example during a TCP handshake)?
  3. If not by design, this looks like an RPU firmware defect; please advise.

Best regards, and thank you in advance!

Reply
  • We kept the sniffer running and re-analysed a fresh session at the raw 802.11 layer. There are two refinements to the original report; the second one extends it.

    1. The clearest reproduction is gateway ARP, no decryption needed at all

    Every failed DNS cycle in 10 capture sessions failed with -ENETUNREACH: the DNS server is the gateway, and the gateway's ARP reply is what gets lost. The pattern, one failed and one successful attempt 4 s apart in the same association (device 14:e2:89:11:1e:32, gateway behind the AP 70:a7:41:f7:57:e6, RSSI -60):

    failed attempt
    40.836  STA → broadcast   ARP request (QoS Data, 134 B)
    40.850  GW  → STA         ARP reply, first delivery        +14 ms
    40.852  STA → AP          Null PM=1                        sleep announced 16 ms after own TX
    40.850 .. 40.870  GW → STA  14 copies, wlan.fc.retry=1, none ACKed, AP gives up
       next beacon TIM: partial virtual bitmap 78bb07 → 7abb07 = AID 1 (us) now buffered
    40.938 .. 44.35   STA sends Null PM=0 / PM=1 several times, AP delivers nothing,
                      TIM bit for AID 1 stays set the whole time
    
    successful attempt, same association, 4 s later
    44.841  STA → broadcast   ARP request
    44.841 .. 44.851  GW → STA  ~20 copies of the reply, STA finally ACKs one ~10 ms in
    44.851  STA → GW          DNS query, cycle proceeds normally
    

    In another window the reply arrived 0.7 ms after the device's own transmission and was already lost. So the receiver is off essentially at TX completion, and whether a connection attempt works is a race between the RPU re-opening its RX and the AP exhausting its MAC retry budget (16 to 19 retransmissions in 5 to 10 ms). This also explains why CONFIG_NRF70_RPU_PS_IDLE_TIMEOUT_MS 10/25/100 ms made no difference for us: the reply lands inside any of those windows.

    2. When PM=1 is sent, the buffered frame is still never retrieved

    This is the part that is new compared to the original post. In the ARP case above the station did announce PM=1 (16 ms after its TX, 2 ms after the first delivery had already died un-ACKed). The AP then does everything right: it buffers the next reply and sets the TIM bit for AID 1, verified in the beacons (78bb07 → 7abb07, bitmap offset 0). The station announces PM=0 several times over the following 3.5 s and the AP delivers nothing; the TIM bit stays set until the station gives up and re-ARPs. So both legs of power save fail: receiving while advertised active, and retrieval of buffered traffic after advertised sleep. Tuning the exit strategy so the AP buffers more (EVERY_TIM) cannot help if the pickup never happens.

    One more data point: the station's own Null and data frames are sent 3 to 4 times back to back (retry=1), so it also misses the AP's ACKs to its own transmissions right after TX, consistent with the RX being gated off at TX completion.

    This narrows question 2 from the original post: even when the PM=1 Null does go out and the AP buffers correctly, why does the RPU not collect the buffered frame on its next wake?
