BLE_HCI_INSTANT_PASSED disconnection after LL_CHANNEL_MAP_REQ during flash erase

We're seeing an issue where the SoftDevice stops responding and ultimately disconnects with the reason BLE_HCI_INSTANT_PASSED. The disconnection only seems to happen when an LL_CHANNEL_MAP_REQ message is received while the Nordic device is performing a flash erase. The Wireshark capture from a sniffer shows the master resending the LL_CHANNEL_MAP_REQ, but the slave doesn't respond until after the instant has passed, which triggers the BLE_HCI_INSTANT_PASSED disconnection.

We've captured the flash erase both with and without a channel map update request arriving during it. Without the request, the device also stops responding, but eventually recovers:

If the slave receives an LL_CHANNEL_MAP_REQ message during the erase, it misses the instant and disconnects:

We're using SoftDevice S140 version 6.1.1.

Is there any reason why the slave would stop responding during the flash erase if we're using the SoftDevice flash API? What determines how far into the future the instant is placed during a channel map update? Is there any other explanation for this behavior?

Thanks!

0 Edvin over 5 years ago

Hello,

What connection interval do you use?

Can you send the sniffer trace as a .pcapng file?

Flash operations can't be done at the same time as the radio is running. What kind of flash erase do you do? Are you using fstorage directly, or together with fds?

BR,
Edvin

0 rkorn over 5 years ago in reply to Edvin

Hi Edvin,

Thanks for your input. The connection interval is 15 ms. We're using fstorage and the Soft Device flash API. Unfortunately, I can't share the full .pcapng file since it contains sensitive client information.

It's my understanding that using the flash API should take care of scheduling the flash operations so that they don't interfere with the radio operations. According to the datasheet, the flash erase can halt the CPU for ~85 ms. I assumed that the Soft Device breaks this up into partial erases in order to meet radio timing requirements. However, the trace suggests that it's actually missing connection intervals while the erase is happening.

Is it expected that the soft device will miss some connection intervals in order to perform the erase? It seems like missing the connection intervals only becomes a problem when the channel map update is received and the slave needs to respond and update its channel map within 6 connection intervals (90 ms), when the instant occurs.

I wonder if the situation could be improved by increasing the connection interval slightly so that the erase (plus some margin) will finish in under 6 connection intervals. Thoughts?
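
The timing argument above can be checked with a few lines of integer arithmetic. This is only a sketch: the function names are made up for illustration, the 6-event lead is the value observed in this thread rather than a guaranteed one, and the model ignores where within a connection event the erase actually begins.

```c
#include <assert.h>
#include <stdint.h>

/* BLE connection intervals are expressed in units of 1.25 ms (1250 us). */
#define CONN_INTERVAL_UNIT_US 1250u

/* Time, in microseconds, between the connection event in which an update
 * PDU is first sent and the event of its instant, when the central places
 * the instant `events_to_instant` connection events ahead. */
uint32_t instant_window_us(uint16_t interval_units, uint16_t events_to_instant)
{
    return (uint32_t)interval_units * CONN_INTERVAL_UNIT_US * events_to_instant;
}

/* Smallest connection interval (in 1.25 ms units) for which a radio blackout
 * of `blackout_us` plus an assumed safety margin `margin_us` still fits
 * inside the window before the instant. */
uint16_t min_interval_units(uint32_t blackout_us, uint32_t margin_us,
                            uint16_t events_to_instant)
{
    assert(events_to_instant > 0);
    uint32_t per_unit_us = CONN_INTERVAL_UNIT_US * (uint32_t)events_to_instant;
    uint32_t needed_us = blackout_us + margin_us;
    /* Round up so the window is never shorter than what is needed. */
    return (uint16_t)((needed_us + per_unit_us - 1u) / per_unit_us);
}
```

With the numbers from this thread, `instant_window_us(12, 6)` gives 90000 us for a 15 ms interval, against an 85000 us page erase. For an assumed 15 ms margin, `min_interval_units(85000, 15000, 6)` gives 14 units, which is 17.5 ms.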

0 Edvin over 5 years ago in reply to rkorn

You are correct that a page erase can take up to 85 ms. It cannot be split up, as it is a single HW operation (the minimum flash erase size is one flash page).

Typically, this is something we see when customers try to run GC (garbage collection with FDS) when they are in a connection with a fairly short connection interval.

I actually thought that the erase flash operation would not succeed, because the softdevice sees that it doesn't have time, but I am not sure whether this decision is typically made in fstorage or fds.

But absolutely. If you need to delete a flash page, increasing the connection interval is a good idea, as long as the central accepts that, of course. If you want to, you don't need to do this until you need to delete a page in flash. When the new connection interval is set, delete the page(s) that you need to delete, and you can request the normal connection parameters again.
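
The sequence described above (request a longer interval, erase once it is in effect, then request the normal parameters again) could be organized as a small state machine. This is a minimal sketch, not SDK code: all type, state, event, and action names are hypothetical, and it does not handle the central rejecting or ignoring the parameter request.

```c
/* States of the hypothetical "slow down, erase, speed up" sequence. */
typedef enum {
    ERASE_IDLE,
    ERASE_WAIT_SLOW_PARAMS,   /* longer interval requested, not yet in effect */
    ERASE_IN_PROGRESS,        /* page erase queued or running */
    ERASE_WAIT_NORMAL_PARAMS, /* normal interval requested again */
} erase_state_t;

/* Inputs the application feeds into the sequence. */
typedef enum {
    EV_ERASE_REQUESTED,       /* application wants to erase a page */
    EV_CONN_PARAMS_UPDATED,   /* connection parameters changed */
    EV_ERASE_DONE,            /* flash backend reported completion */
} erase_event_t;

/* What the application should do in response to a step. */
typedef enum {
    ACT_NONE,
    ACT_REQUEST_SLOW_PARAMS,
    ACT_START_ERASE,
    ACT_REQUEST_NORMAL_PARAMS,
} erase_action_t;

/* Advance the sequence by one event. Events that do not apply to the
 * current state are ignored and leave the state unchanged. */
erase_action_t erase_seq_step(erase_state_t *state, erase_event_t ev)
{
    switch (*state) {
    case ERASE_IDLE:
        if (ev == EV_ERASE_REQUESTED) {
            *state = ERASE_WAIT_SLOW_PARAMS;
            return ACT_REQUEST_SLOW_PARAMS;
        }
        break;
    case ERASE_WAIT_SLOW_PARAMS:
        if (ev == EV_CONN_PARAMS_UPDATED) {
            *state = ERASE_IN_PROGRESS;
            return ACT_START_ERASE;
        }
        break;
    case ERASE_IN_PROGRESS:
        if (ev == EV_ERASE_DONE) {
            *state = ERASE_WAIT_NORMAL_PARAMS;
            return ACT_REQUEST_NORMAL_PARAMS;
        }
        break;
    case ERASE_WAIT_NORMAL_PARAMS:
        if (ev == EV_CONN_PARAMS_UPDATED) {
            *state = ERASE_IDLE;
        }
        break;
    }
    return ACT_NONE;
}
```

In an nRF5 SDK application, the two parameter-request actions would presumably map to `sd_ble_gap_conn_param_update()` and the erase action to `nrf_fstorage_erase()`, with `BLE_GAP_EVT_CONN_PARAM_UPDATE` and `NRF_FSTORAGE_EVT_ERASE_RESULT` feeding the corresponding events back in.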

Best regards,

Edvin

0 BretH over 5 years ago in reply to Edvin

Hi Edvin. I am working together with rkorn.

I like the idea of temporarily increasing the connection interval. Alternatively, we can schedule our fstorage erase prior to our BLE connection.

That said, I have a general question about the LL_CHANNEL_MAP_REQ or a connection parameter update request. Both of these involve scheduling the channel map update or connection parameter update at an instant in the future.

In the Wireshark screenshot provided in the original post, we can see there are 7 back-to-back retries of the LL_CHANNEL_MAP_REQ from the master. The 7th request is finally received and processed by the SoftDevice and a return packet is sent. The first LL_CHANNEL_MAP_REQ packet was in connection event 8352 with a scheduled instant of 8358. Because the flash erase takes some time and packets are missed by the SoftDevice, the first LL_CHANNEL_MAP_REQ received is in connection event 8358 - the same event as the scheduled instant! The Nordic device responds but ultimately disconnects with the BLE_HCI_INSTANT_PASSED reason.

While this has a low probability of occurring, we are seeing it frequently enough to be a nuisance. Our case is caused by the flash erase preventing reception of the first 6 packets, but it seems this could also happen under poor link conditions, if the peripheral fails to receive the packets before the instant occurs. Due to the guaranteed delivery feature of the LL, this seems to be a hole in the BLE spec, or is there a possibility of the softdevice to recover without disconnecting? It seems like the HCI_INSTANT_PASSED is inevitable. Is this true?

0 rkorn over 5 years ago in reply to Edvin

I am going to try increasing the connection interval and see what happens.

When I mentioned partial erases, I was referring to this feature of the NVM controller: https://infocenter.nordicsemi.com/index.jsp?topic=%2Fps_nrf52840%2Fnvmc.html

It seems like this feature would be particularly useful for the soft device. Do you know if it uses the partial erase?

0 Edvin over 5 years ago in reply to rkorn

I wasn't aware of this, actually. Did you try it out?

BretH:

BretH said:
Due to the guaranteed delivery feature of the LL, this seems to be a hole in the BLE spec, or is there a possibility of the softdevice to recover without disconnecting?

It is not possible to avoid the disconnect, unfortunately. The central will also consider the link lost, because this packet is not replied to. I think the central decides which instant the LL_CHANNEL_MAP_REQ will apply from. If it says 6 connection intervals, then in my view, this is the "bad behavior", since this is far shorter than the connection timeout itself (but not illegal as far as I know).

0 rkorn over 5 years ago in reply to Edvin

I increased the connection interval to 22.5 ms and ran a stress test we've been using to detect this issue. It performed 50 iterations in a row without a single failure. Previously the failure rate was about 1 in 10. Unfortunately, increasing the connection interval isn't an option for this project for reasons that aren't related to this issue, but the results of this test support our understanding of the failure mode.

Edvin said:
I wasn't aware of this, actually. Did you try it out?

I think you mean try the partial erase? That happens in the soft device, so I don't think I can modify it.

I noticed there is a bug fix in the release notes of the next version of the SoftDevice (7.0.1, we’re using 6.1.1) that seems somewhat related, but doesn’t explicitly apply to the nRF52840:

The wording makes it sound like the bug is just scheduling more time than is needed and not that it actually causes any connection issues. Can you comment on this? Again, we're too far along in our development process to switch SoftDevice versions for this project, but it would be good to know for the future.

0 BretH over 5 years ago in reply to Edvin

Edvin said:
I think the central decides which instant the LL_CHANNEL_MAP_REQ will apply from. If it says 6 connection intervals, then in my view, this is the "bad behavior", since this is far shorter than the connection timeout itself (but not illegal as far as I know).

I reviewed a handful of BLE sniffer trace logs in which the masters were Android and iOS phones as well as Windows desktops, and almost all used instants that were 6-8 connection events later. It seems this is typical behavior.

Edvin said:
The central will also consider the link lost, because this packet is not replied to.

Does this mean that although BLE is usually robust with retries, the channel map update or connection param update instant must have a round trip packet exchange in the specific connection event? If the slave fails to receive the packet or the master fails to receive the response, will the connection always disconnect? If so, it seems disconnections would be more frequent in general, especially when these update procedures happen while devices are at range or in a poor RF environment where packet retries occur regularly.

0 Edvin over 5 years ago in reply to BretH

BretH said:
Does this mean that although BLE is usually robust with retries, the channel map update or connection param update instant must have a round trip packet exchange in the specific connection event?

Yes. According to the BLE specification, the central decides on connection parameters. Since this "request" is effective from a specific connection event, if the peripheral fails to reply/ACK this packet before this event, the devices shall disconnect.

I tried to find the section of the spec that says so, without luck, but it is also mentioned here:
https://stackoverflow.com/questions/48447645/android-ble-peripheral-disconnects-with-status-code-ble-hci-instant-passed0x28
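
For illustration, the "instant passed" decision is usually described as a wrap-around comparison on the 16-bit connection event counter: an instant counts as being in the past when (instant - connEventCount) modulo 65536 is 32767 or more. The sketch below assumes that rule, which should be verified against the Link Layer part of the Core Specification version in use. The function names are hypothetical, and the sketch does not settle how a given stack handles a PDU that first arrives in the instant's own event, which is the case reported in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Connection events remaining until the instant, computed modulo 65536
 * because the connection event counter is 16 bits wide and wraps. */
uint16_t events_until_instant(uint16_t instant, uint16_t conn_event_count)
{
    return (uint16_t)(instant - conn_event_count);
}

/* Assumed rule: a difference of 32767 or more means the instant lies in
 * the past rather than far in the future. */
bool instant_has_passed(uint16_t instant, uint16_t conn_event_count)
{
    return events_until_instant(instant, conn_event_count) >= 32767u;
}
```

With the counters from this thread, a request seen in event 8352 for instant 8358 leaves 6 events, while the same request first seen in event 8359 would be treated as passed.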

What surprises me is that the softdevice performs the erase page operation even though it is in a connection with a connection interval that would suggest it doesn't have time for it. Is it the peer manager that performs this erase page, or do you do it manually? Is there some way for me to reproduce this? I thought that the softdevice or peer manager would see that there is no time to do this at this point, and save the operation for later. Can you please describe where the erase page call is coming from?

BR,

Edvin

0 BretH over 5 years ago in reply to Edvin

Edvin said:
Can you please describe where the erase page call is coming from?

The erase is made through a call to nrf_fstorage_erase() with the nrf_fstorage_sd backend. I assume this is easy to reproduce.

With a connection interval of 15 ms, the SoftDevice will be unable to schedule the erase until a disconnection, so perhaps it goes ahead and performs the operation immediately.

Regardless, we now understand the expected behavior. We will resolve our issue by performing a thorough flash erase within bank 1 of flash during MCU init rather than doing the page erases just-in-time during a BLE connection.