BLE_HCI_INSTANT_PASSED disconnection after LL_CHANNEL_MAP_REQ during flash erase

We're seeing an issue where the soft device stops responding and ultimately disconnects with the reason BLE_HCI_INSTANT_PASSED. The disconnection only seems to happen when an LL_CHANNEL_MAP_REQ message is received while the Nordic is performing a flash erase. The Wireshark capture from a sniffer shows the master resending the LL_CHANNEL_MAP_REQ, but the slave doesn't respond until after the instant has passed, which triggers the BLE_HCI_INSTANT_PASSED disconnection.

During the time when the flash erase is happening, we've captured instances with and without the channel map update request.  Without the request, the device also stops responding, but eventually recovers:

If the slave receives an LL_CHANNEL_MAP_REQ message during the erase, it misses the instant and disconnects:

We're using Soft Device S140 Version 6.1.1.

Is there any reason why the slave would stop responding during the flash erase if we're using the soft device flash API?  What determines how far into the future the instant is calculated during a channel map update?  Is there any other explanation for this behavior?

Thanks!

  • Hello,

    What connection interval do you use?

    Can you send the sniffer trace as a .pcapng file? 

    Flash operations can't be done at the same time as the radio is running. What kind of flash erase do you do? Are you using fstorage directly, or together with fds?

    BR,
    Edvin

  • Hi Edvin,

    Thanks for your input.  The connection interval is 15 ms.  We're using fstorage and the Soft Device flash API.  Unfortunately, I can't share the full .pcapng file since it contains sensitive client information.

    It's my understanding that using the flash API should take care of scheduling the flash operations so that they don't interfere with the radio operations.  According to the datasheet, the flash erase can halt the CPU for ~85 ms.  I assumed that the Soft Device breaks this up into partial erases in order to meet radio timing requirements.  However, the trace suggests that it's actually missing connection intervals while the erase is happening.

    Is it expected that the soft device will miss some connection intervals in order to perform the erase?  It seems like missing the connection intervals only becomes a problem when the channel map update is received and the slave needs to respond and update its channel map within 6 connection intervals (90 ms), when the instant occurs. 

    I wonder if the situation could be improved by increasing the connection interval slightly so that the erase (plus some margin) will finish in under 6 connection intervals.  Thoughts?
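
To make the timing argument concrete, here is a minimal C sketch of the arithmetic. The 85 ms erase time and the instant 6 connection events ahead are the figures quoted in this thread. The one-interval margin (so that at least one connection event remains in which to receive the request) and the function names are assumptions for illustration, not documented SoftDevice behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Figures quoted in this thread (assumptions, not SoftDevice constants). */
#define PAGE_ERASE_US   85000u  /* worst-case page erase, ~85 ms */
#define INSTANT_EVENTS  6u      /* instant placed 6 connection events ahead */

/* Time from the first transmission of the request to the instant, in us. */
static uint32_t instant_window_us(uint32_t conn_interval_us, uint32_t events_ahead)
{
    return conn_interval_us * events_ahead;
}

/* True if a radio blackout of block_us still leaves one full connection
 * interval before the instant (assumed margin for receiving the request). */
static bool erase_fits_before_instant(uint32_t conn_interval_us,
                                      uint32_t events_ahead,
                                      uint32_t block_us)
{
    assert(conn_interval_us > 0u);
    return (block_us + conn_interval_us) <
           instant_window_us(conn_interval_us, events_ahead);
}
```

Under these assumptions, a 15 ms interval gives a 90 ms window, which is too short for an 85 ms erase plus one interval. A 22.5 ms interval gives 135 ms, which leaves room.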

  • I am going to try increasing the connection interval and see what happens.

    When I mentioned partial erases, I was referring to this feature of the NVM controller:   https://infocenter.nordicsemi.com/index.jsp?topic=%2Fps_nrf52840%2Fnvmc.html

    It seems like this feature would be particularly useful for the soft device.  Do you know if it uses the partial erase?

  • I wasn't aware of this, actually. Did you try it out?

     

    BretH said:
    Due to the guaranteed delivery feature of the LL, this seems to be a hole in the BLE spec, or is there a possibility of the softdevice to recover without disconnecting?

     It is not possible to avoid the disconnect, unfortunately. The central will also consider the link lost, because this packet is not replied to. I think the central decides which instant the LL_CHANNEL_MAP_REQ will apply from. If it says 6 connection intervals, then in my view, this is the "bad behavior", since this is far shorter than the connection timeout itself (but not illegal as far as I know). 

    I increased the connection interval to 22.5 ms and ran a stress test we've been using to detect this issue. It performed 50 iterations in a row without a single failure. Previously the failure rate was about 1 in 10. Unfortunately, increasing the connection interval isn't an option for this project for reasons that aren't related to this issue, but the results of this test support our understanding of the failure mode.

    I wasn't aware of this, actually. Did you try it out?

    I think you mean try the partial erase?  That happens in the soft device, so I don't think I can modify it.

    I noticed there is a bug fix in the release notes of the next version of the Soft Device (7.0.1, we’re using 6.1.1) that seems somewhat related, but doesn’t explicitly apply to the nRF52840:

    The wording makes it sound like the bug is just scheduling more time than is needed and not that it actually causes any connection issues. Can you comment on this? Again, we're too far along in our development process to switch soft device versions for this project, but it would be good to know for the future.

    I think the central decides which instant the LL_CHANNEL_MAP_REQ will apply from. If it says 6 connection intervals, then in my view, this is the "bad behavior", since this is far shorter than the connection timeout itself (but not illegal as far as I know). 

    I reviewed a handful of BLE sniffer trace logs that involved masters of Android and iOS phones as well as Windows desktops, and almost all used instants that were 6-8 connection events later. It seems this is typical behavior.

    The central will also consider the link lost, because this packet is not replied to.

    Does this mean that although BLE is usually robust with retries, the channel map update or connection param update instant must have a round trip packet exchange in the specific connection event? If the slave fails to receive the packet or the master fails to receive the response, a connection will always disconnect? It seems like in general, there would be more frequent disconnections occurring especially if these update procedures happen when devices are at range or in a negative RF environment where packet retries would occur regularly.

  • BretH said:
    Does this mean that although BLE is usually robust with retries, the channel map update or connection param update instant must have a round trip packet exchange in the specific connection event?

     Yes. According to the BLE specification, the central decides on connection parameters. Since this "request" is effective from a specific connection event, if the peripheral fails to reply/ACK this packet before this event, the devices shall disconnect. 

    I tried to find the section in the spec that says so, without luck, but it is also mentioned here:
    https://stackoverflow.com/questions/48447645/android-ble-peripheral-disconnects-with-status-code-ble-hci-instant-passed0x28
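
For reference, the check behind BLE_HCI_INSTANT_PASSED can be sketched in C as below. This assumes the Link Layer rule as it is commonly summarised: an instant counts as passed when (instant - connEventCount) mod 65536 is 32767 or more. That threshold should be verified against the Core Specification. `instant_passed` is a hypothetical helper name, not a SoftDevice API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Both values are 16-bit connection event counters that wrap at 65536.
 * Assumed rule: the instant is in the past if the wrapped difference
 * (instant - conn_event_count) is 32767 or greater. */
static bool instant_passed(uint16_t instant, uint16_t conn_event_count)
{
    /* The cast performs the "mod 65536" step after integer promotion. */
    uint16_t delta = (uint16_t)(instant - conn_event_count);
    return delta >= 32767u;
}
```

With this rule, a request whose instant is 6 events ahead is still valid, but if the peripheral first handles it 6 events after the instant, the wrapped difference is 65530 and the link is considered lost.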

    What surprises me is that the softdevice performs the erase page operation even though it is in a connection with a connection interval that would suggest it doesn't have time for it. Is it the peer manager that performs this erase page, or do you do it manually? Is there some way for me to reproduce this? I thought that the softdevice or peer manager would see that there is no time to do this at this point, and save the operation for later. Can you please describe where the erase page call is coming from?

    BR,

    Edvin

  • Can you please describe where the erase page call is coming from?

    The erase is made through a call to nrf_fstorage_erase() with the nrf_fstorage_sd backend. I assume this is easy to reproduce.

    With a connection interval of 15ms, the soft device will be unable to schedule the erase until a disconnection, so perhaps it goes ahead and performs the operation immediately.

    Regardless, we now understand the expected behavior. We will resolve our issue by performing a thorough flash erase within bank 1 of flash during mcu init rather than doing the page erases just-in-time during a BLE connection.

  • Hello,

     

    BretH said:
    We will resolve our issue by performing a thorough flash erase within bank 1 of flash during mcu init rather than doing the page erases just-in-time during a BLE connection.

     Actually, I wouldn't recommend this. I don't know if you are aware, but the flash on the nRF has a "guaranteed" endurance of 10 000 write/erase cycles. See "Endurance" here.

    I believe it is more common to use FDS, but the principle is the same. You should wait until you get some "flash is full" event before you start erasing. 

    Erasing the flash on startup is particularly unfortunate, because every erase uses up one cycle for that page. If the device is battery powered, a low battery can cause several resets in a row: the chip draws current, the battery voltage drops under the load, the chip detects that the voltage is below the threshold and powers off, the current draw stops, the battery voltage recovers, and the chip powers on again. If you erase a page on every startup in this situation, you may use up many (!) erase cycles for no benefit. 

    If possible, you should rather do the flash erase in the disconnected events, or when the flash is full. 
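
One way to defer erases to the disconnected state is sketched below in C. Everything here is hypothetical: the `erase_deferrer_t` type, the function names and the `erase` hook are made up for illustration. In a real application the hook would presumably wrap nrf_fstorage_erase(), and the two event functions would be called from the BLE connected and disconnected event handlers. `count_erase` is only a stand-in so the sketch can be exercised without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8u

/* Hypothetical hook that starts a page erase (e.g. a wrapper around fstorage). */
typedef void (*erase_fn_t)(uint32_t page_addr);

typedef struct {
    uint32_t   pages[MAX_PENDING]; /* page addresses waiting to be erased */
    size_t     count;
    bool       connected;
    erase_fn_t erase;
} erase_deferrer_t;

/* Stand-in hook for this sketch: counts calls instead of touching flash. */
static uint32_t g_erased_count;
static void count_erase(uint32_t page_addr) { (void)page_addr; g_erased_count++; }

/* Erase now if disconnected, otherwise queue. Returns false if the queue is full. */
static bool deferrer_request(erase_deferrer_t *d, uint32_t page_addr)
{
    assert(d != NULL && d->erase != NULL);
    if (!d->connected) { d->erase(page_addr); return true; }
    if (d->count >= MAX_PENDING) { return false; }
    d->pages[d->count++] = page_addr;
    return true;
}

static void deferrer_on_connected(erase_deferrer_t *d) { d->connected = true; }

/* Flush every queued erase once the link is down. */
static void deferrer_on_disconnected(erase_deferrer_t *d)
{
    d->connected = false;
    for (size_t i = 0; i < d->count; i++) { d->erase(d->pages[i]); }
    d->count = 0;
}
```

This only illustrates the bookkeeping. A real erase is asynchronous, so an actual implementation would also need to handle completion events and a new connection arriving while erases are still in progress.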

    I don't know what sort of data you are storing, and how you handle the addressing, but if you have not considered FDS, you should at least read about it. It will handle the addressing and updating of flash records, so that you perhaps don't have to do an erase that often, because it utilizes the entire flash pages dedicated for the FDS.

    I don't know when you would typically do a page erase. It shouldn't be too often, so the chance of this colliding with the channel map shouldn't be that big. If it occurs often, then perhaps you are doing an erase immediately after connection? Try to avoid this time specifically.

    I see from the documentation that flash operations have the same priority as normal connection events, so it looks like what I said about the erase being blocked is not true. Since the connection is not about to time out when the erase starts, the erase is allowed to proceed.

    Try to avoid erasing flash pages during the initialization of the connection establishment, and if possible, try to avoid it while being in a connection.

    BR,
    Edvin

  • Thanks for the feedback. In our case, we are writing up to 400kBytes of short-term data to flash all at once. FDS isn't suitable for this specific scenario. We are successfully using FDS for other persistent parameters.

    The 400kB write will only happen a few times in the lifetime of the device. The flash erase at init is only performed if flash is not already erased, meaning there won't be excessive erase cycles. We read each page to make sure it is not already erased before erasing.
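
The "only erase if not already erased" check described above could look roughly like this in C. It is a sketch, not the code used in the project: `page_is_erased` is a hypothetical name, and it relies on erased flash reading back as all ones, with the 4 kB page size of the nRF52840.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLASH_PAGE_SIZE_BYTES 4096u       /* nRF52840 flash page size */
#define ERASED_WORD           0xFFFFFFFFu /* erased flash reads as all ones */

/* Returns true if every 32-bit word in the page still holds the erased value.
 * page points at the first word of the page, page_size_bytes is its length. */
static bool page_is_erased(const uint32_t *page, size_t page_size_bytes)
{
    assert(page != NULL);
    for (size_t i = 0; i < page_size_bytes / sizeof(uint32_t); i++) {
        if (page[i] != ERASED_WORD) {
            return false; /* at least one bit has been programmed */
        }
    }
    return true;
}
```

At init, a page would then be erased only when `page_is_erased(page, FLASH_PAGE_SIZE_BYTES)` returns false, so repeated resets do not consume erase cycles on pages that are already blank.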

    Lastly, our device is wall-powered, so voltage drops at startup are low risk.

    I think we have all of our questions answered. Thanks for the input!

  • Thanks for the in depth responses, Edvin.  Like BretH said, I think you've answered everything we need to know.
