BLE_HCI_INSTANT_PASSED Disconnection Reason

Hi Team,

We are seeing an intermittent issue where the BLE connection drops abruptly with reason BLE_HCI_INSTANT_PASSED (0x28). We are using nRF SDK version 15.2.0. The mobile device is a Moto G7 Power running Android. We have seen this issue 8 times in the past month with the same device. Please help us with the following queries:

1. When can this happen? I have seen a Stack Overflow post explaining the possible reasons. Is that an exhaustive explanation?

2. How can we prevent it? Our customers can have many different types of devices, so we are looking for a firmware-level solution. We came across this link about extending the connection event length. Can that help?

Thanks in advance

  • Hi Kenneth,

    Thanks for the prompt response. I have a few follow-up questions:

    >> setting higher tolerance of the LFCLK source when init the softdevice may help

    1. How can we set that? Is this a config in sdk_config.h?

    2. What is the value that we can set? Is there some guidance around it?

    >> enabling (or increasing) slave latency on the link will help

    3. I think this can be done only by peer device. Is it possible for the peripheral to set slave latency?

    >> peer is operating at the maximum range (or much 2.4GHz interference from other devices),

    4. Would it help mitigate this issue if we advise our customers not to use other BLE devices while operating our product?

  • pranesa said:

    1. How can we set that? Is this a config in sdk_config.h?

    2. What is the value that we can set? Is there some guidance around it?

    You can find the following defines in sdk_config.h:

    NRF_SDH_CLOCK_LF_SRC
    NRF_SDH_CLOCK_LF_RC_CTIV
    NRF_SDH_CLOCK_LF_RC_TEMP_CTIV
    NRF_SDH_CLOCK_LF_ACCURACY

    It is NRF_SDH_CLOCK_LF_ACCURACY that sets the accuracy the LL uses to compute timing. If either peer has an LFCLK that is out of spec, then configuring a larger tolerance (in ppm) here may help. However, it will also increase current consumption, since the peripheral will then widen its receive window on each connection event. In the online power profiler you can, for instance, see how the clock accuracy setting affects the average current consumption in a connection:
    https://devzone.nordicsemi.com/nordic/power/w/opp/2/online-power-profiler-for-ble

    It is very difficult to say by how much the tolerance should be increased if the LFCLK is out of spec, but you can consider increasing it as far as the average current consumption still meets your battery lifetime target.
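To get a feel for why a looser accuracy setting costs current, here is a minimal sketch of the receive-window widening arithmetic. It assumes the usual Link Layer rule of thumb (widening = combined clock tolerance of both peers multiplied by the time since the last anchor point) and ignores any fixed jitter allowance. The function name is hypothetical and not part of the SDK, and the 50ppm central used in the note below is an assumed value.

```c
#include <stdint.h>

/* Hypothetical helper, not an SDK function.
 * Estimates the extra time (in microseconds) the peripheral has to listen on
 * each side of the expected anchor point, given both clock tolerances in ppm
 * and the number of connection events skipped since the last anchor. */
static uint32_t window_widening_us(uint32_t central_ppm,
                                   uint32_t peripheral_ppm,
                                   uint32_t interval_us,
                                   uint32_t skipped_events)
{
    /* Time since the last anchor grows with every skipped event
     * (slave latency or missed packets). */
    uint64_t elapsed_us = (uint64_t)interval_us * (skipped_events + 1u);
    uint64_t total_ppm  = (uint64_t)central_ppm + peripheral_ppm;

    /* Round up so the estimate never understates the window. */
    return (uint32_t)((elapsed_us * total_ppm + 999999u) / 1000000u);
}
```

With a 30ms interval and an assumed 50ppm central, going from a 20ppm to a 100ppm peripheral setting raises the estimate from 3us to 5us per side. The window also scales with the number of skipped events, so the cost is larger when slave latency is in use.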

    pranesa said:

    >> enabling (or increasing) slave latency on the link will help

    3. I think this can be done only by peer device. Is it possible for the peripheral to set slave latency?

    Yes, it is the central device that controls this. However, the peripheral can request that the peer update the connection parameters, and in most cases the central will accept the request and initiate a connection parameter update based on the requested values.

    In your project these are typically set in gap_params_init()->sd_ble_gap_ppcp_set() and conn_params_init()->sd_ble_gap_conn_param_update(). The ble_conn_params module can be configured to send a connection parameter update request if the current parameters are outside the preferred ones. Typically the preferred SLAVE_LATENCY is set in main.c, but in sdk_config.h there is also a define that controls by how much the slave latency can differ without a connection parameter update request being sent. This define is called NRF_BLE_CONN_PARAMS_MAX_SLAVE_LATENCY_DEVIATION. In other words, you should check both of these parameters, and whether the ble_conn_params module is enabled in your application.
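When picking a preferred slave latency, it is worth checking that the parameters remain consistent with the supervision timeout. The sketch below assumes the Bluetooth rule that the supervision timeout must be larger than (1 + slave latency) * connection interval * 2. The struct is only a stand-in that mirrors the fields of the SDK's ble_gap_conn_params_t so the example compiles on its own; the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in mirroring the fields of ble_gap_conn_params_t (assumption:
 * same units as the SDK struct). Declared here only for illustration. */
typedef struct {
    uint16_t min_conn_interval; /* units of 1.25 ms */
    uint16_t max_conn_interval; /* units of 1.25 ms */
    uint16_t slave_latency;     /* number of connection events */
    uint16_t conn_sup_timeout;  /* units of 10 ms */
} conn_params_sketch_t;

/* Returns true if the parameters are within the BLE ranges and the
 * supervision timeout is long enough for the chosen slave latency. */
static bool conn_params_are_consistent(const conn_params_sketch_t *p)
{
    if (p->min_conn_interval < 6u || p->max_conn_interval > 3200u ||
        p->min_conn_interval > p->max_conn_interval) {
        return false; /* interval must be 7.5 ms .. 4 s, min <= max */
    }
    if (p->slave_latency > 499u) {
        return false;
    }
    if (p->conn_sup_timeout < 10u || p->conn_sup_timeout > 3200u) {
        return false; /* timeout must be 100 ms .. 32 s */
    }

    /* timeout > (1 + latency) * interval_max * 2, compared in microseconds:
     * 10 ms units -> * 10000, 1.25 ms units * 2 -> * 2500. */
    uint64_t timeout_us = (uint64_t)p->conn_sup_timeout * 10000u;
    uint64_t needed_us  = (uint64_t)(1u + p->slave_latency) *
                          p->max_conn_interval * 2500u;
    return timeout_us > needed_us;
}
```

For example, a 30ms interval (24 units) with a slave latency of 4 and a 4s timeout (400 units) passes the check, while the same timeout with a slave latency of 499 does not.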

    pranesa said:

    >> peer is operating at the maximum range (or much 2.4GHz interference from other devices),

    4. Will it help to mitigate this issue if we introduce a guidance to our customers to not use other BLE devices when operating our product?

    I don't believe this will help; I don't think interference is a big problem overall. Instead, you could add a note that users who experience frequent disconnects should change location or move the devices closer together if possible, since that improves the signal-to-noise ratio and the overall link budget.

  • Hi Kenneth,

    >> I don't believe that this help, I don't think interference is a big problem overall.

    In our case, the master is a mobile device connecting to our peripheral. Can a mobile device that is concurrently connected to several other BLE devices (smart watch, headset, etc.) have link budget issues on the BLE link with our peripheral?

    Thanks

  • That may have some impact, yes. It would then depend on the scheduling and priority on the mobile phone when the various connections overlap. Ideally, links that are running LL procedures that take effect at a specific instant should have priority. I do not know how this is handled today in the various mobile operating systems and stacks; it may even vary between chipsets in the mobile phone.

    Best regards,
    Kenneth

  • Hi Kenneth,

    Thanks for helping us with this issue.

    Currently NRF_SDH_CLOCK_LF_ACCURACY is set to 7 which translates to 20 PPM accuracy. We have seen a recommendation in code documentation that we should use 500 PPM accuracy if we are using NRF_CLOCK_LF_SRC_RC (RC oscillator). We are using NRF_CLOCK_LF_SRC_XTAL (crystal oscillator). I have the following query:

    1. Should we move to NRF_CLOCK_LF_SRC_RC along with 500 PPM accuracy? or can we get same accuracy by moving to 500 PPM accuracy and keeping clock source as crystal oscillator?

  • Hi Kenneth,

    I want to understand the problem-3 in a better way.

    When we say a peripheral executing any operation that have higher priority than the BLE link, should we just consider operations supported by softdevice here like BLE advertising, Flash memory access, etc. mentioned in the scheduling priorities link shared by you above? Given we run a separate freertos task to poll softdevice events, should its priority be highest among other application tasks?

    In our setup, we currently support only one BLE connection but as soon as one master is connected to our peripheral we don't stop advertising but switch to a non-connectable advertisement to advertise device state as Busy. Can that advertisement conflict with the operation of already established BLE link?

    We also access flash to retrieve security keys required to enable handshake on the established BLE link, can that also impact the BLE link operation?

    Thanks

  • karnram said:
    When we say a peripheral executing any operation that have higher priority than the BLE link, should we just consider operations supported by softdevice here like BLE advertising, Flash memory access, etc. mentioned in the scheduling priorities link shared by you above?

     Yes. 

    karnram said:
    Given we run a separate freertos task to poll softdevice events, should its priority be highest among other application tasks?

    That doesn't matter, as these control procedures are fully handled by the softdevice. 

    karnram said:
    In our setup, we currently support only one BLE connection but as soon as one master is connected to our peripheral we don't stop advertising but switch to a non-connectable advertisement to advertise device state as Busy. Can that advertisement conflict with the operation of already established BLE link?

    Advertising events and connection events are asynchronous, so from time to time they will overlap. In that case the advertising event may sometimes get the timeslot, and the connection event is then skipped for one period. The advertising event can get the timeslot simply because it reserved it first; the next timeslot is calculated when the previous one has ended. There is not much that can be done to avoid this, but you can reduce the risk of it happening several periods in a row by making the connection interval and the advertising interval differ. E.g. if the connection interval is 200ms, make sure the advertising interval is not the same or a multiple of it. It may actually be better to use an advertising interval of 210ms, 410ms, 610ms or similar, so that if there is an overlap at a specific time, the next interval does not overlap. (I added +10ms to the advertising interval compared to the connection interval, since there is always a 0-10ms random delay on advertising intervals as specified by the BT spec.)
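The interval choice described above can be sketched as a small helper. It takes the connection interval in milliseconds and a multiplier, adds the 10ms offset, and converts the result to the 0.625ms units used for BLE advertising intervals. The function name is hypothetical and not part of the SDK.

```c
#include <stdint.h>

/* Hypothetical helper, not an SDK function.
 * Picks an advertising interval of (multiple * connection interval + 10 ms)
 * and returns it in units of 0.625 ms. */
static uint32_t adv_interval_units(uint32_t conn_interval_ms, uint32_t multiple)
{
    uint32_t adv_interval_ms = conn_interval_ms * multiple + 10u;

    /* 1 ms = 1.6 units of 0.625 ms; computed as ms * 8 / 5,
     * rounded to the nearest unit. */
    return (adv_interval_ms * 8u + 2u) / 5u;
}
```

For a 200ms connection interval, multipliers 1, 2 and 3 give 210ms, 410ms and 610ms, which correspond to 336, 656 and 976 units.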

    karnram said:
    We also access flash to retrieve security keys required to enable handshake on the established BLE link, can that also impact the BLE link operation?

    Reading should not be a problem, no; only flash erase and flash write operations may have some impact.

    pranesa said:

    Currently NRF_SDH_CLOCK_LF_ACCURACY is set to 7 which translates to 20 PPM accuracy. We have seen a recommendation in code documentation that we should use 500 PPM accuracy if we are using NRF_CLOCK_LF_SRC_RC (RC oscillator). We are using NRF_CLOCK_LF_SRC_XTAL (crystal oscillator). I have the following query:

    1. Should we move to NRF_CLOCK_LF_SRC_RC along with 500 PPM accuracy? or can we get same accuracy by moving to 500 PPM accuracy and keeping clock source as crystal oscillator?

    I assume you have an external crystal with 20ppm tolerance. Even so, you may consider initializing the softdevice with a 50ppm or 100ppm tolerance; this widens the receive window to allow for some more drift if the LFCLK in the peer is out of spec.

    Considering you are already using an external crystal, I don't see the need to change to the internal RC and 500ppm. However, you could add some logic to your application so that if you experience a disconnect with a peer due to BLE_HCI_INSTANT_PASSED, the next startup initializes the softdevice with a higher tolerance (even as high as 250ppm when using the external crystal), to check indirectly whether the problem goes away.
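A minimal sketch of that escalation logic is shown below. It only covers the decision step: given the tolerance used in the previous session and the last disconnect reason, it returns the tolerance to use on the next startup. The step values (20, 50, 100, 250ppm) come from this discussion; the function and constant names are hypothetical, and storing the value across resets and mapping it to the softdevice clock configuration are left to the application.

```c
#include <stddef.h>
#include <stdint.h>

/* Disconnect reason discussed in this thread (BLE_HCI_INSTANT_PASSED). */
#define HCI_INSTANT_PASSED 0x28u

/* Tolerance steps in ppm, tightest first. Values taken from the discussion. */
static const uint16_t k_tolerance_steps_ppm[] = { 20u, 50u, 100u, 250u };

/* Hypothetical helper: returns the tolerance to configure on the next
 * startup. Only an INSTANT_PASSED disconnect moves to the next looser step. */
static uint16_t next_tolerance_ppm(uint16_t current_ppm, uint8_t disconnect_reason)
{
    size_t count = sizeof(k_tolerance_steps_ppm) / sizeof(k_tolerance_steps_ppm[0]);

    if (disconnect_reason != HCI_INSTANT_PASSED) {
        return current_ppm;
    }
    for (size_t i = 0; i + 1u < count; i++) {
        if (k_tolerance_steps_ppm[i] == current_ppm) {
            return k_tolerance_steps_ppm[i + 1u];
        }
    }
    /* Already at the loosest step, or an unknown value: keep it unchanged. */
    return current_ppm;
}
```

Stepping one level at a time keeps the current consumption penalty as small as possible while still showing whether a looser tolerance makes the disconnects go away.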

    This isn't an exact science, but hopefully my comments make sense.

  • Hi Kenneth,

    We were able to analyse BLE sniffer logs for this issue and identified the following root cause:

    The Instant Passed (code 40) errors occur on the LL channel map indication commands sent from the phone stack to our peripheral device. The error occurs because the phone sends the command with an instant set to the current connection event counter + 1 (meaning it should take effect at the very next connection event), but by the time the command is sent out, the connection event counter has already increased. Thus, the instant received is the same as or less than the current connection event counter, which produces the stack error.

    For example, the following trace shows that the channel map indication instant was set to 81, but by the time it was sent out, the counter increased to 82.

    This issue happens when the mobile device is concurrently streaming music to a BLE headset. The mobile device referred to here is the Moto G7.

    1. Can we change something on our peripheral firmware to make it robust to such issues? Will increasing connection interval on the firmware side help in this case?

    2. As per the sniffer logs, such failures not always result in disconnection. Do you have some insights from how BLE stack works to better understand the reason for this issue?

    3. We also saw some out of sync packets sent by our peripheral firmware. One such example is shown below where the connection interval is 30ms, but the packets are communicating at ~31ms. What can be done to avoid this? Will increasing the clock PPM accuracy as discussed above help here?

  • karnram said:
    For example, the following trace shows that the channel map indication instant was set to 81, but by the time it was sent out, the counter increased to 82.

    Seems you have found the issue! According to the BLE specification, the instant should be at least (6 + slave latency) connection events in the future, to ensure there is time for retransmissions in case the peripheral is sleeping or packets are lost. So this is a violation of the BLE spec by the phone.
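The check the peripheral's Link Layer performs can be sketched as follows. This is an illustration, not softdevice code: it assumes the rule, as I read the Core Specification, that an instant counts as passed when (instant - connEventCount) mod 65536 is 32767 or more, and it uses the (6 + slave latency) margin mentioned above. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the instant in a received LL control PDU is
 * already in the past relative to the local connection event counter.
 * Both counters are 16 bits wide and wrap around, so the comparison is done
 * on the modulo-65536 difference. */
static bool instant_has_passed(uint16_t instant, uint16_t conn_event_count)
{
    uint16_t delta = (uint16_t)(instant - conn_event_count);
    return delta >= 32767u;
}

/* Hypothetical helper: earliest instant a central should choose so that the
 * peripheral has at least (6 + slave latency) events before it takes effect. */
static uint16_t recommended_min_instant(uint16_t conn_event_count,
                                        uint16_t slave_latency)
{
    return (uint16_t)(conn_event_count + 6u + slave_latency);
}
```

With the values from the trace, instant_has_passed(81, 82) is true, whereas an instant of 88 or later would have met the margin for an event counter of 82 with no slave latency.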

    karnram said:
    1. Can we change something on our peripheral firmware to make it robust to such issues? Will increasing connection interval on the firmware side help in this case?

    I suggest enabling slave latency to see if that helps. If you are concerned that slave latency will impact the throughput of the link, be aware that you can enable slave latency on the link by updating the connection parameters, and still make the peripheral behave as if slave latency were not enabled by configuring the local slave latency to 0 using this API:
    https://infocenter.nordicsemi.com/topic/com.nordic.infocenter.s112.api.v7.2.0/structble__gap__opt__local__conn__latency__t.html?cp=4_7_0_1_2_1_6_49 (I linked to S112 softdevice, but similar for other softdevices). 

    karnram said:
    2. As per the sniffer logs, such failures not always result in disconnection. Do you have some insights from how BLE stack works to better understand the reason for this issue?

    This is purely a stack scheduling issue on the Moto G7, which violates the BLE spec by not following its requirements. The peripheral can't really do anything about this other than recover by reconnecting.

    karnram said:
    3. We also saw some out of sync packets sent by our peripheral firmware. One such example is shown below where the connection interval is 30ms, but the packets are communicating at ~31ms. What can be done to avoid this? Will increasing the clock PPM accuracy as discussed above help here?

    I am not sure exactly what I am seeing here. The peripheral cannot initiate or send any packets in a connection event without first receiving a packet from the peer, so if no packet from the phone is shown, the sniffer may have missed it. However, if the connection interval is 30ms and the first packet the sniffer picks up is at 31.25ms, then it does sound like something is wrong with the timing. Even if you configured a tolerance of 500ppm (which allows 500us of drift in 1 second), that only tolerates a drift of about 15us in 30ms, which is nowhere near the 1.25ms in 30ms seen in the sniffer log. So are you sure that the connection interval is not actually 31.25ms here?
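The drift numbers above can be reproduced with two small helpers: one gives the drift a given tolerance allows over an elapsed time, the other gives the clock error that an observed offset would imply. Both function names are hypothetical and use integer arithmetic that truncates towards zero.

```c
#include <stdint.h>

/* Hypothetical helper: maximum drift (in microseconds) that a clock with the
 * given tolerance can accumulate over elapsed_us. */
static uint32_t max_drift_us(uint32_t tolerance_ppm, uint32_t elapsed_us)
{
    return (uint32_t)(((uint64_t)tolerance_ppm * elapsed_us) / 1000000u);
}

/* Hypothetical helper: clock error in ppm that would be needed to explain
 * an observed timing offset over elapsed_us (elapsed_us must be non-zero). */
static uint32_t implied_error_ppm(uint32_t observed_offset_us, uint32_t elapsed_us)
{
    return (uint32_t)(((uint64_t)observed_offset_us * 1000000u) / elapsed_us);
}
```

max_drift_us(500, 30000) gives 15, matching the 15us figure, while implied_error_ppm(1250, 30000) gives 41666, far beyond any configurable LFCLK tolerance. That supports the suspicion that the interval itself is 31.25ms.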

    pranesa said:
    1. Should we move to NRF_CLOCK_LF_SRC_RC along with 500 PPM accuracy? or can we get same accuracy by moving to 500 PPM accuracy and keeping clock source as crystal oscillator?

    The stability of the internal RC oscillator (fTOL_CAL_LFRC) is 500ppm, as shown in this table:
    https://infocenter.nordicsemi.com/topic/ps_nrf52840/clock.html#unique_1494269344

    So if you are using the internal RC oscillator, you should configure 500ppm. If you are using an external crystal oscillator, you can configure a lower tolerance value since the stability is better. Yes, you may set 500ppm when using an external crystal oscillator, but I don't expect it will help with your current issue, based on your observations with the Moto G7 (violation in the channel map indication).

  • Hey Kenneth,

    Thanks for your suggestion to enable slave latency. We have not seen any disconnections with reason 40 since enabling it. We have, however, seen an adverse impact on throughput, as you mentioned before. We want to set the local slave latency to 0 to avoid this impact.

    I tried setting it using the sd_ble_opt_set function with option ID BLE_GAP_OPT_LOCAL_CONN_LATENCY. I do this when we receive the GAP events BLE_GAP_EVT_CONNECTED and BLE_GAP_EVT_CONN_PARAM_UPDATE. Even though I set requested_latency to 0 in struct ble_gap_opt_local_conn_latency_t, the value at p_actual_latency is not 0 after the sd_ble_opt_set call. It is actually the same value as the slave latency set by the master.

    I'm not sure if we are doing something wrong here. Can you help us with this?

  • Here is the code snippet:

    uint32_t xErrCode;
    uint16_t usActualLatency = 0;
    ble_gap_opt_local_conn_latency_t xBleGapOptLocalConnLatency;

    /* Request a local slave latency of 0 on the connection that raised the event. */
    xBleGapOptLocalConnLatency.conn_handle = p_ble_evt->evt.gap_evt.conn_handle;
    xBleGapOptLocalConnLatency.requested_latency = 0;
    /* The softdevice writes the latency it actually applied to this location. */
    xBleGapOptLocalConnLatency.p_actual_latency = &usActualLatency;

    xErrCode = sd_ble_opt_set( BLE_GAP_OPT_LOCAL_CONN_LATENCY, &xBleGapOptLocalConnLatency );
    if( xErrCode == NRF_SUCCESS )
    {
        NRF_LOG_INFO("The actual local slave latency: %d", usActualLatency);
    }
    else
    {
        NRF_LOG_ERROR("Failed to set local slave latency. %d", xErrCode);
    }
