Eliminate BLE_HCI_INSTANT_PASSED 0x28 BLE disconnection

Hello Nordic Team, 

We are developing a wearable device and have noticed spontaneous BLE disconnections every 1-2 hours. The error code is BLE_HCI_INSTANT_PASSED. I have checked all related questions on DevZone; this one was especially useful. According to it, there are three possible root causes: LFCLK inaccuracy, radio interference, and software that blocks BLE operations.

The problem appears mostly when the device is worn, i.e. when interference is increased by the human body, clothing, and other obstacles (e.g. a chair). When the device is on a table, the issue is very hard to reproduce. That is why I believe the root cause is radio interference, and probably a weak signal due to distance.

In any case, regarding the LFCLK, we chose the external capacitors according to the guidelines provided by Nordic. We also tried capacitors calculated with another formula given by the crystal manufacturer (it differs slightly from the Nordic formula and results in a ~5 pF difference). We also tried increasing the LFCLK accuracy setting from 20 ppm to 250 ppm, and it didn't help.

Regarding software, we use priority 2 for TWI and PWM. The PWM calls were specifically disabled throughout the tests, so they should not have any effect. TWI is used, but in blocking mode. If I am not mistaken, it should not interfere with the SoftDevice. Please correct me if I am wrong.

My understanding of this error is the following: say the master is at the 10th connection event. When it sends a command to update the channel map or the connection parameters (the "special requests"), the slave has to act on it within the next 6 connection events. So if we increase the connection interval, the overall time window covering these 6 connection events increases; if the environment is noisy or blocked, prolonging that window gives us a higher chance that the problem clears before the connection is lost. Furthermore, if we increase the slave latency from 0 to, let's say, 5, the wait window becomes 11 connection events instead of 6. Can you confirm that my understanding is correct?

Hoping that it would help, I made the following modifications:

Old configuration:

  • Min. Interval: 20 ms
  • Max. Interval: 75 ms
  • Latency: 0
  • Timeout: 15 sec

New configuration:

  • Min. Interval: 30 ms
  • Max. Interval: 90 ms
  • Latency: 5
  • Timeout: 15 sec
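
If my reasoning above is correct, this change should roughly double the worst-case window for acting on the instant (a rough estimate, assuming the instant ends up latency + 6 events ahead and taking the maximum interval):

```latex
\[
T_{\mathrm{window}} \approx (\mathrm{latency} + 6)\times \mathrm{connInterval}:
\qquad (0+6)\times 75\,\mathrm{ms} = 450\,\mathrm{ms}
\;\longrightarrow\;
(5+6)\times 90\,\mathrm{ms} = 990\,\mathrm{ms}
\]
```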

But it didn't help. 

I want to ask the following:

  1. If my understanding of how latency helps with this issue is correct, can I increase the latency to a much higher value, such as 30 or 40? This would mean that the connection stays alive even if packets are lost 40 times in a row, right?
  2. Instant Passed occurs after the 6th connection event. Can we increase this "6" somewhere? Can the central modify it? Our central is a smartphone.
  3. What would you recommend as a starting point for the connection interval min/max values?

Please let me know if you have any other ideas.

I would greatly appreciate your assistance. 

  • Hi,

    Regarding software, we use priority 2 for TWI and PWM. The PWM calls were specifically disabled throughout the tests, so they should not have any effect. TWI is used, but in blocking mode. If I am not mistaken, it should not interfere with the SoftDevice. Please correct me if I am wrong.

    Correct, time-critical events such as keeping the connection alive will not be blocked by application code running at priority 2.

    My understanding of this error is the following: say the master is at the 10th connection event. When it sends a command to update the channel map or the connection parameters (the "special requests"), the slave has to act on it within the next 6 connection events. So if we increase the connection interval, the overall time window covering these 6 connection events increases; if the environment is noisy or blocked, prolonging that window gives us a higher chance that the problem clears before the connection is lost. Furthermore, if we increase the slave latency from 0 to, let's say, 5, the wait window becomes 11 connection events instead of 6. Can you confirm that my understanding is correct?

    Yes, this should be correct.

    If my understanding of how latency helps with this issue is correct, can I increase the latency to a much higher value, such as 30 or 40? This would mean that the connection stays alive even if packets are lost 40 times in a row, right?

    Yes, you can increase it up to 59 with the connection parameters specified under the new configuration.

    Instant Passed occurs after the 6th connection event. Can we increase this "6" somewhere? Can the central modify it? Our central is a smartphone.

    No, this is part of the BLE spec.

    What would you recommend as a starting point for the connection interval min/max values?

    That depends on several factors, such as current consumption, etc. See this section.

    Can you share the datasheet and full part number of both your HFXO and LFXO crystal? Also, what is the size of the parallel load capacitors of both the HFXO and LFXO crystal?

    regards

    Jared 

  • Hi Jared, 

    Thanks a lot for your assistance. I will try to play with the connection interval values and further increase the latency.

    Regarding the LFXO, it is the ECS-.327-12.5-12-TR. Links to the datasheet and to the design guide. The latter document has the formula for the load capacitor calculation on its last page. Actually, when I googled capacitor calculation formulas for crystals, most sources use the formula from the crystal document rather than the one in the Nordic documentation. We initially used the 12 pF caps from the reference layout, but later noticed that those values were for a 9 pF crystal and therefore incorrect, since we are using a 12.5 pF crystal. We then tried capacitors calculated with the Nordic formula: the required value was 21 pF, and we used 22 pF, which we had available. After that we tried caps calculated with the crystal manufacturer's formula: the required value was 15 pF, and we used that. None of these solved the problem.
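
    For reference, the relation I used is the standard parallel-load formula; the ~2 pF stray capacitance is my own assumption for our layout, and the gap between the two results presumably comes from different assumed stray capacitance:

    ```latex
    \[
    C_L = \frac{C_1 C_2}{C_1 + C_2} + C_{\mathrm{stray}}
    \;\;\xrightarrow{\;C_1 = C_2\;}\;\;
    C_1 = C_2 = 2\,(C_L - C_{\mathrm{stray}})
    \approx 2\,(12.5\,\mathrm{pF} - 2\,\mathrm{pF}) = 21\,\mathrm{pF}
    \]
    ```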

    We are using the BMD-350 module, and it has its own 32 MHz clock. From the module datasheet:

    "BMD-350 requires two clocks, a high frequency clock and a low frequency clock. The high frequency clock is provided on-module by a high-accuracy 32 MHz crystal as required by the nRF52832 for radio operation. The low frequency clock can be provided internally by an RC oscillator or synthesized from the fast clock, or externally by a 32.768 kHz crystal. An external crystal provides the lowest power consumption and greatest accuracy. Using the internal RC oscillator with calibration provides acceptable performance for Bluetooth low energy applications at a reduced cost and slight increase in power consumption."

    Unfortunately, I was not able to find any further information about the HF crystal it uses in the datasheet. We tried to open up the module to read the part number, but we were not able to identify the model. I will keep looking and will update you if I manage to find it.

    Waiting for your feedback.  

  • No, we didn't perform antenna tuning; in fact, this is the first time I have heard about it. I have only considered increasing the peripheral TX power from the default 0 dBm to 4 dBm, but haven't tried it yet. What is the benefit of tuning and how is it done? Should it be done for nRF52 modules as well?

  • Hi,

    I'm sorry, I read your previous reply a bit too fast. No, it should not be done if you're using a module. Could you try increasing the TX power and see if increasing the latency mitigates the issue?
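
    If it helps, setting the TX power looks roughly like this (just a sketch; the exact signature depends on your SoftDevice version, and conn_handle is your application's connection handle):

    ```c
    #include "ble_gap.h"
    #include "app_error.h"

    // Sketch: request +4 dBm for the active connection (S132 v6.x / SDK 15 signature;
    // older SoftDevices take only the dBm value: sd_ble_gap_tx_power_set(4)).
    static void tx_power_update(uint16_t conn_handle)
    {
        APP_ERROR_CHECK(sd_ble_gap_tx_power_set(BLE_GAP_TX_POWER_ROLE_CONN, conn_handle, 4));
    }
    ```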

    regards

    Jared 

  • Ok here are the initial results:

    First, I updated the connection parameters to the following:

    • Min/Max connection interval: 60ms
    • Latency: 30
    • Timeout: 15 seconds

    Beyond these values I would be outside the Apple design guidelines for connection parameters (actually, I am already beyond the guideline for the timeout, but haven't noticed any issues with it yet). I tried to increase these parameters further, but the master wouldn't allow them and overrode them with smaller values. In any case, the 0x28 error occurred again with these parameters.

    Then I increased the TX power from 0 dBm to 4 dBm, and this time we got error 0x08, which is a timeout.

    I checked the sniffer logs and saw that the slave replies to the master only every 30th event due to the latency, i.e. roughly every 2 seconds. Given the 15-second timeout, if the slave fails to reply 7 times in a row, it gets disconnected. Since 7 packets is a pretty low number for a timeout, I believe this is normal behavior.

    Now, going back to the first test, where the power was 0 dBm with latency 30, it would have taken 30 + 6 = 36 consecutive lost packets for the 0x28 error to happen. I doubt that radio interference alone can explain that. To elaborate, our test setup is a person wearing the device, sitting on a chair, with the phone on the table right in front of them; the distance is small. In this setup, when I check the sniffer logs, the overall packet loss ratio is quite satisfactory, with at most 2-3 consecutive lost packets. For the disconnection to occur, the device would suddenly have to lose 36 packets in a row under the same testing conditions. I highly doubt that the same radio interference conditions would produce such a different loss rate.
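
    In time terms (rough numbers at the 60 ms interval):

    ```latex
    \[
    \underbrace{36 \times 60\,\mathrm{ms} \approx 2.2\,\mathrm{s}}_{\text{continuous loss needed for 0x28}}
    \qquad\text{vs.}\qquad
    \underbrace{3 \times 60\,\mathrm{ms} = 0.18\,\mathrm{s}}_{\text{worst run seen in the sniffer logs}}
    \]
    ```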

    Perhaps, then, the problem is in software, and after 1-2 hours the master and slave somehow get out of sync in the BLE communication. If you remember, I mentioned that we have TWI at priority 2 in our code. I forgot to mention that this TWI is called inside an app_timer interrupt, not in the main context. We plan to use the scheduler, but haven't done so yet. This interrupt takes ~250 ms and is called every second. Could it be that it somehow disturbs the BLE timing so that, after some time, the communication gets out of sync?

  • We did an additional test with the following parameters:

    • Min/max connection interval: 45 ms
    • Latency: 9
    • Timeout: 15 s
    • TX power: +4 dBm

    We still received the Instant Passed error after approximately 2 hours. I believe the problem is not in the connection parameters.

    We have now decided to test with a different phone model (all previous tests were conducted with a Redmi Note 8T). Perhaps the problem is the phone's clock accuracy or something else. I will let you know the results.

  • Hi there,

    Rafig said:

    Beyond these values I would be outside the Apple design guidelines for connection parameters (actually, I am already beyond the guideline for the timeout, but haven't noticed any issues with it yet). I tried to increase these parameters further, but the master wouldn't allow them and overrode them with smaller values. In any case, the 0x28 error occurred again with these parameters.

    Then I increased the TX power from 0 dBm to 4 dBm, and this time we got error 0x08, which is a timeout.

    I checked the sniffer logs and saw that the slave replies to the master only every 30th event due to the latency, i.e. roughly every 2 seconds. Given the 15-second timeout, if the slave fails to reply 7 times in a row, it gets disconnected. Since 7 packets is a pretty low number for a timeout, I believe this is normal behavior.

    Now, going back to the first test, where the power was 0 dBm with latency 30, it would have taken 30 + 6 = 36 consecutive lost packets for the 0x28 error to happen. I doubt that radio interference alone can explain that. To elaborate, our test setup is a person wearing the device, sitting on a chair, with the phone on the table right in front of them; the distance is small. In this setup, when I check the sniffer logs, the overall packet loss ratio is quite satisfactory, with at most 2-3 consecutive lost packets. For the disconnection to occur, the device would suddenly have to lose 36 packets in a row under the same testing conditions. I highly doubt that the same radio interference conditions would produce such a different loss rate.

    Thanks for the update. I have some suggestions below:

    Rafig said:
    Perhaps, then, the problem is in software, and after 1-2 hours the master and slave somehow get out of sync in the BLE communication. If you remember, I mentioned that we have TWI at priority 2 in our code. I forgot to mention that this TWI is called inside an app_timer interrupt, not in the main context.

    Can you clarify this part? How is the TWI implemented? It's a bit confusing that you specify an interrupt priority yet run the TWI in blocking mode, which doesn't use interrupts. Could you share this part of your code? I'm guessing that you're spinning in a loop in the app_timer interrupt handler, waiting for a flag to be set by the TWI interrupt handler, as we do in our examples?

    Either way, running high-priority TWI communication from interrupt context is definitely not recommended. As previously mentioned, timing-critical events should not be blocked, but I suspect that SoftDevice operation is indirectly affected because SoftDevice application events will be blocked. I'm not exactly sure which events, but I have a suggestion that will show whether this is the issue or not:

    Could you try moving the TWI communication to the main context? You can do this by setting a flag in the app_timer interrupt handler that signals to the main loop that it can start the TWI transaction, and clearing the flag again when the transaction has completed. Also try lowering the TWI priority.
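
    Something along these lines (just a sketch, assuming the legacy nrf_drv_twi driver in blocking mode; the TWI instance, sensor address, and buffer size are placeholders for your own code):

    ```c
    #include <stdbool.h>
    #include <stdint.h>
    #include "app_timer.h"
    #include "nrf_drv_twi.h"
    #include "nrf_soc.h"

    #define SENSOR_ADDR 0x48                                    // placeholder I2C address
    static const nrf_drv_twi_t m_twi = NRF_DRV_TWI_INSTANCE(0);
    static volatile bool m_twi_pending = false;

    // app_timer handler (registered with app_timer_create()/app_timer_start() as before):
    // it runs in interrupt context, so it only sets a flag.
    static void sample_timer_handler(void * p_context)
    {
        m_twi_pending = true;
    }

    // Call this from the main loop instead of sleeping directly.
    static void idle_state_handle(void)
    {
        if (m_twi_pending)
        {
            m_twi_pending = false;
            uint8_t buf[6];
            // The blocking transfer now runs in thread context, so it cannot
            // hold off SoftDevice application events for ~250 ms.
            (void)nrf_drv_twi_rx(&m_twi, SENSOR_ADDR, buf, sizeof(buf));
            // ... process buf ...
        }
        else
        {
            (void)sd_app_evt_wait();   // sleep until the next event/interrupt
        }
    }
    ```

    That way the app_timer handler returns immediately, and the ~250 ms transaction no longer runs at interrupt priority.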

    Rafig said:
    This interrupt takes ~250 ms and is called every second.

    Also, 250 ms is a lot. Is there a specific reason for this, e.g. file transfers?

    regards

    Jared 

  • Hi Jared, 

    Sorry for the confusion. It is indeed blocking mode; I mentioned priority 2 just in case it had some internal use inside the TWI driver, but as I understand it, it has no effect since we don't pass an event handler at initialization.
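
    In other words, our init is roughly like this (a sketch, assuming the legacy nrf_drv_twi driver; the pin numbers are placeholders and macro names may differ slightly between SDK versions):

    ```c
    #include "nrf_drv_twi.h"
    #include "app_error.h"

    #define SCL_PIN 27   // placeholder
    #define SDA_PIN 26   // placeholder

    static const nrf_drv_twi_t m_twi = NRF_DRV_TWI_INSTANCE(0);

    static void twi_init(void)
    {
        nrf_drv_twi_config_t config = NRF_DRV_TWI_DEFAULT_CONFIG;
        config.scl = SCL_PIN;
        config.sda = SDA_PIN;
        // NULL event handler -> the driver runs in blocking mode, so the
        // interrupt_priority field of the config is effectively unused.
        APP_ERROR_CHECK(nrf_drv_twi_init(&m_twi, &config, NULL, NULL));
        nrf_drv_twi_enable(&m_twi);
    }
    ```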

    To test, we completely disabled all TWI-related activity in our software and disabled the timers that ran every second. As a result, our application was doing nothing except keeping the BLE connection alive and listening on a few active GPIOTE channels (no interrupts were actually fired).

    We still got a disconnection with code 0x28. During this test we also tried a different LFXO crystal, a 9 pF one, with 12 pF external capacitors (as in the reference design).

    So it seems the issue is neither in the application nor in the clock.

    During this test we had the following parameters:

    • Min/max interval: 45 ms
    • Latency: 6
    • Timeout: 15 s

    The error happened ~15 minutes after the start of the test. The phone was the one we had used in our previous tests (Redmi Note 8T). I will try increasing the LFCLK accuracy setting to 250 or 500 ppm again to see if it helps.
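
    For reference, assuming an SDK 15-style project, that means changing these sdk_config.h entries (the names may differ in other SDK versions):

    ```c
    // sdk_config.h -- LF clock source and accuracy reported to the SoftDevice.
    // Source: 1 = external 32.768 kHz crystal (XTAL).
    // Accuracy: 7 = 20 ppm, 0 = 250 ppm, 1 = 500 ppm.
    #define NRF_SDH_CLOCK_LF_SRC       1
    #define NRF_SDH_CLOCK_LF_ACCURACY  0   // try 250 ppm for this experiment
    ```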

  • Also, I wanted to ask about the nRF52 module we use (BMD-350). It is physically very small, actually one of the smallest nRF52 modules I have ever seen. Could it be that its antenna is somehow weak because of these dimensions?

  • Hi,

    The antenna size will affect the antenna performance, but I'm not sure whether that is the culprit in your case. Do you have an nRF52832 DK available? Could you try flashing your project to our development kit and comparing its performance with your custom board? Is there any significant difference? The result should indicate whether this is a HW or SW problem.

    regards

    Jared  

  • Hello Jared.

    Sorry for the small delay, we were conducting various tests. 

    Unfortunately, testing our code on the DK is complicated because of the many peripherals that need to be connected.

    We tried the opposite: we took an example from the SDK (ble_app_template) and flashed it onto our hardware. We made some minor changes to the example to indicate the disconnection status, and used the following connection parameters (set as shown in the snippet after the list):

    • Min/max interval: 15 ms
    • Latency: 15
    • Timeout: 30 s
    • LFCLK accuracy: 20 ppm
    • TX power: 0 dBm (default)
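
    For reference, these are roughly the defines we changed in the example's main.c (macro names as in the SDK template; the LFCLK accuracy and TX power were left as in the example's sdk_config.h):

    ```c
    // Connection parameters used for the ble_app_template test build.
    #define MIN_CONN_INTERVAL   MSEC_TO_UNITS(15, UNIT_1_25_MS)   // 15 ms
    #define MAX_CONN_INTERVAL   MSEC_TO_UNITS(15, UNIT_1_25_MS)   // 15 ms
    #define SLAVE_LATENCY       15
    #define CONN_SUP_TIMEOUT    MSEC_TO_UNITS(30000, UNIT_10_MS)  // 30 s
    ```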

    The results are as follows:

    When the device was tested in standard conditions (worn by the user), we saw Instant Passed errors.

    Then we tested the same device, but this time it was not worn; instead, the phone and the device were both on the table with no obstacles between them. This time we didn't see any disconnection for ~9 hours.

    In the datasheet of the module we use, I came across the following text:

    "Care should be taken when designing and placing the BMD-350 into an enclosure. Metal should be kept clear from the antenna area, both above and below. Any metal around the module can negatively impact RF performance.

    The module is designed and tuned for the antenna and RF components to be in free air. Any potting, epoxy fill, plastic over-molding, or conformal coating can negatively impact RF performance and must be evaluated by the customer."

    I am not sure how to interpret this information. Like most electronic devices, our PCB is enclosed in a plastic case, inside which the antenna is in free air. It is not clear to me whether they mean that the antenna should not be directly covered by another material, or that the antenna should be outside the plastic case.

    Waiting for your feedback. 

  • Hi,

    Rafig said:
    I am not sure how to interpret this information. Like most electronic devices, our PCB is enclosed in a plastic case, inside which the antenna is in free air. It is not clear to me whether they mean that the antenna should not be directly covered by another material, or that the antenna should be outside the plastic case.

    I think this question should be directed to the module maker. But in general, when you tune an antenna you have to do it on a physical body if that is the use case for the device. If the module maker didn't take this into consideration (they probably didn't) when they tuned the antenna, then it's very likely that you'll get worse results when you put your device next to a body. The idea behind testing with the DK was to see whether this issue is HW dependent or not.

    Adjusting the connection parameters might mitigate the issue:

    Could you try increasing either the interval or the latency so that: 

    interval × (latency + 6) > 1 s?
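
    For example, with your 45 ms interval a latency of at least 17 would satisfy this:

    ```latex
    \[
    45\,\mathrm{ms} \times (17 + 6) = 1035\,\mathrm{ms} > 1\,\mathrm{s}
    \]
    ```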

    Also, my colleague Kenneth has some useful suggestions in this case.

    regards

    Jared 
