
nRF8001 freezing

Hello,

In a quite particular configuration, the nRF8001 sometimes stops responding. The only way to make it work again is to do a hardware reset using the reset pin of the transceiver. Is there a workaround to prevent the freezing?

HW-Setup: MSP430 as a master device on the SPI bus. nRF8001 module by Insight SIP (ISP091201)

How to reproduce the problem: Using the HID setup of our device, we pair with a mobile phone (e.g. iPhone SE (2018), iOS 14.1). Then a new setup for unpaired connections without security is loaded. The new setup contains several proprietary services that are completely different from the previous HID service. Seeing the same MAC address, the phone tries to restore the connection with our device but disconnects after about seven connection intervals. The MSP430 reads the dynamic data and restarts advertising for a new connection (still without pairing). The phone recognizes the address and connects again... After some iterations of this process, no connection is established anymore (the active signal is no longer available) and the nRF8001 becomes unresponsive: it no longer accepts commands. When the nREQUEST line is pulled low, the nRF8001 does not pull the nREADY line low. The last successful command is a connect command that is confirmed by a command response event (without error) sent by the nRF8001.

I will attach a compressed trace from the Saleae logic analyzer containing all relevant signals: 7450.nRF8001_freeze_20201124.zip

A simple solution is to delete the bonding information of the phone which then stops its connection attempts. This works, but the goal is to have a foolproof device with the nRF8001. Did we miss out on something? Any suggestions?

Best regards
David

  • Where in the logic trace does the chip fail to respond? Is it only at the end of the logic trace, where the REQN line is pulled low, or are there other occurrences?

    Could you add a workaround for Anomaly 7 in PAN025 (Product Anomaly Notification v1.5), so that you have a delay before you start advertising? I am not sure whether this is related, but if I remember correctly it could lead to the chip not responding if the random numbers are not generated fast enough for advertising to start within a certain period. Are you testing this at room temperature?

    Also, there are several occurrences in the trace that show the master sending a connect command in the middle of the read dynamic data command, resulting in the nRF8001 NACKing this command (all the 0x83 on the right side are NACKs / ACI_STATUS_ERROR_DEVICE_STATE_INVALID).

  • Hello run-ar,

    The only failure occurs at the end of the logic trace. At about 246s from start, NREQ is pulled low by the controller, but the nRF8001 does not respond (NRDY stays high).

    The trace was recorded at room temperature and the nRF8001 is not in sleep mode (condition for PAN_025 #7 ?). However, when printing additional debug information on a serial interface the freezing does not occur. Maybe the delay introduced by doing so is sufficient for the RNG. What is the expected delay at room temperature?
    We will check what happens with a delay before issuing the connect command.

    You are right, the controller should wait for the end of the transmission of dynamic data before sending another connect command. Thank you for pointing out this error.
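The sequencing fix acknowledged here can be sketched as a small command gate that holds back a connect request while dynamic data is still being read. This is only a minimal sketch in C: all names (`cmd_gate_t`, `cmd_gate_request_connect`, and so on) are hypothetical and not part of any Nordic library, and how the end of the dynamic data read is detected is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of any Nordic library: defers a Connect
 * request while a Read Dynamic Data sequence is still running. */
typedef struct {
    bool dyn_read_in_progress; /* Read Dynamic Data started, not yet finished */
    bool connect_deferred;     /* Connect was requested during the read */
} cmd_gate_t;

void cmd_gate_init(cmd_gate_t *g)
{
    g->dyn_read_in_progress = false;
    g->connect_deferred = false;
}

/* Call when the first Read Dynamic Data command is sent. */
void cmd_gate_dyn_read_started(cmd_gate_t *g)
{
    g->dyn_read_in_progress = true;
}

/* Returns true if Connect may be sent now; otherwise remembers the request. */
bool cmd_gate_request_connect(cmd_gate_t *g)
{
    if (g->dyn_read_in_progress) {
        g->connect_deferred = true;
        return false;
    }
    return true;
}

/* Call when the last dynamic data response has arrived.
 * Returns true if a deferred Connect should be sent now. */
bool cmd_gate_dyn_read_finished(cmd_gate_t *g)
{
    bool send_now = g->connect_deferred;
    g->dyn_read_in_progress = false;
    g->connect_deferred = false;
    return send_now;
}
```

With a gate like this, the controller would send the connect command only when `cmd_gate_request_connect` or `cmd_gate_dyn_read_finished` returns true, which avoids the 0x83 responses seen in the trace.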

  • At room temperature it would typically need about 128 ms to generate the random numbers needed to start advertising.

  • Hello run-ar,

    Even though the nRF8001 is not in sleep mode, we added a delay of about 130 ms before each connect command. Unfortunately, it still freezes after some time. See the new trace from the Saleae logic analyzer. The freeze occurs at the end of the trace, at about 174.3 s, where the NREQ signal is pulled low...

    nRF8001_freeze_20201202.zip

    At 47.75 s there is a hardware error event, related to file ll_lm.s0.c at line 0x3b02. Is this an indicator of some misuse?

    Best regards
    David

  • Hi,

    Do you have a sniffer trace that shows if the device starts advertising, and if there could be a connection request from the peer?

    Do you monitor the active signal? If so, would it be possible for you to reset the chip if there is no activity on the active line for a given period of time?

    I am wondering if you could be hitting PAN issue 6, as I am not sure the problem description is correct for this one. I am still looking at our issues database, but I think there is a state where the chip could become unresponsive due to this issue, so it is recommended that the application MCU monitors the active line so it can reset the chip using the reset line in case the chip becomes unresponsive. However, the active line will only be toggled if the advertising interval is longer than 30 ms.

  • Finally, I found what I was looking for. I am sorry for the long wait:

    Additional information for this issue, not included in the PAN text:
    It is necessary to have a short delay after the device started event before clocking out the HW error event. If you have a fast implementation and the HW error event is clocked out too soon, the nRF8001 will issue the device started event and the HW error event again and will be stuck in a loop. In ble-sdk-arduino we have a 20 ms delay to make sure this doesn't happen (20 ms was picked by trial and error, so I don't know what the actual problem is).
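The delay described above could be enforced on the controller side with a small guard like the following. It is only a sketch under assumptions: `start_guard_t` and its functions are hypothetical names, `now_ms` is assumed to come from a free-running millisecond tick on the MCU, and the 20 ms value is the empirical one quoted for ble-sdk-arduino.

```c
#include <stdbool.h>
#include <stdint.h>

/* 20 ms is the empirically chosen value mentioned for ble-sdk-arduino. */
#define POST_START_DELAY_MS 20u

/* Hypothetical helper: holds back event reads right after Device Started. */
typedef struct {
    bool     armed;         /* a Device Started event was just received */
    uint32_t started_at_ms; /* timestamp of that event */
} start_guard_t;

/* Call when a Device Started event has been clocked out. */
void start_guard_on_device_started(start_guard_t *g, uint32_t now_ms)
{
    g->armed = true;
    g->started_at_ms = now_ms;
}

/* Returns true once the next event (e.g. the HW error event) may be
 * clocked out. Unsigned subtraction keeps this correct across tick
 * counter wraparound. */
bool start_guard_may_read(start_guard_t *g, uint32_t now_ms)
{
    if (g->armed &&
        (uint32_t)(now_ms - g->started_at_ms) < POST_START_DELAY_MS) {
        return false;
    }
    g->armed = false;
    return true;
}
```

The main loop would check `start_guard_may_read` before pulling REQN low to fetch a pending event.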

    Note that in some cases the nRF8001 can enter a non responsive state:
    The problem when the nRF8001 stops responding is that we sometimes hit an edge condition, where the pretick processing takes more time than we have calculated due to a sparse channel map. This happens when the processing done in pretick finishes exactly before the tick comes. The assert does not trigger, but slightly later the just arrived tick is cleared when the stack goes to sleep so that there is nothing to wake the stack again.

    The two other scenarios are:
    Processing done in pretick (including computing channel map) finishes in time, i.e. before tick comes. The stack goes to sleep, and is woken by the tick. All is well.
    Processing done in pretick does not finish in time (e.g. due to a sparse channel map, that requires more processing). The assert checking whether tick has occurred triggers.

    The following possible workarounds are suggested:
    Change the external controller so that it always uses an advertising interval in the range 40 ms to 4 s, and restart the nRF8001 if the ACTIVE signal is absent for more than ~5 s.

    Since the ACTIVE signal is not available when the advertising interval is less than 30 ms, the workaround that depends on the ACTIVE line cannot be treated as a complete workaround. In this case a timer can be used to check whether the nRF8001 is alive during the advertising phase, i.e. ACI Connect or ACI Bond. This timer can last 2 seconds longer than the timeout used in ACI Connect or ACI Bond. When the timer expires, the controller checks whether the ACI Connected Event or the ACI Disconnected Event (Reason = Timed out) was received. If either of the two events was received, the nRF8001 is operating normally. If no event was received, the nRF8001 needs to be pin reset to recover and continue advertising. The nRF8001 is pin reset by holding its RESET line low for at least 200 ns and then driving it high.
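The timer-based check described above can be sketched roughly as follows. This is an illustration only, under assumptions: the `adv_watchdog_*` names are hypothetical, `now_ms` is assumed to be a free-running millisecond tick, and the actual pin reset and event parsing are left to the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Margin added to the ACI Connect / ACI Bond timeout, as suggested above. */
#define ADV_WATCHDOG_MARGIN_MS 2000u

typedef enum { ADV_WAITING, ADV_ALIVE, ADV_NEEDS_PIN_RESET } adv_status_t;

/* Hypothetical liveness timer for the advertising phase. */
typedef struct {
    uint32_t start_ms;   /* when ACI Connect / ACI Bond was issued */
    uint32_t window_ms;  /* command timeout plus the 2 s margin */
    bool     event_seen; /* Connected or Disconnected (timed out) received */
} adv_watchdog_t;

/* Call right after sending ACI Connect or ACI Bond with a finite timeout. */
void adv_watchdog_start(adv_watchdog_t *w, uint32_t now_ms, uint16_t timeout_s)
{
    w->start_ms = now_ms;
    w->window_ms = (uint32_t)timeout_s * 1000u + ADV_WATCHDOG_MARGIN_MS;
    w->event_seen = false;
}

/* Call when a Connected Event or a Disconnected Event with reason
 * "timed out" arrives. */
void adv_watchdog_on_event(adv_watchdog_t *w)
{
    w->event_seen = true;
}

/* Poll from the main loop. ADV_NEEDS_PIN_RESET means the window elapsed
 * without either event, so the RESET line should be pulsed low. */
adv_status_t adv_watchdog_poll(const adv_watchdog_t *w, uint32_t now_ms)
{
    if (w->event_seen) {
        return ADV_ALIVE;
    }
    if ((uint32_t)(now_ms - w->start_ms) >= w->window_ms) {
        return ADV_NEEDS_PIN_RESET;
    }
    return ADV_WAITING;
}
```

Keeping the decision logic free of hardware access makes it easy to test on a host before wiring it to the MSP430 timer and GPIO code.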

    Caveat:
    When ACI Connect is used with an infinite timeout, this workaround cannot be used. It is suggested that ACI Connect with an infinite timeout be broken up into finite timeouts that are repeated. For example, if

    ACI Connect ( 0 /* infinite timeout */, advInterval /* advertising interval */ ) is used

    it can be split up into

    ACI Connect ( timeout /* in seconds */, advInterval /* advertising interval */ )

    and this can be called again each time advertising times out, i.e. every timeout seconds.

    This caveat does not exist with ACI Bond as the maximum timeout value for ACI bond is 180 seconds.
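One way to express the split into repeated finite timeouts is a small decision function that the controller consults whenever an event arrives. This is a sketch under assumptions: the enum and function names are hypothetical, the 30 s chunk length is an arbitrary example value, and sending the actual ACI Connect command or pulsing RESET is left to the application.

```c
#include <stdint.h>

/* Assumed example chunk length, chosen only for illustration. */
#define ADV_CHUNK_TIMEOUT_S 30u

typedef enum {
    ADV_EVT_CONNECTED,        /* Connected Event received */
    ADV_EVT_ADV_TIMED_OUT,    /* Disconnected Event, reason timed out */
    ADV_EVT_LIVENESS_EXPIRED  /* timeout + 2 s passed with no event */
} adv_evt_t;

typedef enum {
    ADV_ACT_NONE,
    ADV_ACT_SEND_CONNECT,     /* re-issue ACI Connect with a finite timeout */
    ADV_ACT_PIN_RESET         /* pulse the RESET line, then set up again */
} adv_act_t;

/* Replace a requested timeout of 0 (infinite) with a finite chunk. */
uint16_t adv_effective_timeout_s(uint16_t requested_s)
{
    return (requested_s == 0u) ? (uint16_t)ADV_CHUNK_TIMEOUT_S : requested_s;
}

/* Decide what the controller should do next while "advertising forever". */
adv_act_t adv_next_action(adv_evt_t evt)
{
    switch (evt) {
    case ADV_EVT_ADV_TIMED_OUT:
        return ADV_ACT_SEND_CONNECT;
    case ADV_EVT_LIVENESS_EXPIRED:
        return ADV_ACT_PIN_RESET;
    case ADV_EVT_CONNECTED:
    default:
        return ADV_ACT_NONE;
    }
}
```

Because every chunk has a finite timeout, the liveness timer always has a deadline to check against, which is what the infinite-timeout variant lacks.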

  • Hello run_ar,

    Thank you for the detailed answer.

    Our system is set to clock out events as soon as possible at a speed of about 60 kHz (fCLK). Therefore, the HardwareErrorEvent following the DeviceStartedEvent is clocked out immediately. The clock signal is in a low state for about 0.2 ms between the two events.

    Up to now, we didn’t experience problems with these settings. At least we know now how to handle it if there were an infinite loop after a hardware error event.

    As you can see on the traces captured with a logic analyzer, the active signal is not asserted during advertising (advertising interval 100 ms). Is this normal behaviour? So, the workaround based on the active signal might not work correctly.

    The second suggestion, using advertising with a timeout, should be OK. Currently, we often have to update the data in advertising packets, and we can detect the problem earlier if the nRF8001 doesn't pull the RDYN line low after the REQN line goes low.

    Best regards
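The early detection mentioned in this reply (RDYN not going low after REQN is pulled low) could be sketched as below. Everything here is an assumption for illustration: `handshake_check` is a hypothetical name, the pin state and millisecond tick are supplied by the caller, and the 100 ms limit is a placeholder that should be checked against the nRF8001 product specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit for illustration; verify against the nRF8001
 * product specification before relying on it. */
#define RDYN_TIMEOUT_MS 100u

typedef enum { HS_WAITING, HS_READY, HS_FROZEN } hs_status_t;

/* Hypothetical check, polled after REQN has been pulled low.
 * reqn_low_at_ms: tick value when REQN was pulled low
 * now_ms:         current tick value
 * rdyn_is_low:    current state of the RDYN pin, read by the caller */
hs_status_t handshake_check(uint32_t reqn_low_at_ms, uint32_t now_ms,
                            bool rdyn_is_low)
{
    if (rdyn_is_low) {
        return HS_READY;   /* nRF8001 answered, the SPI transfer can start */
    }
    if ((uint32_t)(now_ms - reqn_low_at_ms) >= RDYN_TIMEOUT_MS) {
        return HS_FROZEN;  /* no answer in time: candidate for a pin reset */
    }
    return HS_WAITING;
}
```

On `HS_FROZEN` the controller would release REQN and pulse the RESET line, as in the workaround described earlier in the thread.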
