Losing Indication Confirmation from Central Device

Hi,

We have encountered an issue where the radio becomes unresponsive while we transfer data back and forth between the nRF52811 and a remote test device using our custom service. In the test, the remote device sends data to our service as an ATT_WRITE_REQ, and the nRF52811 responds with more data via ATT_HANDLE_VALUE_IND. These transactions happen repeatedly during a single connection. Data is buffered on the nRF52811 in both directions, since the transfers are larger than the MTU. The test connects, requests data from the nRF52811 peripheral, then disconnects, and repeats this for as long as the devices can hold up.

In the attached trace, the last thing we see is that the nRF52811 sends an indication at packet 15052, which the remote device confirms. The remote device then sends an ATT_WRITE_REQ with a portion of the data, but the nRF52811 does not respond to it with the ATT_WRITE_RSP. It appears that the nRF52811 application is waiting to see the indication confirmation, but it never does and gets stuck there. At this point the radio becomes completely unresponsive. We cannot communicate with it over the air or via UART from our printer host device.

We are using SDK 16.0.

NoAttWriteRsp.cfa

  • Thank you for the input!

    Just to be completely sure, the nRF is the peripheral, and the one with address 00:07.4d:a8:e8:ed, right? (The device sending packet 15052). I am sorry if that is a stupid question, but I am not that familiar with the frontline sniffer. 

    I agree with , that 15052 is acked by the other device, but packet 15056 is not, which you can tell from the SN and NESN sequence. Are these the last packets over the air, or did you cut the trace there? I don't see any timestamps in the trace (but that may be because I don't know where they are written, or something weird with my setup), but I assume the connection has not timed out that quick. It should be able to handle some packet loss. 
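
The SN/NESN check mentioned above can be sketched as follows. This is a minimal model of the link-layer acknowledgement rule, not code from the SoftDevice; the type and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence state kept by one side of the connection (1-bit counters). */
typedef struct {
    uint8_t transmit_seq_num;      /* SN placed in our outgoing packets   */
    uint8_t next_expected_seq_num; /* NESN placed in our outgoing packets */
} ll_seq_state_t;

typedef struct {
    bool acked_our_last_packet; /* peer's NESN has moved past our SN    */
    bool is_new_data;           /* peer's SN is the one we were expecting */
} ll_rx_result_t;

/* Hypothetical helper: evaluate the SN/NESN bits of a received packet
 * and update the local counters the way the link layer would. */
ll_rx_result_t ll_on_packet_received(ll_seq_state_t *s,
                                     uint8_t rx_sn, uint8_t rx_nesn)
{
    ll_rx_result_t r;

    rx_sn   &= 1u;
    rx_nesn &= 1u;

    /* Our last packet is acked when the peer expects a different SN next. */
    r.acked_our_last_packet = (rx_nesn != s->transmit_seq_num);
    if (r.acked_our_last_packet) {
        s->transmit_seq_num ^= 1u;
    }

    /* Same SN as before means the peer is retransmitting. */
    r.is_new_data = (rx_sn == s->next_expected_seq_num);
    if (r.is_new_data) {
        s->next_expected_seq_num ^= 1u;
    }

    return r;
}
```

Feeding it the SN/NESN bits of consecutive packets from one direction of the trace shows which packets were acknowledged and which are retransmissions.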

    I also received the log from the nRF which you sent to my colleague via email. Is it one of the "parser service event" lines that you are missing? What are the numbers at the end of these lines?

    So your application is still running at this point in time, where you lack the ATT_WRITE_REQ reply? Do you see this event on the nRF? And if so, when you reply, what function do you use to reply, and what does it return?

    There doesn't appear to be an app_error, based on the log. You didn't remove that from the log or the logging module? Since you are using the nRF52811, I assume that you are using an SDK version that will log this automatically. 

    What SDK and Softdevice version do you use?

    Do you see any hardfault? (Processor stopping while debugging, and the callstack is pointing somewhere around 0x0a60)

    If it is not too much to ask, is it possible to capture a sniffer trace using the nRF Sniffer? I am not saying it is better, but in Nordic we are a bit more familiar with it, and know where to look. 

    Edit:

    If the packets stop at the end of the sniffer trace, do you see a disconnected event on either device, the central or the peripheral? If so, what is the disconnect reason? If it is on the nRF, you can check it using:

    NRF_LOG_INFO("disconnected, reason: 0x%02x", p_ble_evt->evt.gap_evt.params.disconnected.reason);

    in the BLE_GAP_EVT_DISCONNECTED event in the ble_evt_handler().
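
To make the logged reason easier to read, a small lookup like the one below could be used. It is only a sketch: the function name is hypothetical, and the table covers just a few of the standard Bluetooth HCI error codes that commonly show up as disconnect reasons.

```c
#include <stdint.h>

/* Hypothetical helper: translate an HCI disconnect reason into text.
 * The values are standard Bluetooth HCI error codes; the list is
 * deliberately incomplete. */
const char *disconnect_reason_str(uint8_t reason)
{
    switch (reason) {
    case 0x08: return "connection timeout (supervision timeout)";
    case 0x13: return "remote user terminated connection";
    case 0x16: return "connection terminated by local host";
    case 0x22: return "LL response timeout";
    case 0x3D: return "connection terminated due to MIC failure";
    case 0x3E: return "connection failed to be established";
    default:   return "other (see the HCI error code list)";
    }
}
```

The returned string could be printed alongside the hexadecimal value, for example with a second `%s` argument in the log line above. A reason of 0x08 would match the 4-second silence seen in a trace where the link simply went quiet.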

    If you see it on the other device (central), I guess they have their own SDK, if it is not a Nordic device.

    Best regards,

    Edvin

  • We are using SDK 16 with SoftDevice S113 version 7.0.1.

    Yes, the nRF peripheral is address 00:07.4d:a8:e8:ed.

    The trace I sent is the entire over-the-air transaction. It was not stopped or cut off. The "Parser Service Events" you see in the RTT output are the events we see in our custom service event handler, which are output from its BLE event interrupt handler. They are printed in hexadecimal. Nothing was removed from the logs. Everything we provided was the complete output from the radio and remote device. We do not get a hardfault at the point where we believe the application and/or SoftDevice has crashed.

    We do not know if our application is running when the ATT_WRITE_REQ is missed, but it appears not to be, since the event never gets to the service interrupt handler. We do not see the disconnect event in the app since it appears to be hung by the time the remote device times out the connection. The other device is a Linux-based test application that is not using a Nordic radio.

    We have attempted to use the nRF Sniffer in the past with varied success. We can try again if that would be more helpful to you.

    I think I have answered all of your questions. If I have not or you need more information please let me know.

    Best Regards,

    Drew

  • I have tried to use the nRF Sniffer, but it can only decode the ATT packets on the single connection where it captures the LTK during pairing. Since our test script involves multiple connects and disconnects, we will need to continue using the Frontline sniffer.

    Regards,

    Drew

  • amarchand said:
    We have attempted to use the nRF Sniffer in the past with varied success. We can try again if that would be more helpful to you.

     As I said, it is at least more familiar. Sorry for the inconvenience.

    I noticed the timestamps in the trace (I had to scroll to the right. Sorry for not noticing this). I had also set the LE DATA filter, which stripped away the advertisements after packet 15059, which was why I suspected the cut in the trace. 

    Correct me if I am wrong here:

    15052: Indication from nRF -> central.

    15053: Acking indication (?) central -> nRF (acking packet 15052)

    15054: empty packet nRF -> central 

    15055: some data (Do you know what this is?) central -> nRF

    15056: empty packet nRF -> central(not acked)

    15057: retransmission of 15055 central -> nRF (because it didn't pick up the Ack from nRF in the previous packet)

    15058: empty packet nRF -> central

    15059: empty packet central -> nRF

    Pause 4 seconds

    15060 -> ... advertisements from nRF. 

    So two things happen at the end here. There is a retransmission of a packet, but that shouldn't matter. Then the central sends a packet that the nRF doesn't reply to (15059). It is possible that either the nRF doesn't pick up the packet from the central, or the sniffer didn't pick up the packet from the nRF. Either way, it is radio silence from then on, for exactly 4 seconds. Is this your supervision timeout?

    I suspect the central is the guilty party. In a BLE connection it is always the central that is responsible for sending the first packet in each connection event. I don't know the HW of this sniffer, but the sniffer doesn't pick up any more packets from the central, and apparently none from the nRF either.

    Since it doesn't have any packets to reply to, the nRF doesn't send anything until the connection times out (4 seconds), and then it starts advertising again.

    However, this should give the disconnected event in the nRF. If you don't see it, is it possible that you are waiting for some specific event in the ble_evt_handler() blocking the disconnected event?
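
If the application does wait for the indication confirmation, one way to keep that wait from blocking other events is to track it as state and poll it, rather than spinning inside the event handler. The following is a sketch under that assumption; all names are hypothetical and nothing here is taken from the SDK or from the application in question. The 30-second limit is the ATT transaction timeout.

```c
#include <stdbool.h>
#include <stdint.h>

#define IND_CONFIRM_TIMEOUT_MS 30000u /* ATT transaction timeout: 30 s */

/* Hypothetical tracker for one outstanding indication. */
typedef struct {
    bool     pending;    /* an indication is awaiting confirmation */
    uint32_t sent_at_ms; /* timestamp when it was queued           */
} ind_tracker_t;

/* Call after the indication was queued successfully. */
void ind_on_sent(ind_tracker_t *t, uint32_t now_ms)
{
    t->pending    = true;
    t->sent_at_ms = now_ms;
}

/* Call from the handle value confirmation event. */
void ind_on_confirmed(ind_tracker_t *t)
{
    t->pending = false;
}

/* Call from the disconnected event so a lost confirmation cannot
 * leave the application stuck after the link is gone. */
void ind_on_disconnected(ind_tracker_t *t)
{
    t->pending = false;
}

/* Poll from the main loop. Returns true once if the wait has expired,
 * so the caller can abandon the transfer. Never blocks. */
bool ind_poll_expired(ind_tracker_t *t, uint32_t now_ms)
{
    if (t->pending &&
        (uint32_t)(now_ms - t->sent_at_ms) >= IND_CONFIRM_TIMEOUT_MS) {
        t->pending = false;
        return true;
    }
    return false;
}
```

With this shape the BLE event handler always returns promptly, so a disconnected event or a new write request can still be delivered while a confirmation is outstanding.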

    If the case is that the nRF crashes (hardfault or app error), the central should keep sending empty packets for the duration of the supervision timeout. I don't know what changes you may have done to the SDK, but the APP_ERROR_CHECK(err_code) should print something in the log when it receives an err_code != 0. Do you see that? I don't see it from the log, but I don't know if you actively removed it. You can check what it looks like by putting:
    APP_ERROR_CHECK(1); somewhere in your project after you initialize your log. 

    I would check the central in the connection. I find it weird that the central goes silent. You can test this from sniffing. If they are in a connection, and you cut the power on the peripheral, you will see that the central will keep sending messages. If you cut the power on the central, you will see that the link goes silent, because the peripheral doesn't have anything to reply to. Then, after the supervision timeout, it will restart advertising.
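
To check whether a 4-second gap matches the configured supervision timeout, the arithmetic can be sketched as below. The helper names are made up for illustration, and the connection parameters in the usage note are examples, not values from this thread. On the link the timeout is expressed in 10 ms units and the connection interval in 1.25 ms units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert milliseconds to supervision timeout units (10 ms each). */
uint16_t sup_timeout_units_from_ms(uint32_t ms)
{
    return (uint16_t)(ms / 10u);
}

/* Check a supervision timeout against the Bluetooth rules:
 * - range 100 ms to 32 s (0x000A to 0x0C80 units)
 * - strictly larger than (1 + latency) * max interval * 2 */
bool sup_timeout_is_valid(uint16_t timeout_units,
                          uint16_t interval_max_units, /* 1.25 ms units */
                          uint16_t peripheral_latency)
{
    uint64_t timeout_us;
    uint64_t minimum_us;

    if (timeout_units < 0x000Au || timeout_units > 0x0C80u) {
        return false;
    }

    timeout_us = (uint64_t)timeout_units * 10000u;
    minimum_us = (1u + (uint64_t)peripheral_latency)
                 * (uint64_t)interval_max_units * 1250u * 2u;

    return timeout_us > minimum_us;
}
```

For example, a 4000 ms timeout corresponds to 400 units, which is valid with a 30 ms maximum interval (24 units) and no latency.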

    Best regards,

    Edvin

  • The point at which the nRF does not respond to packet 15059 is where it appears to have crashed, or at least become completely unresponsive. As seen in the logs and trace, it will not respond via Bluetooth, nor will it respond to commands over the UART.

    We don't make any changes to the SDK, so the APP_ERROR_CHECK should print if it fails.

    At this point we have made significant changes to both our nRF application and the central device, so I will provide an update once we are able to conduct more testing.

    Regards,

    Drew

  • Attached is a set of logs and traces from our updated devices. At the end of the logs/traces, it can be seen that the central device sends an ATT_WRITE_REQ to the nRF. The nRF app is able to process the incoming data, but it never responds to the ATT_WRITE_REQ with the corresponding ATT_WRITE_RSP, which causes the central device to time out and disconnect.

    It is not clear why the nRF is unable to respond to the ATT_WRITE_REQ.

    missing_write_response.zip
