Central and peripheral project fails to start DFU depending on order devices connect in.

Hi,

My project consists of 2 devices, both using the nRF52840 on SDK 17.1. One I will call the hub, the other the sensor. The hub serves as the central in a connection with the sensor, and the hub can simultaneously act as a peripheral in a connection to an iOS app. The hub can receive an OTA update from the iOS app. However, depending on whether the hub connected to the sensor or to the phone first, the DFU can fail to start.

If the hub first connects to the sensor (with the hub acting as central), and then connects to the phone (with the hub as peripheral) the DFU process works just fine. 

If the hub first connects to the phone, then to the sensor, the DFU will fail to start. It fails because the hub doesn't respond to a write request from the phone on the DFU service. From the phone, I see:

Message: peripheral .writeValue (0x02084466753835353330, for: 8EC90003-F315-4F60-9FB8-838830DAEA50, type: .withResponse)

and then exactly 30s later, the phone disconnects, meaning the connection was terminated because the hub didn't respond to the write with response.

On the hub in this case, I see the following:

<debug> ble_scan: Filter set on address 0x

<debug> ble_scan: 70 15 91 8B 73 E7 |p...s.

<debug> ble_scan: Scanning

<debug> nrf_ble_gq: Adding item to the request queue

<debug> nrf_ble_gq: Purging request queue with id: 1

<debug> nrf_ble_gq: GATTS Notification or Indication

<debug> nrf_ble_gq: SD GATT procedure (5) succeeded on connection handle: 3.

<debug> nrf_ble_gq: Processing the request queue...

<debug> nrf_ble_gq: Processing the request queue...

<debug> nrf_ble_gq: Processing the request queue...

PERIPHERAL: Disconnected, handle 3, reason 0x13.

In this case (failure) the line "Purging request queue with id: 1" refers to the sensors disconnecting themselves which is a process I trigger. So purging their request queues makes sense (connection handle 1 since the sensor connected after the phone).

Is there some strange interplay of the peripheral and central modules with the DFU service that makes connection handle matter? Does the hub's connection with the phone need to be the last in some internal array due to some flawed logic somewhere? I am at a bit of a loss for where to even start investigating, so any guidance would be appreciated. Unfortunately I can't really work around this issue, because as the hub/sensor network expands I won't be able to control what order everything connects in.

  • Hi

    I do not see any indication of an error here in the log (so it does not really match the 30 second timeout you see on the phone). Is the last line related to the connection with the phone? (Based on the log line, it is in the peripheral role and with a different connection handle, 3 in this case.) If so, the disconnect reason 0x13 is BLE_HCI_REMOTE_USER_TERMINATED_CONNECTION, which does not indicate an error, but rather a deliberate disconnect (with that as a reason). So as far as I can see all looks good up to this point? What happens after this?
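    For anyone reading disconnect reasons in a log like the one above, a small decoder can help. This is only a sketch: the macro and function names below are local, hypothetical stand-ins, and only three status codes from the Bluetooth Core specification are covered. In an SDK project the equivalent constants would normally come from ble_hci.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for a few HCI status codes (values per the Bluetooth Core
 * specification). The SoftDevice headers define equivalents in ble_hci.h. */
#define HCI_CONNECTION_TIMEOUT                0x08
#define HCI_REMOTE_USER_TERMINATED_CONNECTION 0x13
#define HCI_LOCAL_HOST_TERMINATED_CONNECTION  0x16

/* Hypothetical helper: human-readable text for a disconnect reason. */
static const char * disconnect_reason_str(uint8_t reason)
{
    switch (reason)
    {
        case HCI_CONNECTION_TIMEOUT:
            return "connection timeout";
        case HCI_REMOTE_USER_TERMINATED_CONNECTION:
            return "remote user terminated connection";
        case HCI_LOCAL_HOST_TERMINATED_CONNECTION:
            return "local host terminated connection";
        default:
            return "other (not covered by this sketch)";
    }
}

/* Hypothetical helper: true when one side closed the link on purpose,
 * as opposed to the link being lost. */
static bool disconnect_was_deliberate(uint8_t reason)
{
    return reason == HCI_REMOTE_USER_TERMINATED_CONNECTION ||
           reason == HCI_LOCAL_HOST_TERMINATED_CONNECTION;
}
```

    Note that a deliberate disconnect only says who closed the link, not why. In this thread the phone closed it on purpose, but it did so because its write was never answered.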

    Is there some strange interplay of the peripheral and central modules with the DFU service that makes connection handle matter?

    It should not matter and I do not see any indication of the cause in this information. Can you add a bit more logging and debug more to understand more about what happens? Why does the disconnect seem successful on the nRF but cause a timeout on the mobile side? And does the nRF reset into the bootloader even though the app does not connect, or does it not reset (I would assume no)? If not, it would make sense to look at the buttonless DFU service implementation to see at which point between writing to the characteristic and performing the reset things go wrong (and after that we can look at why).

  • I am unable to add more logging, as I have maximum log level enabled in the SDK, and I am unsure what module to look into to see where the message gets lost.

    I see how the logs I shared can be confusing since there are no timestamps, but the line "PERIPHERAL: Disconnected, handle 3, reason 0x13." occurs 30s after the "Message: peripheral .writeValue (0x02084466753835353330, for: 8EC90003-F315-4F60-9FB8-838830DAEA50, type: .withResponse)". The BLE_HCI_REMOTE_USER_TERMINATED_CONNECTION is because the phone terminated the connection after 30s. The phone is terminating per BLE spec which dictates a 30s timeout to a write with response.

    To try to simplify what I am describing, in the situation where the hub connects to the phone (with hub acting as peripheral) first, and then the hub connects to the sensor (with the hub acting as central), the following occurs:

    1. Phone issues write with response to the hub to initiate the DFU

    2. Hub never responds to the write command

    3. 30s later the phone terminates the connection per BLE spec

    4. Hub reconnects and the DFU starts

    I get the same behavior if the hub connects to the phone, then the sensor, and then the sensor is disconnected. If I restart the hub before #3 occurs, the hub reconnects and the DFU starts, so the hub is receiving the message; it just never acknowledges it, and the communication hangs since the phone issues a write with response. Establishing any additional connection after establishing the connection with the phone breaks the hub's ability to respond to the DFU request from the phone; the DFU service never gets any events.

    If the hub connects first to the sensor, then to the phone, or the hub connects to the phone, and never connects to the sensor, then what I see is

    1. Phone issues write with response to the hub to initiate the DFU

    2. Hub responds to the write command

    3. DFU starts immediately 

    There seems to be some communication breakdown between the hub and phone as a result of the sensor being connected after the phone. I can replicate the behavior using the nRF Connect iOS app. Essentially the DFU target (hub) fails to respond to the write with response, and I must rely on the write request timeout, or a manual reset, for the DFU to start.

  • Hi,

    I see. It could be interesting to have a sniffer trace to see exactly what happens on air. That said, I would suggest you add some logging and/or debug the buttonless DFU service in some other way. The most relevant file here is components/ble/ble_services/ble_dfu/ble_dfu.c, and particularly on_ctrlpt_write(). Is this called at all in the failing case after writing from the phone? If not, we need to look into what could go wrong before that point (then the sniffer trace would be particularly interesting). And if yes, then it would for instance be interesting to see if the call to sd_ble_gatts_rw_authorize_reply() ever succeeds.

  • Hi Einar,

    Your suggestion to look into on_ctrlpt_write() set me on the right path, and I figured out what the issue is, so thank you for the help, it is much appreciated!

    In the failing case, on_ctrlpt_write() was not firing. It should be called by on_rw_authorize_req(), but is blocked by the following

    if (p_ble_evt->evt.gatts_evt.conn_handle != m_dfu.conn_handle)
    {
        return;
    }


    In the failing case, the conn_handle that initiates the DFU (the phone) does not match m_dfu.conn_handle. This is because of the following two functions

    /**@brief Connect event handler.
     *
     * @param[in]   p_ble_evt   Event received from the BLE stack.
     */
    static void on_connect(ble_evt_t const * p_ble_evt)
    {
        m_dfu.conn_handle = p_ble_evt->evt.gap_evt.conn_handle;
    }
    
    
    /**@brief Disconnect event handler.
     *
     * @param[in]   p_ble_evt   Event received from the BLE stack.
     */
    static void on_disconnect(ble_evt_t const * p_ble_evt)
    {
        if (m_dfu.conn_handle != p_ble_evt->evt.gap_evt.conn_handle)
        {
            return;
        }
    
        m_dfu.conn_handle = BLE_CONN_HANDLE_INVALID;
    }


    In the failing case, a sensor connects to the hub after the phone, so on_connect() fires and changes m_dfu.conn_handle, meaning the phone can no longer initiate the DFU. If a sensor disconnects, the same thing happens. Essentially the service as written is really only set up for a system with 1 possible connection, as additional connections will always modify m_dfu.conn_handle. It is not written to be used for a multi-link system.
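    To make the effect of connection order concrete, here is a small standalone model of the handle tracking described above. It is a simplified sketch, not the SDK code: the model_* names and the dfu_model_t type are hypothetical stand-ins, and the handle values 3 (phone) and 1 (sensor) are just taken from the log earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the SDK's BLE_CONN_HANDLE_INVALID. */
#define CONN_HANDLE_INVALID 0xFFFFu

/* Simplified stand-in for the service state: only the tracked handle. */
typedef struct { uint16_t conn_handle; } dfu_model_t;

static dfu_model_t m_model = { CONN_HANDLE_INVALID };

static void model_reset(void)
{
    m_model.conn_handle = CONN_HANDLE_INVALID;
}

/* Mirrors on_connect(): every new link overwrites the tracked handle. */
static void model_on_connect(uint16_t conn_handle)
{
    m_model.conn_handle = conn_handle;
}

/* Mirrors on_disconnect(): only the tracked link clears the handle. */
static void model_on_disconnect(uint16_t conn_handle)
{
    if (m_model.conn_handle != conn_handle)
    {
        return;
    }
    m_model.conn_handle = CONN_HANDLE_INVALID;
}

/* Mirrors the guard in on_rw_authorize_req(): true if a write on this
 * link would be passed on to on_ctrlpt_write(). */
static bool model_write_accepted(uint16_t conn_handle)
{
    return conn_handle == m_model.conn_handle;
}

/* Walks through both connection orders from the thread. */
static void model_check_orders(void)
{
    const uint16_t phone  = 3;
    const uint16_t sensor = 1;

    /* Failing order: phone first, sensor second. */
    model_reset();
    model_on_connect(phone);
    model_on_connect(sensor);
    assert(!model_write_accepted(phone));

    /* Sensor drops: tracked handle becomes invalid, phone still rejected. */
    model_on_disconnect(sensor);
    assert(!model_write_accepted(phone));

    /* Working order: sensor first, phone second. */
    model_reset();
    model_on_connect(sensor);
    model_on_connect(phone);
    assert(model_write_accepted(phone));
}
```

    In this model the phone's write is only accepted when the phone was the most recent link to connect, which matches the behavior I saw on the hub.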


    I thought about modifying the service to work better with a multi-link setup, however the file contains this at the top
    /* Attention!
     *  To maintain compliance with Nordic Semiconductor ASA's Bluetooth profile
     *  qualification listings, this section of source code must not be modified.
     */
    


    Since I do not want to break compliance and risk needing additional certification testing, I have a workaround. I already have a bit of custom overhead before starting a DFU, so as part of it I will terminate all connections, and then spoof a ble_evt_t object to pass into ble_dfu_buttonless_on_ble_evt() to trigger the on_connect() function and thus set m_dfu.conn_handle correctly.
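    A rough sketch of the spoofing step is below. To keep it compilable on its own, the types, the event ID value, and the handler body are simplified stand-ins for what ble.h and ble_dfu.h provide; in a real project you would include the SDK headers and call the real ble_dfu_buttonless_on_ble_evt() instead. dfu_service_retarget() is a hypothetical helper name, and passing NULL as the context is an assumption that should be checked against how ble_dfu.c uses that argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ---- Stand-ins so this sketch builds alone; the real definitions come
 * from the SoftDevice/SDK headers and the values here are placeholders. */
#define BLE_CONN_HANDLE_INVALID 0xFFFF
#define BLE_GAP_EVT_CONNECTED   0x10

typedef struct { uint16_t evt_id; uint16_t evt_len; } ble_evt_hdr_t;
typedef struct { uint16_t conn_handle; } ble_gap_evt_t;
typedef struct
{
    ble_evt_hdr_t header;
    union { ble_gap_evt_t gap_evt; } evt;
} ble_evt_t;

/* Plays the role of m_dfu.conn_handle inside the service. */
static uint16_t m_tracked_handle = BLE_CONN_HANDLE_INVALID;

/* Stand-in for the SDK handler: only the connect case is modeled. */
static void ble_dfu_buttonless_on_ble_evt(ble_evt_t const * p_ble_evt,
                                          void * p_context)
{
    (void)p_context;
    assert(p_ble_evt != NULL);

    if (p_ble_evt->header.evt_id == BLE_GAP_EVT_CONNECTED)
    {
        m_tracked_handle = p_ble_evt->evt.gap_evt.conn_handle;
    }
}
/* ---- End of stand-ins. */

/* Hypothetical helper: feed the service a synthetic "connected" event so
 * that it tracks the phone's link again before the DFU is requested. */
static void dfu_service_retarget(uint16_t phone_conn_handle)
{
    ble_evt_t evt;

    memset(&evt, 0, sizeof(evt));
    evt.header.evt_id           = BLE_GAP_EVT_CONNECTED;
    evt.header.evt_len          = (uint16_t)sizeof(evt);
    evt.evt.gap_evt.conn_handle = phone_conn_handle;

    /* Context passed as NULL here: an assumption, see the note above. */
    ble_dfu_buttonless_on_ble_evt(&evt, NULL);
}
```

    The idea is to call something like dfu_service_retarget() with the phone's connection handle once the other links are gone, so that the guard in on_rw_authorize_req() matches the phone again without touching ble_dfu.c itself.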

    It is unfortunate that the DFU service provided as part of the SDK is not set up to work in a multi-link system, but I understand you guys can't write the SDK in a way that covers every possible use case. I appreciate you helping me troubleshoot the issue though, and hopefully this post is of use to someone else at some point.

  • Hi,

    I am glad to hear you found the cause and also found a viable workaround. And thank you for the detailed description, that can definitely be useful for others.
