This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Bonding with FDS and DFU sometimes fail

Hi

I am using nRF52DK, SDK 15.0, SD 6.0.0, multiperipheral based example with secure bootloader, FDS (7 FDS_VIRTUAL_PAGES). Uploading new software works (not tested thorougly).

Project building with SES, then flashing with a script:

SET family= NRF52
SET application= C:\repo\pikflometr\multiperipheral\pca10040\s132\ses\Output\Release\Exe\ble_app_multiperipheral_pca10040_s132.hex
SET application-version= 1
SET soft_device= C:\repo\pikflometr\lib\wszystkoCzegoPotrzebujesz\s132_nrf52_6.0.0_softdevice.hex
SET bootloader= C:\repo\pikflometr\secure_bootloader\pca10040_ble\ses\Output\Release\Exe\secure_bootloader_ble_s132_pca10040.hex
SET bootloader-version= 1
SET bl-settings-version= 1
SET settings= settings_1.hex
SET dfu_client= dfu_client_1.hex
SET nrfutil_location= C:\repo\pikflometr\lib
cd %nrfutil_location%
nrfjprog.exe --family %family% --program %soft_device% --chiperase --verify
nrfutil settings generate --family %family% --application %application% --application-version %application-version% --bootloader-version %bootloader-version% --bl-settings-version %bl-settings-version% %settings%
mergehex -m %application% %settings% -o %dfu_client%
nrfjprog.exe --family %family% --program %dfu_client% --verify
nrfjprog.exe --reset --family %family% --program %bootloader% --verify

then F5 to begin debugging with SES. 

Sometimes (maybe 15% of times) when I try to pair (using master Androd device) I get:

<error> app: ERROR 2 [NRF_ERROR_SOFTDEVICE_NOT_ENABLED] at C:\repo\pikflometr\multiperipheral\main.c:1808 PC at: 0x00033AA7

which is from PM_EVT_ERROR_UNEXPECTED case

static void pm_evt_handler(pm_evt_t const * p_evt)
{
    ret_code_t err_code;
    switch(p_evt->evt_id)
    {

        case PM_EVT_ERROR_UNEXPECTED:
        NRF_LOG_RAW_INFO("PM_EVT_ERROR_UNEXPECTED \n");
            // Assert.
            APP_ERROR_CHECK(p_evt->params.error_unexpected.error);
            
...}

It tells me that NRF_ERROR_SOFTDEVICE_NOT_ENABLED  which is silly because I have already connected with an advertising device, so BLE stack is up and running.

After some digging I have found out that peer_data_delete_process() calls send_unexpected_error()

  

So it looks like sometimes the FDS module does not initialize.

I added some logs and here is what I got after booting:

FAILING CASE:

<info> app: __________________start
<info> app: ~~ ZMIANA NAZWY (na v1) W: gap_params_init
<info> app: Setting vector table to bootloader: 0x00078000
<info> app: Setting vector table to main app: 0x00026000
<info> app: zostaje domyslny adv_name
<error> app: bzdurne wartosci ts:0, cnt:0
<info> app: nie ma w ogole pomiarow offline, bo secret_word to: 26DB1 wobec 0xAAAAAAAA
<info> app: epoch 0
<info> app: battery_meas handler rusza
<info> app: reset_reason: 0x00000004.

WORKING CASE:

<info> app: __________________start
<info> app: ~~ ZMIANA NAZWY (na v1) W: gap_params_init
<info> app: Setting vector table to bootloader: 0x00078000
<info> app: Setting vector table to main app: 0x00026000
.......fds..FDS_EVT_INIT
.......fds..FDS_EVT_WRITE
.......fds..FDS_EVT_UPDATE
.......fds..FDS_EVT_DEL_RECORD
<info> app: zostaje domyslny adv_name
<error> app: bzdurne wartosci ts:0, cnt:0
<info> app: nie ma w ogole pomiarow offline, bo secret_word to: 26DB1 wobec 0xAAAAAAAA
<info> app: epoch 0
<info> app: battery_meas handler rusza
<info> app: reset_reason: 0x00000001.

So all the difference is the FDS event logs don't appear occassionaly.

The logs that makes difference between failing and working case are in peer_data_storage.c:

static void fds_evt_handler(fds_evt_t const * const p_fds_evt)
{
    pm_evt_t pds_evt =
    {
        .peer_id = file_id_to_peer_id(p_fds_evt->write.file_id)
    };

    switch (p_fds_evt->id)
    {   
        case FDS_EVT_INIT:
            NRF_LOG_RAW_INFO(".......fds..FDS_EVT_INIT\n");
        case FDS_EVT_WRITE:
            NRF_LOG_RAW_INFO(".......fds..FDS_EVT_WRITE\n");
        case FDS_EVT_UPDATE:
            NRF_LOG_RAW_INFO(".......fds..FDS_EVT_UPDATE\n");
        case FDS_EVT_DEL_RECORD:
            NRF_LOG_RAW_INFO(".......fds..FDS_EVT_DEL_RECORD\n");
            
.....}

---

Furthermore I have added those logs to observe the state of flash memory:

     NRF_LOG_INFO("pages_available %d open_records %d valid_records %d",
                   stat.pages_available,stat.open_records,stat.valid_records);            
     NRF_LOG_INFO("dirty_records %d words_reserved %d words_used %d",
                  stat.dirty_records,stat.words_reserved,stat.words_used);            
     NRF_LOG_INFO("largest_contig %d freeable_words %d corruption %d",
                  stat.largest_contig,stat.freeable_words,stat.corruption); 

And here is the output (called after booting before trying to pair):

FAILING CASE:

<info> app: pages_available 0 open_records 0 valid_records 0
<info> app: dirty_records 0 words_reserved 0 words_used 0
<info> app: largest_contig 0 freeable_words 0 corruption 0

WORKING CASE:

<info> app: pages_available 7 open_records 0 valid_records 0
<info> app: dirty_records 0 words_reserved 0 words_used 12
<info> app: largest_contig 1022 freeable_words 0 corruption 0

What more info do you need to guide me to solve this?

Thanks

Parents
  • Hi,

    I agree that it looks to be a problem with the FDS initialization. Have you tried to reserve the 7 flash pages in the bootloader so the can't get overwritten during DFU(Memory layout)? FDS may fail to free the pages aftewards if not. 

  • Hi

    Thank you for response.

    If you are talking about DFU_APP_DATA_RESERVED then yes, I have reserved it.

    I shouldn't have any negative effect but I changed if from 7 to 8 -both on app side and bootloader side. Now there is 

        #define FDS_VIRTUAL_PAGES 8

    in application project

    and

        #define DFU_APP_DATA_RESERVED               (CODE_PAGE_SIZE * 8)

    in bootloader project.

    The situation with not-initializing the FDS still happens occassionally, but resetting the board with RESET button does the job and with the next boot-up the FDS works.

    I thought there may be some kind of misconfiguration with the FDS or DFU. Right after (successfull) flashing and booting the fds_stat() shows:

     0> <info> app: pages_available 8 open_records 0 valid_records 0
     0> <info> app: dirty_records 0 words_reserved 0 words_used 14
     0> <info> app: largest_contig 1022 freeable_words 0 corruption 0

    after bonding with one device it shows:

     0> <info> app: pages_available 8 open_records 0 valid_records 3
     0> <info> app: dirty_records 0 words_reserved 0 words_used 52
     0> <info> app: largest_contig 1022 freeable_words 0 corruption 0

    then I prepared .zip package. New app differs with TWO bytes, "v2" instead of "v1" in log content and "board-v2" instead of "board-v1" in #define DEVICE_NAME - just so I know the new software is up and running.

    SET family= NRF52
    SET application= C:\repo\pikflometr\multiperipheral\pca10040\s132\ses\Output\Release\Exe\ble_app_multiperipheral_pca10040_s132.hex
    SET application-version=2
    SET hardware-version= 52
    SET sd-req= 0xA8
    SET private_key= C:\repo\pikflometr\lib\\private_key.pem
    SET nrfutil_location= C:\repo\pikflometr\lib\
    SET zipfile_name= p6-2.zip
    cd %nrfutil_location%
    nrfutil pkg generate --application %application% --application-version %application-version% --hw-version %hardware-version% --sd-req %sd-req% --key-file %private_key% %zipfile_name%

    after bluetooth DFU OTA sometimes I see:

    <info> app: pages_available 8 open_records 0 valid_records 4
    <info> app: dirty_records 3 words_reserved 0 words_used 82
    <info> app: largest_contig 1022 freeable_words 26 corruption 0

    or sometimes

    <info> app: pages_available 8 open_records 0 valid_records 4
    <info> app: dirty_records 2 words_reserved 0 words_used 78
    <info> app: largest_contig 1022 freeable_words 22 corruption 0

    Is it desired behaviour that number of words_used is bigger after the update? So do valid_records number?

    I made no writes to the flash memory. Just the:

    programming -> fds_stat() -> bonding -> fds_stat() ->  update -> fds_stat()

  • Hi,

    Yes, I meant DFU_APP_DATA_RESERVED and it should be either be equal or larger than the FDS_VIRTUAL_PAGES number defined by the main app. However, the fact that you can recover through a reset indicates that the problem is something else.

    It looks like none of the FDS events are coming through in the failing case. Is the return value from pm_init() passed to the APP_ERROR_CHECK() macro in your code?

    Pawel Niedbala said:
    Is it desired behaviour that number of words_used is bigger after the update? So do valid_records number?

    CCCD (notifications/indications configuration) are stored to flash by the peer manager so the client configuration can be retained across connections.  That may explain why the FDS stat is not always the same. E.g., if the phone disables indications on the buttonless characteristic, then enables it when it starts the DFU. 

  • Hi,

    I understand that CCCDs are stored in flash memory so it may change, but only when I change them, right? In my scenario I did not enabled nor disabled any indications/notifications. Just click bond and click DFU.

    As for return value from pm_init() - yes, its return value is passed to error checking. Moreover I print it over debug terminal: 

    static void peer_manager_init(bool erase_bonds)
    {
        switch(PEER_MANAGER_SETTING)
        {
            case NO_PAIRING_NO_BONDING:
                BOND = false;        MITM = false;        LESC = 0;        KEYPRESS = 0;
                IO_CAPS = BLE_GAP_IO_CAPS_NONE;        OOB = false;        MIN_KEY_SIZE = 0;        MAX_KEY_SIZE = 0;
                K_OWN_ENC = 0;        K_OWN_ID = 0;        K_PEER_ENC = 0;        K_PEER_INC = 0;        
            break;
            case PAIRING_NO_BONDING:
                BOND = false;        MITM = false;        LESC = 0;        KEYPRESS = 0;
                IO_CAPS = BLE_GAP_IO_CAPS_NONE;        OOB = false;        MIN_KEY_SIZE = 7;        MAX_KEY_SIZE = 16;
                K_OWN_ENC = 0;        K_OWN_ID = 0;        K_PEER_ENC = 0;        K_PEER_INC = 0;
            break;
            case JUST_WORKS_BONDING:
                BOND = true;        MITM = false;        LESC = 0;        KEYPRESS = 0;
                IO_CAPS = BLE_GAP_IO_CAPS_NONE;        OOB = false;        MIN_KEY_SIZE = 7;        MAX_KEY_SIZE = 16;
                K_OWN_ENC = 1;        K_OWN_ID = 1;        K_PEER_ENC = 1;        K_PEER_INC = 1;
            break;
            case PASSKEY_BONDING_DISPLAY:
                BOND = true;        MITM = true;        LESC = 0;        KEYPRESS = 0;
                IO_CAPS = BLE_GAP_IO_CAPS_DISPLAY_ONLY;        OOB = false;        MIN_KEY_SIZE = 7;        MAX_KEY_SIZE = 16;
                K_OWN_ENC = 1;        K_OWN_ID = 1;        K_PEER_ENC = 1;        K_PEER_INC = 1;
            break;       
        }
    
        ble_gap_sec_params_t sec_param;
        ret_code_t err_code;
        err_code = pm_init();
        APP_ERROR_CHECK(err_code); NRF_LOG_INFO("pm_init() return status :%d", err_code);
    
        if (erase_bonds)
        {
            pm_peers_delete();
            NRF_LOG_INFO("bonds deleted");
        }
        memset(&sec_param, 0, sizeof(ble_gap_sec_params_t));
        
        ...
    }

    Here is the output:

    <info> app: __________________start
    <info> app: ~~ ZMIANA NAZWY (na v1) W: gap_params_init
    <info> app: pm_init() return status :0
    <info> app: Setting vector table to bootloader: 0x00078000
    <info> app: Setting vector table to main app: 0x00026000
    <info> app: zostaje domyslny adv_name
    <error> app: bzdurne wartosci ts:0, cnt:0
    <info> app: nie ma w ogole pomiarow offline, bo secret_word to: 26DB1 wobec 0xAAAAAAAA
    <info> app: epoch 0
    <info> app: battery_meas handler rusza
    <info> app: reset_reason: 0x00000004.
    
    <info> app: adc handler rusza
    SPRAWDZENIE
    <info> app: pages_available 0 open_records 0 valid_records 0
    <info> app: dirty_records 0 words_reserved 0 words_used 0
    <info> app: largest_contig 0 freeable_words 0 corruption 0
    

    So the return value is 0, despite the FDS is not initialized properly (0 pages instead of 8).

  • Pawel Niedbala said:
    I understand that CCCDs are stored in flash memory so it may change, but only when I change them, right? In my scenario I did not enabled nor disabled any indications/notifications. Just click bond and click DFU.

    Yes, the phone shouldn't change the configuration automatically.  It could be related to gscm_local_database_has_changed() or pm_peer_rank_*() if they're being used in your code.

    I notice that reset reason is 0x4 when it fails, and 0x1 when it works in the logs you've posted. Is it like that every time?  

Reply
  • Pawel Niedbala said:
    I understand that CCCDs are stored in flash memory so it may change, but only when I change them, right? In my scenario I did not enabled nor disabled any indications/notifications. Just click bond and click DFU.

    Yes, the phone shouldn't change the configuration automatically.  It could be related to gscm_local_database_has_changed() or pm_peer_rank_*() if they're being used in your code.

    I notice that reset reason is 0x4 when it fails, and 0x1 when it works in the logs you've posted. Is it like that every time?  

Children
  • Hi! 

    The reset reason 0x4 occurs every time I reflash the device. 0x1 reason after every button reset. I don't see there is any dependency.

    Now after adding some code to the app (developing and refactoring other functionalities) I noticed something strange. I mean the application behaves differently depending on some logs being turned on or off.

    Logs are on when a predefined symbol is set. 

    Inside peer_manager.c

    ret_code_t pm_init(void)
    {
        ret_code_t err_code;
    
        err_code = pds_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: pds_init, code: %d", err_code);
            /* FIXME DFU err_code=134 */
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = pdb_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: pdb_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = sm_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: sm_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = smd_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: smd_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = gcm_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: gcm_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = gscm_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: gscm_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        err_code = im_init();
        if (err_code != NRF_SUCCESS)
        {
            #if defined TEST_FDS_DFU_DEBUG_LOG
                NRF_LOG_INFO("err from: im_init");
            #endif
            return NRF_ERROR_INTERNAL;
        }
    
        internal_state_reset();
    
        m_peer_rank_initialized = false;
        m_module_initialized    = true;
    
        // If PM_PEER_RANKS_ENABLED is 0, these variables are unused.
        UNUSED_VARIABLE(m_peer_rank_initialized);
        UNUSED_VARIABLE(m_peer_rank_token);
        UNUSED_VARIABLE(m_current_highest_peer_rank);
        UNUSED_VARIABLE(m_highest_ranked_peer);
    
        return NRF_SUCCESS;
    }
    

    inside fds.c:

    // Initialize the filesystem.
    static ret_code_t init_execute(uint32_t prev_ret, fds_op_t * const p_op)
    {
        ret_code_t ret = FDS_ERR_INTERNAL;
        #if defined TEST_FDS_DFU_DEBUG_LOG
            NRF_LOG_RAW_INFO("init_execute: prev_ret %d, p_op %d\n", prev_ret, p_op->init.step);
        #endif
        if (prev_ret != NRF_SUCCESS)
        {
            // A previous operation has timed out.
            m_flags.initializing = false;
            return FDS_ERR_OPERATION_TIMEOUT;
        }
    
        switch (p_op->init.step)
        {
            case FDS_OP_INIT_TAG_SWAP:
            {
                // The page write offset was determined previously by pages_init().
                p_op->init.step = FDS_OP_INIT_TAG_DATA;
                ret             = page_tag_write_swap();
            } break;
    
            case FDS_OP_INIT_TAG_DATA:
            {
                // Tag remaining erased pages as data.
                bool write_reqd = false;
                for (uint16_t i = 0; i < FDS_DATA_PAGES; i++)
                {
                    if (m_pages[i].page_type == FDS_PAGE_ERASED)
                    {
                        m_pages[i].page_type = FDS_PAGE_DATA;
                        write_reqd           = true;
                        ret = page_tag_write_data(m_pages[i].p_addr);
                        break;
                    }
                }
                if (!write_reqd)
                {
                    m_flags.initialized  = true;
                    m_flags.initializing = false;
                    return FDS_OP_COMPLETED;
                }
            } break;
    
            case FDS_OP_INIT_ERASE_SWAP:
            {
                // If the swap is going to be discarded then reset its write_offset.
                p_op->init.step          = FDS_OP_INIT_TAG_SWAP;
                m_swap_page.write_offset = FDS_PAGE_TAG_SIZE;
    
                ret = nrf_fstorage_erase(&m_fs, (uint32_t)m_swap_page.p_addr, FDS_PHY_PAGES_IN_VPAGE, NULL);
            } break;
    
            case FDS_OP_INIT_PROMOTE_SWAP:
            {
                p_op->init.step       = FDS_OP_INIT_TAG_SWAP;
    
                // When promoting the swap, keep the write_offset set by pages_init().
                ret = page_tag_write_data(m_swap_page.p_addr);
    
                uint16_t const         gc         = m_gc.cur_page;
                uint32_t const * const p_old_swap = m_swap_page.p_addr;
    
                // Execute the swap.
                m_swap_page.p_addr = m_pages[gc].p_addr;
                m_pages[gc].p_addr = p_old_swap;
    
                // Copy the offset from the swap to the new page.
                m_pages[gc].write_offset = m_swap_page.write_offset;
                m_swap_page.write_offset = FDS_PAGE_TAG_SIZE;
    
                m_pages[gc].page_type = FDS_PAGE_DATA;
            } break;
    
            default:
                // Should not happen.
                break;
        }
        #if defined TEST_FDS_DFU_DEBUG_LOG
            NRF_LOG_RAW_INFO("init_execute: ret %d\n", ret);
        #endif
        if (ret != FDS_SUCCESS)
        {
            // fstorage queue was full.
            m_flags.initializing = false;
            return FDS_ERR_BUSY;
        }
    
        return FDS_OP_EXECUTING;
    }

    And here it is - when the logs are turned on the fds module initializes correctly on the first run. When they are turned off the device need 3 consecutive button resets until it starts working properly.

    I made hexdump before and after each reset in the failing case and below is the diff:

    just flashed against 1 reset: : 

    just flashed against 2 resets: 

    just flashed against 3 resets: 

    Is it some RAM issue? I am nearing to 94% RAM usage based on SES build output. Or some NRF_LOG delay?

  • Hi,

    Sorry, I thought you had actually to perform DFU to reproduce this. That made everything a bit more confusing. Your last three screenshot shows that FDS is only able to tag two pages at a time for the failing case, which explains why the extra resets help. Could you try to add a busy wait (e.g., nrf_delay_ms(10)) right after pm_init() as a test to see if solves the problem even with logging on?

    Also, logging in itself should have any impact on FDS, but it could be related to stack usage, as you said. You can use the Stack guard to catch any stack overflows in your code. 

    static inline void stack_guard_init(void)
    {
        APP_ERROR_CHECK(nrf_mpu_lib_init()); //<-- must be initalized for stack guard to trigger fault on stack overflow
        APP_ERROR_CHECK(nrf_stack_guard_init());
    }

  • Hi

    After adding busy wait before and after pm_init() the app seems to work correctly after each bootup, there is no need to reset the board. At least it has not happened yet. 

        nrf_delay_ms(10); 
        err_code = pm_init();
        APP_ERROR_CHECK(err_code); NRF_LOG_INFO("pm_init() return status :%d", err_code);
        nrf_delay_ms(10); 

    Can you comment why FDS may not be initializing properly sometimes without that delay? 

    I haven't tried using stack guard, is it really that helpful? Also for SDK15 it is nrf_mpu_init() not nrf_mpu_lib_init().

  • Hi,

    Ok, thanks for confirming. I think it's likely that the Softdevice fails to schedule the flash operations due to BLE advertising and connections (Flash memory API). Maybe you have a re-connect immediately on startup that triggers this scenario? 

    Do you start advertising directly or do you use ble_advertising_start() which should delay it until there are no pending flash operations? 

    uint32_t ble_advertising_start(ble_advertising_t * const p_advertising,
                                   ble_adv_mode_t            advertising_mode)
    {
        uint32_t ret;
    
        if (p_advertising->initialized == false)
        {
            return NRF_ERROR_INVALID_STATE;
        }
    
        p_advertising->adv_mode_current = advertising_mode;
    
        // Delay starting advertising until the flash operations are complete.
        if (flash_access_in_progress())
        {
            p_advertising->advertising_start_pending = true;
            return NRF_SUCCESS;
        }

     

    Pawel Niedbala said:
    I haven't tried using stack guard, is it really that helpful? Also for SDK15 it is nrf_mpu_init() not nrf_mpu_lib_init().

     It can help you detect stack overflows. Probably not necessary if you've already profiled the stack usage or know that you have plenty of space left.  

  • Indeed, I start advertising directly, advertising_start() which calls sd_ble_gap_adv_start(...). I do not even use advertising module. Also sd_ble_gap_connect(...) is not being called at all in my application.

    If your guess is correct the return code does not explain the reason properly and the error caught by APP_ERROR_CHECK leads to a misunderstanding.

Related