nRF52840 Possible Flash Damage

I seem to be randomly causing problems with the flash of our nRF52840 sensors (Zephyr v2.02).

I have already bricked three sensors somehow. Before full production, I need to know what I am doing wrong and how this is happening.

How many times can the flash be written?

I update the sensor over BLE (not SMP) from our nRF52840 gateway, whose host processor is connected to the internet to download the image.

With my current code I can update one of the sensors, but not the other. The other one seems to hard fault when img_mgmt_read_info() or img_mgmt_impl_write_pending() is called after the whole new image has been written. I have not been able to debug the cause because BLE is running. I tried calling bt_disable() before the img_mgmt calls, but the program then crashes.

I can program both sensors with the merged.hex file by way of the SDK or the debugger.

The DFU update of the "good" sensor always succeeds, and the sensor does not revert to the old image after reset.

My DFU process:

At startup, main() calls ami_dfu_Init(), which allocates memory.

When the gateway connects to perform an upgrade, I call ami_dfu_Start(), which erases the flash slot and initializes streaming.

I use a ring buffer to receive the image packets [ami_dfu_ReceivingPacket()] and write to flash only in accumulated 512-byte chunks [ami_dfu_HandleAndRespond()].
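A portable sketch of the header handling in ami_dfu_ReceivingPacket(), assuming the gateway sends the 4-byte offset and file size least-significant byte first (as the LSB_LO_BYTE naming suggests). struct dfu_packet, dfu_packet_parse() and dfu_packet_is_next() are hypothetical names. In the firmware itself, sys_get_le32() from <zephyr/sys/byteorder.h> (already included in the file) performs the same decode without relying on the byte layout of the union.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable little-endian decode; same result as Zephyr's sys_get_le32(). */
static uint32_t le32_decode(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical parsed view of one DFU packet. */
struct dfu_packet {
    uint32_t offset;        /* image offset of the payload */
    bool has_size;          /* true for the first packet (offset 0) */
    uint32_t file_size;     /* valid only if has_size */
    const uint8_t *data;    /* start of the payload */
    size_t data_len;        /* payload length */
};

/* Returns 0 on success, -1 if the packet is too short for its header. */
static int dfu_packet_parse(const uint8_t *buf, size_t len, struct dfu_packet *pkt)
{
    size_t pos = 0;

    if (len < 4) {
        return -1;
    }
    pkt->offset = le32_decode(&buf[pos]);
    pos += 4;

    /* Offset 0 carries the total file size right after the offset. */
    pkt->has_size = (pkt->offset == 0);
    pkt->file_size = 0;
    if (pkt->has_size) {
        if (len < pos + 4) {
            return -1;
        }
        pkt->file_size = le32_decode(&buf[pos]);
        pos += 4;
    }
    pkt->data = &buf[pos];
    pkt->data_len = len - pos;
    return 0;
}

/* Accept the payload only if it continues exactly where the ring buffer
 * left off; any other offset is treated as a retransmission and dropped. */
static bool dfu_packet_is_next(const struct dfu_packet *pkt, uint32_t ring_wr_offset)
{
    return pkt->offset == ring_wr_offset;
}
```

As in the original, a mismatched offset is simply dropped; note that this also silently drops a packet that arrives ahead of the expected offset, not only a retry.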

Would it be better to flash larger chunks, and is that even possible? (This is a side question, not the main problem.)

I use img_mgmt_impl_write_image_data() to write the data.
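The write side of ami_dfu_HandleAndRespond() boils down to slicing the file into CONFIG_IMG_BLOCK_BUF_SIZE pieces and passing last = true only with the final piece. A portable sketch of just that slicing, assuming a 512-byte block and using a hypothetical callback (write_fn) in place of img_mgmt_impl_write_image_data(); the ring-buffer accumulation is left out:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 512u  /* stands in for CONFIG_IMG_BLOCK_BUF_SIZE */

/* Hypothetical stand-in for img_mgmt_impl_write_image_data(off, data, len, last). */
typedef int (*write_fn)(uint32_t off, uint32_t len, bool last, void *user);

/* Splits file_size bytes into BLOCK_SIZE writes; only the final write has
 * last = true. Returns the number of writes issued, or -1 on a write error. */
static int write_in_blocks(uint32_t file_size, write_fn wr, void *user)
{
    uint32_t off = 0;
    int count = 0;

    while (off < file_size) {
        uint32_t left = file_size - off;
        uint32_t len = left < BLOCK_SIZE ? left : BLOCK_SIZE;
        bool last = (len == left);

        if (wr(off, len, last, user) != 0) {
            return -1;
        }
        off += len;
        count++;
    }
    return count;
}

/* Example recorder: remembers the most recent write and counts last = true. */
struct wr_rec {
    uint32_t last_off;
    uint32_t last_len;
    int last_count;
};

static int record_write(uint32_t off, uint32_t len, bool last, void *user)
{
    struct wr_rec *rec = user;
    rec->last_off = off;
    rec->last_len = len;
    rec->last_count += last ? 1 : 0;
    return 0;
}
```

For a 1300-byte file this issues writes of 512, 512 and 276 bytes, and only the 276-byte write carries last = true.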

When the entire image has been written, I call the code shown below and then reset.

            //-----------------------------------------------------------------
            // All of the image was written?
            //-----------------------------------------------------------------
            if (dfu_state->FileWrOffset>=dfu_state->FileSize.ulong){
              int image_slot = 1;
              struct image_version ver;
              uint8_t  hash[IMAGE_HASH_LEN]; // img_mgmt_read_info() copies IMAGE_HASH_LEN (32) bytes here
              uint32_t flags;
              bool permanent = false;
// bt_disable();// ?!?
              err = img_mgmt_read_info(image_slot,&ver,hash,&flags);
              if (err==0){
                err = img_mgmt_impl_write_pending(image_slot,permanent);
                if (err==0){
                  d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FinishedDFU");
                }else{
                      dfu_state->err = AMI_DFU_CENTRAL_ERR_SET_PENDING;                
                      d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FinishedDFU ErrSetPending");
                }
              }else{
                    dfu_state->err = AMI_DFU_CENTRAL_ERR_RD_INFO;
                    d_printf(LINE_INFO,kDbg_Error|kDbg_General,"ImageInfoErr:%d",err);
              }
              //---------------------------------------------------------------
              // Reset after reply
              //---------------------------------------------------------------
              ami_state->reset_secs = 4;
              break;
            }
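One thing worth ruling out first: img_mgmt_read_info() copies the image's full SHA-256 hash, IMAGE_HASH_LEN (32) bytes, into the buffer passed as its third argument. If that buffer is a single uint8_t, the copy overruns the stack frame, which could plausibly cause a hard fault at this call or shortly after it. The portable sketch below uses a hypothetical read_info_mock() in place of the real call to show the required sizing; only IMAGE_HASH_LEN mirrors a real definition (img_mgmt/image.h).

```c
#include <stdint.h>
#include <string.h>

#define IMAGE_HASH_LEN 32  /* size of the SHA-256 TLV, as in img_mgmt/image.h */

/* Hypothetical stand-in for img_mgmt_read_info(): like the real function,
 * it copies a full IMAGE_HASH_LEN-byte hash into the caller's buffer. */
static int read_info_mock(uint8_t *hash)
{
    static const uint8_t tlv_hash[IMAGE_HASH_LEN] = { [0] = 0xDE, [31] = 0xAD };
    memcpy(hash, tlv_hash, IMAGE_HASH_LEN);
    return 0;
}

/* Correct caller: the destination is a full-size array, so nothing beyond it
 * is touched. Declaring `uint8_t hash;` and passing &hash instead would let
 * the copy write 31 bytes past the variable, over the rest of the stack. */
static int read_hash_ok(uint8_t out[IMAGE_HASH_LEN])
{
    uint8_t hash[IMAGE_HASH_LEN];
    int err = read_info_mock(hash);

    if (err == 0) {
        memcpy(out, hash, IMAGE_HASH_LEN);
    }
    return err;
}
```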

Here is the whole file:

#include <stddef.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <kernel.h>
#include <zephyr.h>

#include <sys/printk.h>

#include <zcbor_encode.h>
#include <zcbor_decode.h>
#include <zcbor_common.h>

#include <zephyr/bluetooth/bluetooth.h>
#include <zephyr/bluetooth/hci.h>
#include <zephyr/bluetooth/hci_vs.h>
#include <zephyr/bluetooth/conn.h>
#include <zephyr/bluetooth/uuid.h>
#include <zephyr/bluetooth/gatt.h>
#include <bluetooth/gatt_dm.h>
#include <zephyr/sys/byteorder.h>
#include <bluetooth/services/dfu_smp.h>
#include <sys/reboot.h>
#include <zephyr/sys/ring_buffer.h>

#include <dfu/mcuboot.h>
#include <zephyr/dfu/mcuboot.h>
#include <dfu/dfu_target.h>
#include <dfu/dfu_target_stream.h>
#include <dfu/dfu_target_mcuboot.h>

#include <img_mgmt/img_mgmt_impl.h>
#include <img_mgmt/img_mgmt.h>
#include <img_mgmt/image.h>
#include <storage/flash_map.h>
#include <dfu/flash_img.h>
#include <devicetree.h>

#if FLASH_AREA_LABEL_EXISTS(image_1)
	#define UPLOAD_FLASH_AREA_ID FLASH_AREA_ID(image_1)
#endif


#include "aqua_def.h"
#include "d_printf.h"
#include "ami_ble.h"
#include "ami_main_serv.h"
#include "ami_setup_serv.h"
#include "ami_learn_serv.h"
#include "ami_transfer_serv.h"
#include "ami_timer.h"
#include "ami_dfu.h"
#include "ami_watchdog.h"

//-----------------------------------------------------------------------------
// define Ring buffers
//-----------------------------------------------------------------------------
RING_BUF_ITEM_DECLARE_SIZE(dfu_ring_buf, AMI_DFU_RING_BUF_SIZE);

//-----------------------------------------------------------------------------
// Variables
//-----------------------------------------------------------------------------
ami_dfu_state_struct    *dfu_state = NULL;
static struct	flash_img_context flash_img;

//-----------------------------------------------------------------------------
// ami_dfu_ReceivingPacket
//  Description
//   Parses an incoming BLE image packet
//  Parameters
//  Returns
//   0 = OK
//-----------------------------------------------------------------------------
int ami_dfu_ReceivingPacket(uint8_t *buf,uint16_t buf_pos,uint16_t len)
{
  //---------------------------------------------------------------------------
  // transfer_receive_cb
  //---------------------------------------------------------------------------
  int       err = 0;
  uint16_t  data_len;
  //---------------------------------------------------------------------------
  // FileRxOffset
  //---------------------------------------------------------------------------
  dfu_state->FileRxOffset.bytes[LSB_LO_BYTE] = buf[buf_pos++];
  dfu_state->FileRxOffset.bytes[LSB_HI_BYTE] = buf[buf_pos++];
  dfu_state->FileRxOffset.bytes[MSB_LO_BYTE] = buf[buf_pos++];
  dfu_state->FileRxOffset.bytes[MSB_HI_BYTE] = buf[buf_pos++];
  //---------------------------------------------------------------------------
  // If FileRxOffset==0 then FileSize sent
  //---------------------------------------------------------------------------
  if (!dfu_state->FileRxOffset.ulong){
    dfu_state->FileSize.bytes[LSB_LO_BYTE] = buf[buf_pos++];
    dfu_state->FileSize.bytes[LSB_HI_BYTE] = buf[buf_pos++];
    dfu_state->FileSize.bytes[MSB_LO_BYTE] = buf[buf_pos++];
    dfu_state->FileSize.bytes[MSB_HI_BYTE] = buf[buf_pos++];
    d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FileSize %lu",dfu_state->FileSize.ulong);
  }
  data_len = (len-buf_pos);
  //---------------------------------------------------------------------------
  // Something to write to Flash 
  //---------------------------------------------------------------------------
  if (data_len){
    //-------------------------------------------------------------------------
    // Store image data 
    //-------------------------------------------------------------------------
    if (dfu_state->FileRingWrOffset==dfu_state->FileRxOffset.ulong){
      uint16_t num_put = ring_buf_put(&dfu_ring_buf,&buf[buf_pos],data_len);
      if (num_put!=data_len) {
        err            = AMI_DFU_CENTRAL_ERR_RING_OVERFLOW;
        dfu_state->err = AMI_DFU_CENTRAL_ERR_RING_OVERFLOW;
        d_printf(LINE_INFO,kDbg_Error|kDbg_General,"RingPut:%u Rx:%u Off:%ld",num_put,data_len,dfu_state->FileRxOffset.ulong);
        return err;
      }else{
            dfu_state->FileRingWrOffset += data_len;
      }
    }else{
          // This was a retry: the central did not receive an acknowledgement
    }
  }
  return err;
} // ami_dfu_ReceivingPacket

//-----------------------------------------------------------------------------
// ami_dfu_HandleAndRespond
//  Description
//   Handles the incoming DFU image blocks as they accumulate
//  Parameters
//  Returns
//   On a fatal error: the last error code.
//   First reply after subscribing: the State field, which indicates whether the slot erase and init succeeded.
//   Otherwise: the next offset to write.
//-----------------------------------------------------------------------------
uint32_t ami_dfu_HandleAndRespond(void)
{
  static   uint8_t wr_buf[CONFIG_IMG_BLOCK_BUF_SIZE];
  bool     bLast      = false;
  //---------------------------------------------------------------------------
  // If there was a fatal error, just return the last error
  //---------------------------------------------------------------------------
  if (dfu_state->err){
    return dfu_state->err;
  }
  dfu_state->replies++;
  if (dfu_state->replies==1){
    return ami_state->State.ulong;
  }
  //---------------------------------------------------------------------------
  // All bytes were received?
  //---------------------------------------------------------------------------
  bool bRxFinished = (dfu_state->FileRingWrOffset>=dfu_state->FileSize.ulong);
  do{
      //-----------------------------------------------------------------------
      // Peek and see how many bytes accumulated exist
      //-----------------------------------------------------------------------
      uint32_t accum_bytes = ring_buf_size_get(&dfu_ring_buf);
      //-----------------------------------------------------------------------
      // Wait to write until we have CONFIG_IMG_BLOCK_BUF_SIZE bytes or last packet 
      //-----------------------------------------------------------------------
      if (accum_bytes<CONFIG_IMG_BLOCK_BUF_SIZE && !bRxFinished){
        break;
      }
      //-----------------------------------------------------------------------
      // Don't write more than CONFIG_IMG_BLOCK_BUF_SIZE bytes at a time
      //-----------------------------------------------------------------------
      if (accum_bytes>CONFIG_IMG_BLOCK_BUF_SIZE){
        accum_bytes = CONFIG_IMG_BLOCK_BUF_SIZE;
      }
      //-----------------------------------------------------------------------
      // Calculate the number of bytes left to write to flash
      //-----------------------------------------------------------------------
      uint32_t bytes_to_wr = (dfu_state->FileSize.ulong-dfu_state->FileWrOffset);
      //-----------------------------------------------------------------------
      // Last block to write?
      //-----------------------------------------------------------------------
      if (accum_bytes>=bytes_to_wr){
        accum_bytes = bytes_to_wr;
        bLast       = true;
        d_printf(LINE_INFO,kDbg_Error|kDbg_General,"DFU LastWr %u",accum_bytes);
      }
      //-----------------------------------------------------------------------
      // Get the bytes from the ring buffer
      //-----------------------------------------------------------------------
      uint32_t data_len   = ring_buf_get(&dfu_ring_buf,wr_buf,accum_bytes);
      //-----------------------------------------------------------------------
      // Nothing to write?
      //-----------------------------------------------------------------------
      if (!data_len){        
        break;
      }
      //-----------------------------------------------------------------------
      // Write the data to the flash
      //-----------------------------------------------------------------------
      int err = img_mgmt_impl_write_image_data(dfu_state->FileWrOffset,wr_buf,data_len,bLast);
      if (err!=0) {
        dfu_state->err = AMI_DFU_CENTRAL_ERR_WR_BLOCK;
        d_printf(LINE_INFO,kDbg_Error|kDbg_General,"ImageWrErr:%d",err);
        return dfu_state->err;
      }else{
            d_printf(LINE_INFO,kDbg_Info|kDbg_General,"FlashWr:%ld %u",dfu_state->FileWrOffset,data_len);
            //-----------------------------------------------------------------
            // Update last offset written
            //-----------------------------------------------------------------
            dfu_state->FileWrOffset += data_len;
            //-----------------------------------------------------------------
            // All of the image was written?
            //-----------------------------------------------------------------
            if (dfu_state->FileWrOffset>=dfu_state->FileSize.ulong){
              int image_slot = 1;
              struct image_version ver;
              uint8_t  hash[IMAGE_HASH_LEN]; // img_mgmt_read_info() copies IMAGE_HASH_LEN (32) bytes here
              uint32_t flags;
              bool permanent = false;
// bt_disable();// ?!?
              err = img_mgmt_read_info(image_slot,&ver,hash,&flags);
              if (err==0){
                err = img_mgmt_impl_write_pending(image_slot,permanent);
                if (err==0){
                  d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FinishedDFU");
                }else{
                      dfu_state->err = AMI_DFU_CENTRAL_ERR_SET_PENDING;                
                      d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FinishedDFU ErrSetPending");
                }
              }else{
                    dfu_state->err = AMI_DFU_CENTRAL_ERR_RD_INFO;
                    d_printf(LINE_INFO,kDbg_Error|kDbg_General,"ImageInfoErr:%d",err);
              }
              //---------------------------------------------------------------
              // Reset after reply
              //---------------------------------------------------------------
              ami_state->reset_secs = 4;
              break;
            }
      }
  }while (bRxFinished);
  return dfu_state->FileRingWrOffset;
} // ami_dfu_HandleAndRespond

//-----------------------------------------------------------------------------
// ami_dfu_Start
//  Description
//   Inits the dfu state, erases slot and sets the dfu streaming buffer
//  Parameters
//  Returns
//  AMI_DFU_CENTRAL_ERR_NONE = OK
//-----------------------------------------------------------------------------
int ami_dfu_Start(void)
{
  //---------------------------------------------------------------------------
  // Clear dfu state
  //---------------------------------------------------------------------------
  memset(dfu_state,0,sizeof(ami_dfu_state_struct));
  //---------------------------------------------------------------------------
  // Reset dfu ring buffer
  //---------------------------------------------------------------------------
  ring_buf_reset(&dfu_ring_buf);
  //---------------------------------------------------------------------------
  // Erase upload slot
  //---------------------------------------------------------------------------
  int err = img_mgmt_impl_erase_slot();
  if (err!=0){
#ifdef TBD    
    //-------------------------------------------------------------------------
    // Reset after reply
    //-------------------------------------------------------------------------
    ami_state->reset_secs = 4;
#endif
    dfu_state->err = AMI_DFU_CENTRAL_ERR_ERASE;
    d_printf(LINE_INFO,kDbg_Error|kDbg_General,"ImageEraseErr:%d", err);
    return err;
  }      
  //---------------------------------------------------------------------------
  // Init Flash stream writer (CONFIG_IMG_BLOCK_BUF_SIZE)
  //---------------------------------------------------------------------------
  g_img_mgmt_state.area_id = UPLOAD_FLASH_AREA_ID;
  err = flash_img_init(&flash_img);
  if (err!=0){
    d_printf(LINE_INFO,kDbg_Error|kDbg_General,"FlashInitErr:%d", err);
    dfu_state->err = AMI_DFU_CENTRAL_ERR_IMAGE_INIT;
#ifdef TBD     
    //-------------------------------------------------------------------------
    // Reset after reply
    //-------------------------------------------------------------------------
    ami_state->reset_secs = 4;
#endif    
    return err;
  }
  g_img_mgmt_state.area_id = UPLOAD_FLASH_AREA_ID;
  return AMI_DFU_CENTRAL_ERR_NONE;
} // ami_dfu_Start

//-----------------------------------------------------------------------------
// ami_dfu_Init
//  Description
//   Allocates memory
//  Parameters
//  Returns
//   0 = OK
//-----------------------------------------------------------------------------
int ami_dfu_Init(void)
{
  int err = 0;

  //---------------------------------------------------------------------------
  // Allocate memory
  //---------------------------------------------------------------------------
  dfu_state  = k_malloc(sizeof(ami_dfu_state_struct));
  while (!dfu_state);   // hang here if k_malloc() failed
  memset(dfu_state,0,sizeof(ami_dfu_state_struct));
  return err;
} // ami_dfu_Init

Also:

In my main file at startup, I check whether the image is confirmed, and the system reports that it is not.

I do not know whether this could cause a problem: at least when the code runs under the debugger, the image is never reported as confirmed, and writing the confirmation does nothing.

  //---------------------------------------------------------------------------
  // DFU upgraded and not confirmed, confirm
  //---------------------------------------------------------------------------
  if (!boot_is_img_confirmed()){
    err = boot_write_img_confirmed();
    if (err) {
      d_printf(LINE_INFO,kDbg_Error|kDbg_Display,"BootConfirmErr:%d",err);
    }else{
      d_printf(LINE_INFO,kDbg_Error|kDbg_Display,"BootConfirmed");
    }
  }


boot_is_img_confirmed() returns false, because the image_ok flag it reads is BOOT_FLAG_UNSET.
boot_write_img_confirmed() calls boot_set_confirmed(), which calls boot_set_confirmed_multi(0).

In boot_set_confirmed_multi(), state_primary_slot.magic is 3 (BOOT_MAGIC_UNSET), so it jumps to done without doing anything!

    switch (state_primary_slot.magic) {
    case BOOT_MAGIC_GOOD:
        /* Confirm needed; proceed. */
        break;

    case BOOT_MAGIC_UNSET:
        /* Already confirmed. */
        goto done;

    case BOOT_MAGIC_BAD:
        /* Unexpected state. */
        rc = BOOT_EBADVECT;
        goto done;
    }
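The two observations above are consistent with each other. boot_is_img_confirmed() looks at the image_ok flag, while boot_set_confirmed_multi() first looks at the trailer magic. So when the primary-slot trailer is still erased (presumably the state right after programming merged.hex with a debugger, since no swap has written a trailer), the image reports "not confirmed" and confirming is a no-op that still returns 0; MCUboot's own comment calls this case "Already confirmed". Below is a portable, hypothetical model of that logic: struct trailer, is_confirmed() and set_confirmed() are illustrative names, and only MAGIC_UNSET = 3 is taken from the debugger output above (the other values are illustrative).

```c
#include <stdbool.h>

/* Mirrors the trailer states in the switch above; MAGIC_UNSET = 3 as seen
 * in the debugger, the other values are illustrative. */
enum { MAGIC_GOOD = 1, MAGIC_BAD = 2, MAGIC_UNSET = 3 };
enum { FLAG_SET = 1, FLAG_UNSET = 3 };

/* Hypothetical model of the primary-slot image trailer. */
struct trailer {
    int magic;
    int image_ok;
};

/* Like boot_is_img_confirmed(): true only if image_ok is explicitly set. */
static bool is_confirmed(const struct trailer *t)
{
    return t->image_ok == FLAG_SET;
}

/* Like the switch in boot_set_confirmed_multi(): writes image_ok only when
 * the magic is good, returns 0 without writing when it is unset. */
static int set_confirmed(struct trailer *t)
{
    switch (t->magic) {
    case MAGIC_GOOD:
        t->image_ok = FLAG_SET;
        return 0;
    case MAGIC_UNSET:
        return 0;   /* "Already confirmed": nothing to write */
    default:
        return -1;  /* stands in for BOOT_EBADVECT */
    }
}
```

With an erased trailer (magic and image_ok both unset), set_confirmed() succeeds but is_confirmed() keeps returning false; after a real test swap (magic good), the same call flips image_ok and the image reports as confirmed.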

github.com/.../1066

  • Hi,

    How many times can the flash be written?

    The nRF52840 Product Specification (PS) specifies 10,000 write/erase cycles for the flash. Under normal circumstances you should not need to worry about reaching this limit, as my colleague answered in this DevZone case. The linked case has a different use case, but I think the information there is relevant for your project.

    However, since you state that you can update one of the sensors but not the other, it could indicate an issue with the flash, but I can't say for sure. Do you think you've written that many cycles to the flash during the device's lifespan? Do you have a less-used device you can test with, or do you only have one device available?

    My DFU process:

    Am I correct in assuming that you've written your own custom SMP server? 

    Could you use nrfjprog to read out the flash-area with the DFU images from all 3 devices and compare them to verify that the images are different? 

    I would also like to suggest these two resources, which contain many great samples for DFU and bootloaders in nRF Connect SDK, in case you haven't seen them already. They might not contain exactly what you need to update the sensors, but I find them great to have available when developing anything related to these topics:

  • It does not seem likely that, even with a software DFU bug, I have exceeded 10,000 writes.

    The three sensors I bricked do not respond to SEGGER connection attempts, so any software on top of it also gives an error. So with those units it may be a different problem, not the flash.

    The sensor that hard faults after I have written the entire image to slot 1 made me think that maybe some flash region could not be programmed. I wanted to debug the cause, but I was unable to switch BLE off so that it would not cause timing faults while debugging. How can I prevent the BLE timing errors when debugging this issue?

    I can try using nrfjprog to read out the flash area. The flash areas are the defaults for the nRF52840.

    What address are slot1?

    I did not write a custom SMP server but use raw BLE transfers from our Gateway Central.

    But I do have SMP enabled for DFU updates from the phone app.

    I tried Simon's test SMP DFU code but could not get it to work and decided to try custom raw transfers without the SMP complicating things.

    Before I start the DFU, I call img_mgmt_impl_erase_slot() to erase slot 1, and then flash_img_init().

    Then, I use img_mgmt_impl_write_image_data() to write incoming 512-byte blocks of the image.

    When I have finished writing the whole image to slot 1, I call img_mgmt_read_info() and img_mgmt_impl_write_pending() with the permanent flag set to false. It is all in the code in my post.

    It seems to work.

    I guess each img_mgmt_impl_erase_slot() call counts as one of the 10,000 write/erase cycles.

    I will get back to you.

    Thanks David

  • Hi,

    DavidKaplan said:
    What address are slot1?

    The first thing that runs starts at address 0x0. Depending on whether you're using MCUboot and/or an immutable bootloader (b0), the location will change.

    If you only have the application, slot1 should start at 0x0. If you have MCUboot in addition to the application, then MCUboot starts at 0x0 and the application (primary slot) starts after it. If you're also using b0, then b0 starts at 0x0, followed by MCUboot, followed by the application.

    You can get an indication of where the various partitions are located by navigating to the project's build folder and running ninja partition_manager_report.
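Applied to the partition report David posts later in this thread, this rule gives the address of slot 1 in img_mgmt terms (image_slot = 1, i.e. mcuboot_secondary) by adding up partition sizes from 0x0. A small sketch of that arithmetic; the sizes are copied from that report, while the macro, struct and function names are hypothetical:

```c
#include <stdint.h>

/* Sizes from the partition_manager report later in this thread. */
#define MCUBOOT_SIZE      0xC000u   /* mcuboot */
#define MCUBOOT_PAD_SIZE  0x200u    /* mcuboot_pad (image header) */
#define PRIMARY_SLOT_SIZE 0x74000u  /* mcuboot_primary */

struct slot_addrs {
    uint32_t primary;    /* mcuboot_primary (slot 0) */
    uint32_t app;        /* application code inside slot 0 */
    uint32_t secondary;  /* mcuboot_secondary (slot 1, the upload slot) */
};

/* In this layout the partitions sit back to back from 0x0:
 * MCUboot, then the primary slot (pad + app), then the secondary slot. */
static struct slot_addrs compute_slots(void)
{
    struct slot_addrs a;

    a.primary = MCUBOOT_SIZE;
    a.app = a.primary + MCUBOOT_PAD_SIZE;
    a.secondary = a.primary + PRIMARY_SLOT_SIZE;
    return a;
}
```

With these sizes, slot 1 starts at 0x80000, which matches the mcuboot_secondary line in that report.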

    DavidKaplan said:
    It does not seem likely that, even with a software DFU bug, I have exceeded 10,000 writes.

    I had a closer look at the MCUboot documentation and found this: https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/mcuboot/design.html#swap-using-scratch. If you're using the swap-using-scratch algorithm, the scratch area wears during upgrades, which can dramatically reduce the number of times you can update the device. That number could be more plausible than 10,000. I suggest you look at that documentation to see whether this could be the scenario.
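A back-of-the-envelope sketch of why scratch wear can matter, not MCUboot's own formula: it assumes each scratch page is erased once for every scratch-sized step of a swap, so one upgrade costs ceil(image_pages / scratch_pages) erase cycles on the scratch area, and it ignores reverts and trailer/status writes. The 10,000-cycle endurance and 4 kB page are nRF52840 figures; the 0x74000-byte slot comes from the partition report later in this thread; the scratch size of one page is a hypothetical value (that report lists no scratch partition).

```c
#include <stdint.h>

#define FLASH_PAGE_SIZE 4096u   /* nRF52840 flash page size */
#define ENDURANCE       10000u  /* write/erase cycles per page (PS) */

/* Rough number of upgrades before the scratch area reaches ENDURANCE,
 * assuming one scratch erase per scratch-sized swap step. */
static uint32_t scratch_upgrades(uint32_t image_pages, uint32_t scratch_pages)
{
    uint32_t steps = (image_pages + scratch_pages - 1) / scratch_pages;  /* ceil */
    return ENDURANCE / steps;
}
```

For a full 0x74000-byte image (116 pages) and a one-page scratch this gives about 86 upgrades, against roughly 10,000 for the secondary slot, whose pages are each erased only once per DFU.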

    DavidKaplan said:
    I will get back to you.

    Sounds like a plan. Just a heads-up: I will be working on and off for the next couple of weeks due to the Christmas holidays in Norway, so the response time might be a bit longer than usual.

    Kind regards,
    Andreas

  • After reading the MCUboot documentation you linked, I understand that swap-using-scratch is the default and that it may leave us with only 256 upgrades.
    1) Does loading a new program while debugging count?

    2)  I erase the slot on the sensor before starting its file transfer and do not use CONFIG_IMG_ERASE_PROGRESSIVELY=y. Would it be better to use progressive erasures and not erase it at the start?

    3) The default nRF52840 block size is 512 bytes. Would it help to increase it if possible?
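On question 2: erasing progressively means erasing each flash page just before the first write into it, instead of wiping the whole slot up front. Per page the wear is the same (every page that receives image data is erased once per DFU either way); the difference is that pages beyond the end of the image are not erased, and the erase time is spread over the transfer. The sketch below is a hypothetical, portable model of that bookkeeping, not the implementation behind CONFIG_IMG_ERASE_PROGRESSIVELY. It also ignores the MCUboot image trailer at the end of the slot, which has to be erased before the pending flag can be written.

```c
#include <stdint.h>

#define FLASH_PAGE_SIZE 4096u  /* nRF52840 flash page size */

/* Hypothetical erase-as-you-go state for the upload slot. */
struct prog_erase {
    uint32_t next_unerased;  /* slot offset of the first page not yet erased */
    uint32_t erases;         /* pages erased so far */
};

/* Before writing [off, off + len), erase every page the write reaches that
 * has not been erased yet. Writes must arrive in increasing order, as they
 * do in the DFU loop. */
static void prog_erase_before_write(struct prog_erase *st, uint32_t off, uint32_t len)
{
    uint32_t end = off + len;  /* one past the last byte written */

    while (st->next_unerased < end) {
        /* A real implementation would erase the page at st->next_unerased here. */
        st->erases++;
        st->next_unerased += FLASH_PAGE_SIZE;
    }
}

/* Pages erased when the whole slot is wiped before the transfer. */
static uint32_t full_erase_pages(uint32_t slot_size)
{
    return slot_size / FLASH_PAGE_SIZE;
}
```

For example, a 10,000-byte image written in 512-byte blocks touches 3 pages, whereas erasing the whole 0x74000-byte slot up front erases 116.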


    I added a child_image folder, copied MCUboot's prj.conf file into a mcuboot.conf file there, and added CONFIG_BOOT_SWAP_USING_MOVE=y to it.

    I understand that this should drastically increase the number of possible upgrades.

    The sensor has only a primary and a secondary slot, so I see no reason not to do this.

      flash_primary (0x100000 - 1024kB):
    +-------------------------------------------------+
    | 0x0: mcuboot (0xc000 - 48kB)                    |
    +---0xc000: mcuboot_primary (0x74000 - 464kB)-----+
    | 0xc000: mcuboot_pad (0x200 - 512B)              |
    +---0xc200: mcuboot_primary_app (0x73e00 - 463kB)-+
    | 0xc200: app (0x73e00 - 463kB)                   |
    +-------------------------------------------------+
    | 0x80000: mcuboot_secondary (0x74000 - 464kB)    |
    | 0xf4000: littlefs_storage (0x6000 - 24kB)       |
    | 0xfa000: nvs_storage (0x6000 - 24kB)            |
    +-------------------------------------------------+
    
      sram_primary (0x40000 - 256kB):
    +--------------------------------------------+
    | 0x20000000: sram_primary (0x40000 - 256kB) |
    +--------------------------------------------+


    I performed a FOTA DFU on two of my nRF52840 sensors. One failed at the last block written, or when calling img_mgmt_read_info() and img_mgmt_impl_write_pending(). The other one almost always works.

    I connected each sensor to my DK and ran nrfjprog --readcode 1.hex for the first (failed) sensor and nrfjprog --readcode 2.hex for the good sensor. I compared the files, and they are different.

    2548.hex

    2.hex

    Thanks and happy holidays.

  • Hi,

    I will answer some of your questions today so you have something to work with, and I will come back tomorrow on the items below that need more verification.

    DavidKaplan said:
    After reading the MCUboot documentation you linked, I understand that swap-using-scratch is the default and that it may leave us with only 256 upgrades.

    Yes, that's correct for the case where you're performing DFU using scratch. However, the default is swap using move if the SoC family is nRF. I am not 100% sure that is the case in your application if you've created it from scratch, so we'll have to look closer into this. Could you:

    1. Share your configuration files (prj.conf and overlay files)?
    2. Navigate to <your app>/build/mcuboot/zephyr/.config and check what is enabled? As an example, the sample I've chosen uses move and not scratch by default.

    For reference, this .config is taken from this sample: https://github.com/hellesvik-nordic/samples_for_nrf_connect_sdk/tree/main/bootloader_samples/nrf5340/mcuboot_smp_ble_simultaneous

    Swapping using move should not cause the same issues as with scratch, so I might've been too fast in suggesting that the issue lies there. However, your configuration files should reveal if scratch or move is being used.

    DavidKaplan said:
    2)  I erase the slot on the sensor before starting its file transfer and do not use CONFIG_IMG_ERASE_PROGRESSIVELY=y. Would it be better to use progressive erasures and not erase it at the start?

    I will have to verify this.

    DavidKaplan said:

    3) The default nRF52840 block size is 512 bytes. Would it help to increase it if possible?

    As far as I know, what is used when cycling the image is the sector size of 4 kB (8 blocks of 512 bytes, or 1 page). I believe it should be possible to increase this in multiples of the page size, but I need to look closer into whether any configs exist for it, or whether the 52840 has any limitations on the maximum slot sector size. As the MCUboot documentation states: "Usual minimum size of 4k.."

    Kind regards,
    Andreas
