This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

FOTA Hangs

I’m using the latest version of the modem 1.2.0 and sdk v1.2.0 

For the most part AWS FOTA seems to be functioning.

However it will start the FOTA job and hang and fail to proceed until the device is rebooted, after which the originally initiated FOTA job will start and finish as expected. 

The following log shows what is happening and where the reboot is.

I'm unsure if this issue is related to the TLS cache  problem referred to in many other posts whereby a second attempt at communicating over a socket will fail.

Just eneed some guidance and help if possible please.

[00:05:02.673,980] [1B][0m<dbg> aws_fota.on_publish_evt: Received topic: $aws/things/ogw-352656100322588/jobs/notify-next[1B][0m

[00:05:02.684,936] [1B][0m<inf> aws_fota: Checking for an available job[1B][0m

[00:05:02.691,955] [1B][0m<dbg> aws_fota.get_job_execution: Job ID: 795[1B][0m

[00:05:02.698,516] [1B][0m<dbg> aws_fota.get_job_execution: hostname: ozark-iot-fota.s3-ap-southeast-2.amazonaws.com[1B][0m

[00:05:02.709,045] [1B][0m<dbg> aws_fota.get_job_execution: file_path app_update.bin[1B][0m

[00:05:02.716,735] [1B][0m<dbg> aws_fota.get_job_execution: execution_version_number: 1 [1B][0m

[00:05:02.724,853] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/795/update/#[1B][0m

[00:05:03.989,715] [1B][0m<dbg> aws_fota.on_publish_evt: Received topic: $aws/things/ogw-352656100322588/jobs/notify-next[1B][0m

[00:05:04.000,671] [1B][0m<inf> aws_fota: Checking for an available job[1B][0m

[00:05:04.007,720] [1B][0m<dbg> aws_fota.get_job_execution: Job ID: 795[1B][0m

[00:05:04.014,282] [1B][0m<dbg> aws_fota.get_job_execution: hostname: ozark-iot-fota.s3-ap-southeast-2.amazonaws.com[1B][0m

[00:05:04.024,780] [1B][0m<dbg> aws_fota.get_job_execution: file_path app_update.bin[1B][0m

[00:05:04.032,501] [1B][0m<dbg> aws_fota.get_job_execution: execution_version_number: 1 [1B][0m

[00:05:04.040,618] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/795/update/#[1B][0m

[00:05:04.483,184] [1B][0m<dbg> aws_fota.on_publish_evt: Received topic: $aws/things/ogw-352656100322588/jobs/notify-next[1B][0m

[00:05:04.494,140] [1B][0m<inf> aws_fota: Checking for an available job[1B][0m

[00:05:04.501,190] [1B][0m<dbg> aws_fota.get_job_execution: Job ID: 795[1B][0m

[00:05:04.507,751] [1B][0m<dbg> aws_fota.get_job_execution: hostname: ozark-iot-fota.s3-ap-southeast-2.amazonaws.com[1B][0m

[00:05:04.518,249] [1B][0m<dbg> aws_fota.get_job_execution: file_path app_update.bin[1B][0m

[00:05:04.525,939] [1B][0m<dbg> aws_fota.get_job_execution: execution_version_number: 1 [1B][0m

[00:05:04.534,088] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/795/update/#[1B][0m

[00:05:04.957,000] [1B][0m<dbg> aws_fota.update_job_execution: update_job_execution, state: 1, status: 1, version_number: 1[1B][0m

[00:05:04.968,322] [1B][0m<inf> aws_jobs: Publish topic: $aws/things/ogw-352656100322588/jobs/795/update[1B][0m

[00:05:04.977,783] [1B][0m<inf> aws_jobs: Publish payload {"status":"IN_PROGRESS","statusDetails": {"nextState":"download_firmware","progress":0},"expectedVersion": "1","clientToken": ""}[1B][0m

 

 

Device powered off and rebooted here……,

 

[00]*** Booting Zephyr OS build v2.1.99-ncs1  ***

[00:00:00.004,211] [1B][0m<inf> mcuboot: Starting bootloader[1B][0m

[00:00:00.012,329] [1B][0m<inf> mcuboot: Primary image: magic=good, swap_type=0x2, copy_done=0x1, image_ok=0x3[1B][0m

[00:00:00.025,390] [1B][0m<inf> mcuboot: Boot source: none[1B][0m

[00:00:00.032,562] [1B][0m<inf> mcuboot: Swap type: revert[1B][0m

[00:00:00.432,189] [1B][0m<inf> mcuboot: Secondary image: magic=unset, swap_type=0x1, copy_done=0x3, image_ok=0x3[1B][0m

[00:00:22.045,471] [1B][0m<inf> mcuboot: Bootloader chainload address offset: 0xc000[1B][0m

[00:00:22.053,192] [1B][0m<inf> mcuboot: Jumping to the first image slot[1B][0m

*** Booting Zephyr OS build v2.1.99-ncs1  ***

Flash region                        Domain                 Permissions

00 0x00000 0x08000      Secure                  rwxl

01 0x08000 0x10000      Secure                  rwxl

02 0x10000 0x18000      Secure                  rwxl

03 0x18000 0x20000      Non-Secure         rwxl

04 0x20000 0x28000      Non-Secure         rwxl

05 0x28000 0x30000      Non-Secure         rwxl

06 0x30000 0x38000      Non-Secure         rwxl

07 0x38000 0x40000      Non-Secure         rwxl

08 0x40000 0x48000      Non-Secure         rwxl

09 0x48000 0x50000      Non-Secure         rwxl

10 0x50000 0x58000      Non-Secure         rwxl

11 0x58000 0x60000      Non-Secure         rwxl

12 0x60000 0x68000      Non-Secure         rwxl

13 0x68000 0x70000      Non-Secure         rwxl

14 0x70000 0x78000      Non-Secure         rwxl

15 0x78000 0x80000      Non-Secure         rwxl

16 0x80000 0x88000      Non-Secure         rwxl

17 0x88000 0x90000      Non-Secure         rwxl

18 0x90000 0x98000      Non-Secure         rwxl

19 0x98000 0xa0000      Non-Secure         rwxl

20 0xa0000 0xa8000      Non-Secure         rwxl

21 0xa8000 0xb0000      Non-Secure         rwxl

22 0xb0000 0xb8000      Non-Secure         rwxl

23 0xb8000 0xc0000       Non-Secure         rwxl

24 0xc0000 0xc8000       Non-Secure         rwxl

25 0xc8000 0xd0000       Non-Secure         rwxl

26 0xd0000 0xd8000      Non-Secure         rwxl

27 0xd8000 0xe0000      Non-Secure         rwxl

28 0xe0000 0xe8000      Non-Secure         rwxl

29 0xe8000 0xf0000        Non-Secure         rwxl

30 0xf0000 0xf8000         Non-Secure         rwxl

31 0xf8000 0x100000     Non-Secure         rwxl

Non-secure callable region 0 placed in flash region 2 with size 32.

 

 

SRAM region                      Domain                 Permissions

00 0x00000 0x02000       Secure                  rwxl

01 0x02000 0x04000       Secure                  rwxl

02 0x04000 0x06000       Secure                  rwxl

03 0x06000 0x08000       Secure                  rwxl

04 0x08000 0x0a000       Secure                  rwxl

05 0x0a000 0x0c000       Secure                  rwxl

06 0x0c000 0x0e000        Secure                  rwxl

07 0x0e000 0x10000       Secure                  rwxl

08 0x10000 0x12000       Non-Secure         rwxl

09 0x12000 0x14000       Non-Secure         rwxl

10 0x14000 0x16000       Non-Secure         rwxl

11 0x16000 0x18000       Non-Secure         rwxl

12 0x18000 0x1a000       Non-Secure         rwxl

13 0x1a000 0x1c000       Non-Secure         rwxl

14 0x1c000 0x1e000        Non-Secure         rwxl

15 0x1e000 0x20000       Non-Secure         rwxl

16 0x20000 0x22000       Non-Secure         rwxl

17 0x22000 0x24000       Non-Secure         rwxl

18 0x24000 0x26000       Non-Secure         rwxl

19 0x26000 0x28000       Non-Secure         rwxl

20 0x28000 0x2a000       Non-Secure         rwxl

21 0x2a000 0x2c000       Non-Secure         rwxl

22 0x2c000 0x2e000        Non-Secure         rwxl

23 0x2e000 0x30000       Non-Secure         rwxl

24 0x30000 0x32000       Non-Secure         rwxl

25 0x32000 0x34000       Non-Secure         rwxl

26 0x34000 0x36000       Non-Secure         rwxl

27 0x36000 0x38000       Non-Secure         rwxl

28 0x38000 0x3a000       Non-Secure         rwxl

29 0x3a000 0x3c000       Non-Secure         rwxl

30 0x3c000 0x3e000        Non-Secure         rwxl

31 0x3e000 0x40000       Non-Secure         rwxl

 

Peripheral                             Domain                 Status

00 NRF_P0               Non-Secure                 OK

01 NRF_CLOCK            Non-Secure          OK

02 NRF_RTC0             Non-Secure             OK

03 NRF_RTC1             Non-Secure             OK

04 NRF_NVMC             Non-Secure           OK

05 NRF_UARTE1           Non-Secure         OK

06 NRF_UARTE2           Non-Secure         OK

07 NRF_TWIM2            Non-Secure           OK

08 NRF_SPIM3            Non-Secure            OK

09 NRF_TIMER0           Non-Secure          OK

10 NRF_TIMER1           Non-Secure          OK

11 NRF_TIMER2           Non-Secure          OK

12 NRF_SAADC            Non-Secure          OK

13 NRF_PWM0             Non-Secure           OK

14 NRF_PWM1             Non-Secure           OK

15 NRF_PWM2             Non-Secure           OK

16 NRF_PWM3             Non-Secure           OK

17 NRF_WDT              Non-Secure             OK

18 NRF_IPC              Non-Secure                OK

19 NRF_VMC              Non-Secure             OK

20 NRF_FPU              Non-Secure              OK

21 NRF_EGU1             Non-Secure            OK

22 NRF_EGU2             Non-Secure            OK

23 NRF_DPPIC            Non-Secure            OK

24 NRF_GPIOTE1          Non-Secure         OK

25 NRF_REGULATORS       Non-Secure                 OK

 

SPM: NS image at 0x18200

SPM: NS MSP at 0x2002e520

SPM: NS reset vector at 0x1e1cd

SPM: prepare to jump to Non-Secure image.

*** Booting Zephyr OS build v2.1.99-ncs1  ***

Application started:9004

GPS:

client_id = ogw-352656100322588

nRF Cloud library initialized. 0

Establishing LTE link (this may take some time) ...

[00:00:01.764,984] [1B][0m<inf> lte_lc: PDP Context: AT+CGDCONT=0,"IP","m2m"[1B][0m

[00:00:01.772,033] [1B][0m<dbg> lte_lc.w_lte_lc_connect: Network mode: AT%XSYSTEMMODE=1,0,1,0[1B][0m

+CEREG: 2,"46C9","0006906F",7,0,0,"11100000","11100000"

[00:00:02.799,194] [1B][0m<dbg> lte_lc.parse_nw_reg_status: Network registration status: 2[1B][0m

+CEREG: 1,"46C9","0006906F",7,,,"00000001","10000100"

[00:00:04.981,964] [1B][0m<dbg> lte_lc.parse_nw_reg_status: Network registration status: 1[1B][0m

Connected to CAT M1 now:, err code: 0

Initialising GPS...

GPS initialized

AWS connection (connected: 0, in progress: 0)

Not connected to AWS, trying to connect

[00:00:09.224,517] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/notify-next[1B][0m

[00:00:09.235,046] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/$next/get/#[1B][0m

OZARK_CLOUD_EVT_TRANSPORT_CONNECTED

PSM enabled

[00:00:09.414,459] [1B][0m<inf> aws_fota: subscribed to notify-next topic[1B][0m

[00:00:09.421,264] [1B][0m<inf> aws_jobs: Publish topic: $aws/things/ogw-352656100322588/jobs/$next/get[1B][0m

[00:00:09.430,664] [1B][0m<inf> aws_jobs: Publish payload {"clientToken": ""}[1B][0m

[00:00:09.651,367] [1B][0m<dbg> aws_fota.on_publish_evt: Received topic: $aws/things/ogw-352656100322588/jobs/$next/get/accepted[1B][0m

[00:00:09.662,933] [1B][0m<inf> aws_fota: Checking for an available job[1B][0m

[00:00:09.670,043] [1B][0m<dbg> aws_fota.get_job_execution: Job ID: 795[1B][0m

[00:00:09.676,605] [1B][0m<dbg> aws_fota.get_job_execution: hostname: ozark-iot-fota.s3-ap-southeast-2.amazonaws.com[1B][0m

[00:00:09.687,133] [1B][0m<dbg> aws_fota.get_job_execution: file_path app_update.bin[1B][0m

[00:00:09.694,824] [1B][0m<dbg> aws_fota.get_job_execution: execution_version_number: 2 [1B][0m

[00:00:09.703,033] [1B][0m<inf> aws_jobs: Subscribe: $aws/things/ogw-352656100322588/jobs/795/update/#[1B][0m

[00:00:10.195,739] [1B][0m<dbg> aws_fota.update_job_execution: update_job_execution, state: 1, status: 1, version_number: 2[1B][0m

[00:00:10.207,061] [1B][0m<inf> aws_jobs: Publish topic: $aws/things/ogw-352656100322588/jobs/795/update[1B][0m

Parents
  • After spending several days on this now, I’m a lot closer to understanding what may be happening with FOTA, the CONFIG_MQTT_KEEPALIVE default value is 60 seconds.

    In order for us to use PSM effectively we must be able to extend this value for as long as possible. 

    During the testing of our GPS software I set it to 930 seconds (prj.conf) , anything longer than that seemed to cause socket reliability to deteriorate. 

    However when I have implemented FOTA and left CONFIG_MQTT_KEEPALIVE=930 I experience the issue whereby FOTA will start after waking from PSM if there is a job available and then hang, I can tell by looking at the current draw that PSM is still functioning, but nothing more is seen on the log output.

    A subsequent reboot will allow FOTA to complete. 

    If I reprogram the device and allow either the default for CONFIG_MQTT_KEEPALIVE to be set or manually set CONFIG_MQTT_KEEPALIVE=60 in prj.conf 

    FOTA seems to function properly.

    I would be very grateful for any advice you may be able to offer please.

  • Hi guys, 

    Don't know if my issue is related to this, as my CONFIG_MQTT_KEEPALIVE=1200 by default and I haven't changed it. But I'm also on SDK 1.2.0 and have seen the example asset_tracker hang when GPS is active or after GPS has been active (with modem firmware v1.1.1 and now also with v1.2.0). I'm connected to a NB-IOT network.

    I've noticed that:

    • While the modem is searching for satellites for about 360s, no firmware update requests are received (shouldn't LTE have priority here)?
    • Sometimes, upgrade requests are received and handled correctly, but only after the modem stops searching for sats
    • Other times, the app simply hangs after searching for sats, right after displaying "Enabling PSM" and never recovers. Shouldn't the watchdog reboot the app after some time? A manual reboot using the reset button does cause the app to receive and handle the firmware update later as requested.
    • Also, when the app does not hang, I can only upgrade the app image and not mcuboot even though I also have the immutable bootloader. The MCUBOOT images (s0 or s1) are downloaded by the app and the system reboots, then MCUBOOT tests the image, but upon a manual reset the MCUBOOT image is reverted and the MCUBOOT is not upgraded. (don't know if this is related to the app or MCUBOOT)
  • Sure, but it in no way helps fix the problem.

    I may be missing something but in order to implement that as a solution we would have to utilise persistent storage to notice that we needed to perform a reboot, after a reboot.

  • Hello, I am sorry for the delay in this ticket, I've been quite busy lately. I will prioritize it today and the upcoming working days.

    Best regards,

    Simon

  • Thank you very much Simon, very much appreciated.

  • I notice that your device doesn't receive the publish event "Job Id update accepted topic", and I am not sure why this is. I will ask internally about this and get back to you.

    Best regards,

    Simon

  • I talked to one of the developers and got some tips.

    First, could you change the way you are logging in order to get more information (it seems like you're using minimal)? If you use deferred logging with larger buffers you may get some more information.

    In addition, MQTT_KEEPALIVE is determined by the server as well, so you might want to look at that side if the session is shut down when it exceeds 60 seconds.

    Best regards,

    Simon

Reply
  • I talked to one of the developers and got some tips.

    First, could you change the way you are logging in order to get more information (it seems like you're using minimal)? If you use deferred logging with larger buffers you may get some more information.

    In addition, MQTT_KEEPALIVE is determined by the server as well, so you might want to look at that side if the session is shut down when it exceeds 60 seconds.

    Best regards,

    Simon

Children
  • Yes I probably agree.

    However, may I ask if you run the version of fota that exists in Master branch if you have the same experience that I'm having?

    I'm expecting that you will.

  • I will test it out, I have to set CONFIG_MQTT_KEEPALIVE=930 in addition I assume?

  • I am not able to reproduce your behavior with the aws_fota example in the master branch (commit 8f44fbc625f1ad).

    I set CONFIG_USE_CLOUD_CLIENT_ID=y in addition to CONFIG_CLOUD_CLIENT_ID, CONFIG_CLOUD_CERT_SEC_TAG and CONFIG_MQTT_BROKER_HOSTNAME to appropriate values. 

    I added a job and the sample successfully started a download immediately after:

    I: No queued jobs for this device
    I: Checking for an available job
    I: Subscribe: $aws/things/simon_client_id/jobs/fun_and_cool_job/update/#
    I: Subscribed to FOTA update topic $aws/things/simon_client_id/jobs/fun_and_cool_job/update/#
    I: Publish topic: $aws/things/simon_client_id/jobs/fun_and_cool_job/update
    I: Publish payload {"status":"IN_PROGRESS","statusDetails": null,"expectedVersion": "1","clientToken": ""}
    [mqtt_evt_handler:228] PUBACK packet id: 9062
    I: Start downloading firmware from simon-simon123.s3-us-west-2.amazonaws.com/app_update.bin
    I: Attempting to connect over IPv4
    I: Connected to simon-simon123.s3-us-west-2.amazonaws.com
    I: Downloading: app_update.bin [0]
    AWS_FOTA_EVT_START, job id = fun_and_cool_job
    I: Downloaded 4096/179380 bytes (2%)
    I: 2 Sectors of 4096 bytes
    I: alloc wra: 0, d18
    I: data wra: 0, 180
    I: Erasing sector at offset 0x00087000
    I: Erasing sector at offset 0x00088000
    I: Downloaded 8192/179380 bytes (4%)
    

    I also tried to set CONFIG_MQTT_KEEPALIVE=930, but the download was still successful. I'm not able to see the behavior you're explaining.

    Could you provide me the exact step to reproduce your behavior?

    Best regards,

    Simon

  • Thank you Simon.

    Now try and repeat the process a few times, without powering off the target board please.

    I would expect you to eventually see what I am seeing.

  • I tried to repeat the process, and I did not see the behavior. I might have misunderstood what you meant. I did the following:

    • Modified the aws_fota example as described in my last answer, with CONFIG_MQTT_KEEPALIVE=930
    • Uploaded the sample to a bucket and programmed the same aws_fota example onto the nrf9160
    1. Then I uploaded a job which contained the job file pointing to the uploaded aws_fota example
    2. Saw that the aws_fota sample recognized the job and started downloading the image. After the DFU was finished it would reset the chip automatically and run the newly updated aws_fota sample
    3. repeat step 1-3

    I ended 1,2,3 step-loop above after about 5-6 iterations and did not see the behavior you explained. Have I misunderstood? Should I repeat the process more? Could you provide a detailed step-by-step method of reproducing it? 

    You said, "without powering off the target board please". Did that include not resetting it? However, it will do that automatically when the download is completed.

    Best regards,

    Simon

Related