SoftDevice Controller ASSERT: 53, 296

Hi!

Can you please help me identify what is causing this issue and/or provide more info about this:

[00:04:23.038,330] \033[1;31m<err> bt_sdc_hci_driver: SoftDevice Controller ASSERT: 53, 296\033[0m
[00:04:57.431,335] \033[1;31m<err> os: ***** HARD FAULT *****\033[0m
[00:04:57.431,335] \033[1;31m<err> os: Fault escalation (see below)\033[0m
[00:04:57.431,365] \033[1;31m<err> os: ARCH_EXCEPT with reason 3
\033[0m
[00:04:57.431,365] \033[1;31m<err> os: r0/a1: 0x00000003 r1/a2: 0x00000000 r2/a3: 0x0001981b\033[0m
[00:04:57.431,396] \033[1;31m<err> os: r3/a4: 0x00000000 r12/ip: 0x20000ab0 r14/lr: 0xffffffff\033[0m
[00:04:57.431,396] \033[1;31m<err> os: xpsr: 0x41000011\033[0m
[00:04:57.431,396] \033[1;31m<err> os: r4/v1: 0x20015c70 r5/v2: 0x0003370d r6/v3: 0x0000000a\033[0m
[00:04:57.431,427] \033[1;31m<err> os: r7/v4: 0x20015c70 r8/v5: 0x20001944 r9/v6: 0x2000e204\033[0m
[00:04:57.431,427] \033[1;31m<err> os: r10/v7: 0x200008bc r11/v8: 0x00000000 psp: 0x20015f40\033[0m
[00:04:57.431,457] \033[1;31m<err> os: EXC_RETURN: 0xfffffff1\033[0m
[00:04:57.431,457] \033[1;31m<err> os: Faulting instruction address (r15/pc): 0x00033744\033[0m
[00:04:57.431,488] \033[1;31m<err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0\033[0m
[00:04:57.431,518] \033[1;31m<err> os: Fault during interrupt handling
\033[0m
[00:04:57.431,549] \033[1;31m<err> os: Current thread: 0x2000ae30 (unknown)\033[0m
[00:05:09.972,747] \033[1;31m<err> fatal_error: Resetting system\033[0m

When this happens it is in the process of uploading a new fw thru mcumgr image group, upload command.

I haven't been able to fully verify this but I think this affected by updating to SDK 2.6.0 (from 2.4.0).
That code part is unchanged before and after the SDK update except that "zcbor_new_decode_state" had two new parameter which we have set to NULL and 0 (after comparing how some examples had changed between sdk versions).

BR,
Mårten

Parents

0 Einar Thorsrud over 1 year ago

Hi Mårten,

The assert is triggered by an "overstay" event in the SoftDevice controller scheduling related to BLE central operation. So something has prevetned that code from running on time. That something can be higher priority threads or interrupts. I cannot say anything more specific based on the assert, but perhaps disconnect any central links when initiating DFU?
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Maos over 1 year ago in reply to Einar Thorsrud

Hi!

Thanks for the reply.

Maybe wasn't that clear but the assert is happening on the central that is sending the image to a peripheral.
There is only one connection doing an update and possibly one more connection simultaneously (to different peripheral device) sending a smaller amount of data.

I have however managed to reproduce the issue with just a single peripheral.

Is there some "bluetooth" thread where the priority can be raised?

I guess this is deeper down and therefore not connected to what is done in the callback passed to bt_dfu_smp_command?

BR,
Mårten
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Einar Thorsrud over 1 year ago in reply to Maos

Hi Mårten,

I checked with our SoftDevice Controller team and they did not have any more specific ideas other than wondering if you (or a module you use) perhaps disables interrupts for too long (as in a critical section with irq_lock()/irq_unlock(). Could that be the case?

Br,

Einar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Maos over 1 year ago in reply to Einar Thorsrud

Hi!

Well we use the uart, nfc and bluetooth of the device and in the case of bluetooth scan for advertise data and if needed connect to the peripherial to either write a fairly small amount of data and/or do a OTA.
We only allow one connection doing OTA and one other in parallel.

I have also tried sdk version 2.5.3 and 2.6.0-rc1 and am not able to reproduce the issue on 2.5.3 but can reproduce it on 2.6.0-rc1.

I have tried to look over our code to see if there is some processing that is done when parsing the advertise data that should be deffered to later in some way, this does not seem to help the issue.

At the moment I am actually looking into the uart sending part.
The uart sending is done by copying the data to one buffer taken from a pool of buffers and put in a fifo that is read by a seperate thread that calls the uart tx function.

I did some test basically disabling the whole tx part (after some initial setup communication) and could not reproduce the issue.
At that test it looked like lowering the priority of the uart-tx-thread did not help.

I also noticed that by mistake the item put in the fifo was not (explicitly) word aligned, thou it probably would be because the second item in the struct after the intial (reserved) word was for some reason set to alignment 16.

So at the moment I am looking into moving the "disabling" of tx from the entry point to the thread to see how that affects reproducability and if setting the alignment correctly will make a difference, possibly in combination with lowering the tx thread prio.

This in one way feels a bit far fetched but this is the only change that seem to affect reproducability....

BR,
Mårten
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Einar Thorsrud over 1 year ago in reply to Maos

Hi Mårten,

It is good to hear you are making progress in narrowing down the issue. I do not have any suggestions from our side at the moment, but please let me know if you are able to narrow this down futher.

Br,

Einar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Maos over 1 year ago in reply to Einar Thorsrud

Hi

Seems the uart thing was false lead (maybe not so suprising), it might have probability of reproducing somewhat.

Also noticed I had commented out code that initialize nfc at the same time..

I did try to lower irq priority of both uart and nfc in device tree, couldn't find any clear information on this but hoped that a higher number meant lower priority but the assert was still triggered.

One detail I forgot to mention is that it onlly seem to happen if the peripheral is a bit away from the central, the rssi value reported by the SDK is somewhere betwen -85 and -95 when it happens.

Most of the time when this issue happens the connection has been lost once and then been able to reconnect again, and during this second connection this assert kicks in.

I think I have seen this on "first" connection but maybe it has been connected and done a "proper" disconnect but I missed to reset the test setup, so it is still on second connection.

I also noticed a place where the same connection maybe is disconnected twice (so disconnect returns -128).

I also tried with the 2.7.0-rc2 SDK and get a bit different ASSERT:
00> [00:00:38.858,306] [1;31m<err> bt_sdc_hci_driver: SoftDevice Controller ASSERT: 53, 345[0m
00> [00:00:38.858,337] [1;31m<err> os: ***** HARD FAULT *****
00> [00:00:38.858,337] [1;31m<err> os: Fault escalation (see below)[0m
00> [00:00:38.858,367] [1;31m<err> os: ARCH_EXCEPT with reason 3
00>
00> [00:00:38.858,367] [1;31m<err> os: r0/a1: 0x00000003 r1/a2: 0x00000000 r2/a3: 0x000e7241
00> [00:00:38.858,367] [1;31m<err> os: r3/a4: 0x00000003 r12/ip: 0x2001a238 r14/lr: 0xffffffff
00>
00> [00:00:38.858,398] [1;31m<err> os: xpsr: 0x01000011
00> [00:00:38.858,398] [1;31m<err> os: r4/v1: 0x20018168 r5/v2: 0x00033db9 r6/v3: 0x0000000a[0m
00>
00> [00:00:38.858,428] [1;31m<err> os: r7/v4: 0x20018168 r8/v5: 0x00000000 r9/v6: 0x00000000[0m
00> [00:00:38.858,428] [1;31m<err> os: r10/v7: 0x200009a0 r11/v8: 0x00000000 psp: 0x20018480
00> [00:00:38.858,459] [1;31m<err> os: EXC_RETURN: 0xfffffff1[0m
00> [00:00:38.858,459] [1;31m<err> os: Faulting instruction address (r15/pc): 0x00033df4[0m
00> [00:00:38.858,489] [1;31m<err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0[0m
00> [00:00:38.858,489] [1;31m<err> os: Fault during interrupt handling
00>
00>
00> [00:00:38.858,551] [1;31m<err> os: Current thread: 0x2000bd80 (unknown)[0m
00> [00:00:51.399,688] [1;31m<err> os: Halting system[0m

What can be interpreted from this one?

Also could the callback passed to bt_dfu_smp_command block the softdevice (assuming there is no explicit call to irq_lock or similar)?

BR,

Mårten
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Einar Thorsrud over 1 year ago in reply to Maos

Hi Mårten,

I am checking internally to see what we can understand with this information, but I do not have any updates yet unforunately.

Is it possible for you to share a proejct that can reproduce this on a DK so that we can test on our side?

Br,

Einar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Reply

0 Einar Thorsrud over 1 year ago in reply to Maos

Hi Mårten,

I am checking internally to see what we can understand with this information, but I do not have any updates yet unforunately.

Is it possible for you to share a proejct that can reproduce this on a DK so that we can test on our side?

Br,

Einar
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Children

0 Maos over 1 year ago in reply to Einar Thorsrud

Hi!

Great any suggestion or advice is appreciated.

I was actually thinking about asking if that was a possibility....
And am working on such a project (also as a way to try to reduce factors affecting it or eliminating mistakes on our side) but not quite finished, I will update here when it is ready.

BR,

Mårten
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Maos over 1 year ago in reply to Maos

Hi!

I have created a reproducer fw, so this should be run as a central.
As I have removed all functionality that gets the fw to be sent during update I have made this compile the fw top transfer as ro-data.

Two things needs local adaption the mac address of the peripherial to connect and update.
And the path to the app_uppdate.bin file to send to the peripherial.
These are located in main.c line 102 and bin_handler.c line 19.

For the peripheral any client with support for dfu-smp and advertising in a compatible manner (coded phy) should be possible to use (and to fit in the central fw it should not be bigger than around 250k),

I have also attached some logs showing the output I get at two different locations running this fw, as I have had some problems recreate the issue in of the locations but as you can see from the logs I now can reproduce it in both locations.

BR,
Mårten

nrf-reproducer.zip

rttlog-office.log

rttlog-home.log
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Edvin over 1 year ago in reply to Maos

Hello Mårten,

Einar is out of office for a few weeks to come, so I will have a look at your ticket as soon as possible, hopefully on Monday. Please understand that we are a bit short staffed due to summer holidays.

Best regards,

Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Maos over 1 year ago in reply to Edvin

Hi!

Great!

Yes, I understand, I will myself be on vacation from the 18th.

BR,
Mårten
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 Edvin over 1 year ago in reply to Maos

Hello Mårten,

Let me get this straight. This is the central application that is supposed to upload new FW to a peripheral, right? So I would need a peripheral device with an MCUBOOT bootloader running with this? Does the standard smp_svr sample work? Or does it need to be something special?

BR,
Edvin
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel