[Bug report] sys_csrand_get crashes after being called exactly 10000 times on nRF9160

The title says it all. Code to reproduce:

#include <zephyr/random/random.h>

printf("Hello!\n");
for (int i = 1; i <= 1000000; i++) {
	if (i > 9990) {
		printf("%d\n", i);
	}
	uint32_t buf[1] = {0};
	sys_csrand_get(buf, sizeof(buf));
}

Put this at the very first thing in main() in e.g. the cellular/udp sample (configured to use nrf9160dk/nrf9160/ns with the latest nrf connect sdk v3.2.4) and run. At the 10000th iteration, the code will crash. The buffer size per iteration does not seem to have any impact on the number of iterations before crashing. My suspicion is that the implementation fetches more entropy or something each 10000 iterations, causing a longer call chain. The cc310 crypto library seems to be closed source, so it's not possible to debug.

The bug is triggered when using optimization setting "Use project default" or "Optimize for debugging (-Og)", but not with "Optimize for size (-Os)" or "Optimize for speed (-O2)".

It seems the ns_agent_tz_stack is made too small, and the issue is actually a stack overflow that happens inside the secure tfm (while running in thread mode, i.e. using the psp_s stack pointer). But if the code is optimized, I guess it optimizes the stack usage, preventing the issue.

After changing the PLATFORM_SP_STACK_SIZE define from 0x500 to e.g. 0x700, the issue doesn't happen anymore. Note that the size of this stack seems to be hardcoded in the SDK code and it is not meant to be modified by users if I understand correctly. Therefore it should be seen as a bug in the SDK that needs to be fixed.

It is problematic that it happens after the 10000th time the sys_csrand_get function is called, as that might hide the existence of the bug if you send out units in production and they suddenly crash after a while.

Parents
  • Hi,

    Thank you for the report. I reproduce the same on my end using the cellular/udp sample in SDK 3.2.4 and nRF9160 with your code snippet, and a few changes. I need to build with debug to see this on my end, though. Diff for ref:

    diff --git a/samples/bluetooth/peripheral_gatt_dm/prj.conf b/samples/bluetooth/peripheral_gatt_dm/prj.conf
    index 174544f767..a7425890bf 100644
    --- a/samples/bluetooth/peripheral_gatt_dm/prj.conf
    +++ b/samples/bluetooth/peripheral_gatt_dm/prj.conf
    @@ -20,3 +20,5 @@ CONFIG_LOG=y
     CONFIG_BT_GATT_DM_DATA_PRINT=y
     
     CONFIG_DK_LIBRARY=y
    +
    +
    diff --git a/samples/bluetooth/peripheral_gatt_dm/src/main.c b/samples/bluetooth/peripheral_gatt_dm/src/main.c
    index 8664b8265d..c5267242ed 100644
    --- a/samples/bluetooth/peripheral_gatt_dm/src/main.c
    +++ b/samples/bluetooth/peripheral_gatt_dm/src/main.c
    @@ -185,6 +185,7 @@ int main(void)
     	int err;
     
     	printk("Starting GATT Discovery Manager sample\n");
    +	printk("BT_GATT_DM_WORKQ_STACK_SIZE: %d\n", CONFIG_BT_GATT_DM_WORKQ_STACK_SIZE);
     
     	err = bt_enable(NULL);
     	if (err) {
    diff --git a/samples/cellular/udp/prj.conf b/samples/cellular/udp/prj.conf
    index 62ca562a16..f55dc10546 100644
    --- a/samples/cellular/udp/prj.conf
    +++ b/samples/cellular/udp/prj.conf
    @@ -7,7 +7,7 @@
     # General config
     CONFIG_PICOLIBC_IO_FLOAT=y
     CONFIG_NCS_SAMPLES_DEFAULTS=y
    -CONFIG_SERIAL=n
    +CONFIG_SERIAL=y
     
     # Network
     CONFIG_NETWORKING=y
    @@ -43,3 +43,5 @@ CONFIG_UDP_RAI_ENABLE=y
     CONFIG_LTE_LC_EDRX_MODULE=y
     CONFIG_LTE_LC_PSM_MODULE=y
     CONFIG_LTE_LC_RAI_MODULE=y
    +
    +#CONFIG_TFM_PLATFORM_SP_STACK_SIZE=0x700
    diff --git a/samples/cellular/udp/src/main.c b/samples/cellular/udp/src/main.c
    index ab78c31f0b..27eeb64b37 100644
    --- a/samples/cellular/udp/src/main.c
    +++ b/samples/cellular/udp/src/main.c
    @@ -10,6 +10,7 @@
     #include <modem/lte_lc.h>
     #include <modem/nrf_modem_lib.h>
     #include <nrf_modem_at.h>
    +#include <zephyr/random/random.h>
     
     #define UDP_IP_HEADER_SIZE 28
     
    @@ -168,6 +169,15 @@ int main(void)
     {
     	int err;
     
    +	printk("Hello!\n");
    +	for (int i = 1; i <= 1000000; i++) { // 1000000
    +		if (i > 9990) {
    +			printk("%d\n", i);
    +		}
    +		uint32_t buf[1] = {0};
    +		sys_csrand_get(buf, sizeof(buf));
    +	}
    +
     	printk("UDP sample has started\n");
     
     	work_init();
    

    I have forwarded this internally and we will look into how we can address this.

Reply
  • Hi,

    Thank you for the report. I reproduce the same on my end using the cellular/udp sample in SDK 3.2.4 and nRF9160 with your code snippet, and a few changes. I need to build with debug to see this on my end, though. Diff for ref:

    diff --git a/samples/bluetooth/peripheral_gatt_dm/prj.conf b/samples/bluetooth/peripheral_gatt_dm/prj.conf
    index 174544f767..a7425890bf 100644
    --- a/samples/bluetooth/peripheral_gatt_dm/prj.conf
    +++ b/samples/bluetooth/peripheral_gatt_dm/prj.conf
    @@ -20,3 +20,5 @@ CONFIG_LOG=y
     CONFIG_BT_GATT_DM_DATA_PRINT=y
     
     CONFIG_DK_LIBRARY=y
    +
    +
    diff --git a/samples/bluetooth/peripheral_gatt_dm/src/main.c b/samples/bluetooth/peripheral_gatt_dm/src/main.c
    index 8664b8265d..c5267242ed 100644
    --- a/samples/bluetooth/peripheral_gatt_dm/src/main.c
    +++ b/samples/bluetooth/peripheral_gatt_dm/src/main.c
    @@ -185,6 +185,7 @@ int main(void)
     	int err;
     
     	printk("Starting GATT Discovery Manager sample\n");
    +	printk("BT_GATT_DM_WORKQ_STACK_SIZE: %d\n", CONFIG_BT_GATT_DM_WORKQ_STACK_SIZE);
     
     	err = bt_enable(NULL);
     	if (err) {
    diff --git a/samples/cellular/udp/prj.conf b/samples/cellular/udp/prj.conf
    index 62ca562a16..f55dc10546 100644
    --- a/samples/cellular/udp/prj.conf
    +++ b/samples/cellular/udp/prj.conf
    @@ -7,7 +7,7 @@
     # General config
     CONFIG_PICOLIBC_IO_FLOAT=y
     CONFIG_NCS_SAMPLES_DEFAULTS=y
    -CONFIG_SERIAL=n
    +CONFIG_SERIAL=y
     
     # Network
     CONFIG_NETWORKING=y
    @@ -43,3 +43,5 @@ CONFIG_UDP_RAI_ENABLE=y
     CONFIG_LTE_LC_EDRX_MODULE=y
     CONFIG_LTE_LC_PSM_MODULE=y
     CONFIG_LTE_LC_RAI_MODULE=y
    +
    +#CONFIG_TFM_PLATFORM_SP_STACK_SIZE=0x700
    diff --git a/samples/cellular/udp/src/main.c b/samples/cellular/udp/src/main.c
    index ab78c31f0b..27eeb64b37 100644
    --- a/samples/cellular/udp/src/main.c
    +++ b/samples/cellular/udp/src/main.c
    @@ -10,6 +10,7 @@
     #include <modem/lte_lc.h>
     #include <modem/nrf_modem_lib.h>
     #include <nrf_modem_at.h>
    +#include <zephyr/random/random.h>
     
     #define UDP_IP_HEADER_SIZE 28
     
    @@ -168,6 +169,15 @@ int main(void)
     {
     	int err;
     
    +	printk("Hello!\n");
    +	for (int i = 1; i <= 1000000; i++) { // 1000000
    +		if (i > 9990) {
    +			printk("%d\n", i);
    +		}
    +		uint32_t buf[1] = {0};
    +		sys_csrand_get(buf, sizeof(buf));
    +	}
    +
     	printk("UDP sample has started\n");
     
     	work_init();
    

    I have forwarded this internally and we will look into how we can address this.

Children
No Data
Related