The at_parse_process_element function in at_cmd_parser.c doesn't handle a k_malloc failure properly, which leads to random and unexpected network disconnects

I spent quite some time trying to understand why I was getting random network disconnects right after a perfectly normal +CEREG notification, as in this log:

[00:33:20.523,223] <dbg> lte_lc: at_handler_cscon: +CSCON notification
[00:33:20.523,376] <dbg> lte_lc: event_handler_list_dispatch: Dispatching event: type=3
[00:33:20.523,406] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x0004BDB9
[00:33:20.523,437] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x000406BD
[00:33:20.523,437] <dbg> lte_lc: event_handler_list_dispatch: Done
[00:33:20.931,365] <dbg> lte_lc: at_handler_cereg: +CEREG notification: +CEREG: 5,"36EC","010B902B",7,,,"11100000","11100000"
[00:33:20.931,640] <dbg> lte_lc: parse_cereg: Network registration status: 5
[00:33:20.931,671] <dbg> lte_lc: parse_cereg: LTE mode: 7
[00:33:20.931,671] <dbg> lte_lc: parse_cereg: Active time not found, error: -22
[00:33:20.931,701] <dbg> lte_lc: parse_cereg: TAU not found, error: -22
[00:33:20.931,823] <dbg> lte_lc: event_handler_list_dispatch: Dispatching event: type=0
[00:33:20.931,823] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x0004BDB9
[00:33:20.931,945] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x000406BD
[00:33:20.931,945] <dbg> lte_lc: event_handler_list_dispatch: Done
[00:33:20.931,976] <dbg> lte_lc: event_handler_list_dispatch: Dispatching event: type=4
[00:33:20.932,006] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x0004BDB9
[00:33:20.932,006] <dbg> lte_lc: event_handler_list_dispatch: - handler=0x000406BD
[00:33:20.932,037] <dbg> lte_lc: event_handler_list_dispatch: Done
[00:33:20.932,342] <inf> network: Network connectivity lost
[00:33:20.932,586] <dbg> mqtt_helper: mqtt_state_set: State transition: MQTT_STATE_CONNECTED --> MQTT_STATE_DISCONNECTING
[00:33:20.932,739] <err> led: Network disconnected

It turns out that inside the at_parse_process_element function, the error code returned by an at_params_string_put call (caused by a failed k_malloc) is left unprocessed. This results in incorrect parsing of the +CEREG notification and, consequently, weird firmware issues (in my case: spurious and irrecoverable network disconnects at every cell ID switch). Unfortunately, I haven't found a way to submit an issue in the GitHub repo for the nRF Connect SDK, so I decided to report it here. I'd suggest that Nordic improve the error handling so that the failure is at least reported in the log. The fix on my side was simply to increase the heap size via the CONFIG_HEAP_MEM_POOL_SIZE config option. Hopefully this will help somebody avoid similar problems.
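
To illustrate the failure mode, here is a minimal, self-contained sketch in plain C. It is not the actual at_cmd_parser.c code: the names param_string_put, process_element_unchecked, process_element_checked, demo_alloc and simulate_oom are hypothetical and only stand in for the pattern described above, where a string-storing helper can fail with -ENOMEM and the caller either drops or propagates that error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one parsed AT parameter. */
struct param {
    char *str;   /* NULL until a string has been stored */
    size_t len;
};

/* Demo hook (assumption): when non-zero, every allocation fails,
 * imitating an exhausted heap. */
static int simulate_oom;

static void *demo_alloc(size_t size)
{
    if (simulate_oom) {
        return NULL;
    }
    return malloc(size);
}

/* Stores a copy of str in p.
 * Returns 0 on success, -ENOMEM if the allocation fails. */
static int param_string_put(struct param *p, const char *str, size_t len)
{
    assert(p != NULL && str != NULL);

    char *copy = demo_alloc(len + 1);
    if (copy == NULL) {
        return -ENOMEM;
    }
    memcpy(copy, str, len);
    copy[len] = '\0';
    free(p->str);   /* release any previously stored string */
    p->str = copy;
    p->len = len;
    return 0;
}

/* Problem pattern: the return value is dropped, so the caller reports
 * success even though nothing was stored. */
static int process_element_unchecked(struct param *p, const char *tok)
{
    (void)param_string_put(p, tok, strlen(tok));
    return 0;
}

/* Suggested pattern: log the failure and propagate the error code. */
static int process_element_checked(struct param *p, const char *tok)
{
    int err = param_string_put(p, tok, strlen(tok));
    if (err) {
        fprintf(stderr, "Failed to store parameter \"%s\", error: %d\n",
                tok, err);
        return err;
    }
    return 0;
}
```

With simulate_oom set, process_element_unchecked returns 0 while leaving the parameter empty, so later code would read a missing field as if parsing had succeeded. process_element_checked returns -ENOMEM and prints a message, which would have pointed directly at the heap size.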
