nrf9160: Modem error: 0x4, PC: 0x7ef22

Hi Nordic Team,

long story short: I get a

[10:04:59.928,985] <err> nrf_modem: Modem error: 0x4, PC: 0x7ef22

and after that the modem does not allow me to open a socket anymore.

ncs v2.1.2
modem fw: nrf9160_1.3.3 (on SICA B1)

Any ideas on how to prevent this?
Any ideas on how I could detect this and act on it, so that I can recover and send data again?


The long story:

Modem is running in LTE-M + GPS.
Modem is in PSM (no eDRX).

The application (test setup) sends data via UDP every 5 minutes (thus waking the modem up from PSM).
Each time, the application opens a new socket, sends one or more UDP packets, listens a few seconds for
an answer and closes the socket afterwards.
The SIM card allows roaming and I have seen the modem changing cells and networks from time to time
(not sure if it happened during this power-up, but I can look it up if it helps).
Single GPS fixes are normally done while the modem is in PSM.

[10:04:59.832,763] <dbg> llp: llp_transmit: 1 measurement packets sent
[10:04:59.928,985] <err> nrf_modem: Modem error: 0x4, PC: 0x7ef22
[10:04:59.929,107] <err> llp: recvfrom error: 11
[10:05:06.034,271] <dbg> tbws: main: Updating timing (waited 5000 ms)
[10:05:06.034,332] <inf> tbws: Updating time
[10:05:06.046,447] <err> lte_lc: Could not get registration status, error: -1


- "Updating time" is triggering a date_time_update_async(). So the "lte_lc: Could not get registration status, error: -1" might come from there.
- recvfrom error is the errno variable (11: EAGAIN)

After that (5 minutes later, when trying to do the next upload) calling socket() gives:
[10:10:00.277,374] <err> llp: socket: 110

110 is the errno: ESHUTDOWN

After quite some more tries to send (can count them if it helps you):
[10:50:00.262,969] <err> llp: socket: 105

105 is the errno: ENOBUFS


The modem does not seem to recover.

What would be the appropriate way to handle this?
Shutting down the modem and re-registering after getting an ESHUTDOWN?
Resetting the system (I'd like to avoid this as I would lose my saved data)?
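
To make the two options above concrete, here is a minimal sketch of an escalation policy: try a modem re-init first and fall back to a system reset only after a few re-inits have not helped. Everything in it is hypothetical (the names `recovery_state`, `recovery_on_socket_error`, the retry budget) and it is plain C with no SDK calls. Treating ENOBUFS like ESHUTDOWN is an assumption based only on the log above, and whether a re-init actually recovers the modem after this particular fault is not confirmed here. The errno numbers are the ones printed in the logs, not Linux values.

```c
/* errno values as printed in the logs above (assumed Zephyr numbering). */
#define LOG_ERRNO_ESHUTDOWN 110
#define LOG_ERRNO_ENOBUFS   105

enum recovery_action {
    RECOVERY_NONE,          /* not treated as modem-fatal, handle locally */
    RECOVERY_REINIT_MODEM,  /* shut the modem library down and bring it up again */
    RECOVERY_SYSTEM_RESET,  /* last resort after repeated failed re-inits */
};

struct recovery_state {
    int failed_reinits;     /* consecutive re-inits that did not help */
    int max_reinits;        /* escalate to a system reset after this many */
};

/* Decide what to do after socket() or another socket call failed with err. */
enum recovery_action recovery_on_socket_error(struct recovery_state *st, int err)
{
    if (err != LOG_ERRNO_ESHUTDOWN && err != LOG_ERRNO_ENOBUFS) {
        return RECOVERY_NONE;
    }
    if (st->failed_reinits >= st->max_reinits) {
        return RECOVERY_SYSTEM_RESET;
    }
    st->failed_reinits++;
    return RECOVERY_REINIT_MODEM;
}

/* Call after a successful upload so the escalation counter starts over. */
void recovery_on_success(struct recovery_state *st)
{
    st->failed_reinits = 0;
}
```

The caller would map RECOVERY_REINIT_MODEM to whatever modem shutdown/init sequence its SDK version provides, and could flush saved data to non-volatile storage before acting on RECOVERY_SYSTEM_RESET.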

What pops into my mind: recvfrom returning EAGAIN in this situation seems dangerous.
If I handled that case, I might call recvfrom again, maybe in a loop, and thus lock up.
My current implementation luckily just bails out on any error. But this might be a trap for others.
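
A minimal sketch of how the EAGAIN trap could be avoided: retry only a bounded number of times and then give up. The receive call is passed in as a function pointer so the example stays self-contained. `recv_fn`, `recv_bounded` and the stub are hypothetical names, and the stub only simulates a modem that answers EAGAIN forever, as in the log above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical receive hook: returns bytes received, or -1 with errno set. */
typedef int (*recv_fn)(void *buf, size_t len);

/*
 * Try to receive at most max_attempts times.
 * Returns the byte count on success. Returns -1 on a hard error (errno left
 * as set by the hook) or with errno = ETIMEDOUT when every attempt
 * reported EAGAIN.
 */
int recv_bounded(recv_fn recv_once, void *buf, size_t len, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        int n = recv_once(buf, len);
        if (n >= 0) {
            return n;
        }
        if (errno != EAGAIN) {
            return -1; /* hard error: do not retry */
        }
        /* EAGAIN: a real application would sleep or poll here before retrying. */
    }
    errno = ETIMEDOUT;
    return -1;
}

/* Stub simulating a faulted modem: every call fails with EAGAIN. */
int stub_calls;
int stub_always_eagain(void *buf, size_t len)
{
    (void)buf;
    (void)len;
    stub_calls++;
    errno = EAGAIN;
    return -1;
}
```

With the stub, `recv_bounded(stub_always_eagain, buf, sizeof(buf), 5)` stops after five attempts instead of spinning forever. A time-based deadline instead of an attempt count would work the same way.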

And: Sorry, no modem traces.

Regards,
Clemens

  • Hello again,

    Would it be possible to get a modem trace anyway?

    It looks like it is a blocking thing in that case.

    Best regards,

    Michal

  • Hi Michal,

    unfortunately I do not have a trace. It happened in a long-running experiment after one week, where I only logged my debug output. I then used a custom modem error handler that just resets the system. I have had 24 units running for a few weeks now (out in the field on batteries, so no trace collector attached) and I have seen three resets due to modem errors, but I do not know the error code or PC for them.

    I know it does not help you but may help others: For me adding resets (watchdog triggering when not receiving anything for time x, reset on modem error) finally made the thing "reliable". It is ugly but during development I came across so many (undocumented or documented at places you don't look at while dealing with the Zephyr API) reasons things could go wrong that I decided to go this way. Not the best for battery life, absolutely not my favorite style of programming but works.
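
The "reset when nothing was received for time x" idea can be sketched in software as below. This is only an illustration under assumptions: `rx_watchdog` and its functions are hypothetical names, timestamps are plain millisecond counters supplied by the caller, and an actual device might instead feed a hardware watchdog only when data arrives.

```c
#include <stdbool.h>
#include <stdint.h>

struct rx_watchdog {
    int64_t last_rx_ms;  /* time of the last successful receive */
    int64_t timeout_ms;  /* the "time x" after which a reset is requested */
};

/* Call whenever an answer from the server was received. */
void rx_watchdog_feed(struct rx_watchdog *wd, int64_t now_ms)
{
    wd->last_rx_ms = now_ms;
}

/* Returns true when nothing was received for at least timeout_ms. */
bool rx_watchdog_expired(const struct rx_watchdog *wd, int64_t now_ms)
{
    return (now_ms - wd->last_rx_ms) >= wd->timeout_ms;
}
```

The main loop would check `rx_watchdog_expired()` once per upload cycle and trigger the reset path when it returns true.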


    I come from a FreeRTOS world, where I tried to deal cleanly with the errors that could happen in the system, while avoiding complex scenarios (like dynamic memory allocation) and getting things as error-free as possible. The sheer amount of complexity that Zephyr adds makes this approach impossible. My personal opinion on that: for these small, low-power devices, with this small processor (with limited peripherals), Zephyr and all these layers are complete overkill and unmanageable in a clean way. Especially for the intended applications: small low-power devices (with low complexity in their task) that may run for years.

    Best regards,

    Clemens
