This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

nRF52840 NCP become leader and stop receiving from old network

Hi,

I am developing an application using the nRF52840 as NCP with a raspberry pi as the primary processor, using wpantund. I have 3 devices (3 raspberry, each with its own nRF52840) in the same Thread network, but at some point one of them stop receiving messages from the others (my application uses UDP, but it is irrelevant), and when I look at the log i see that it became leader, but without any other board outputting any message or the previous leader leaving it's status :

Device 1 (OK, original leader):

Jun 12 09:07:55 raspberrypi wpantund[416]: Starting wpantund 0.08.00d (May 14 2018 17:06:58) . . .
Jun 12 09:07:55 raspberrypi wpantund[416]: #011SOURCE_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:55 raspberrypi wpantund[416]: #011BUILD_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:55 raspberrypi wpantund[416]: Configuration file "/etc/wpantund.conf" read.
Jun 12 09:07:55 raspberrypi wpantund[416]: Ready. Using DBUS bus ":1.3"
Jun 12 09:07:55 raspberrypi wpantund[416]: Running as root without dropping privileges!
Jun 12 09:07:55 raspberrypi wpantund[416]: [-NCP-]: NCP was reset (STATUS_RESET_EXTERNAL, 113)
Jun 12 09:07:55 raspberrypi wpantund[416]: State change: "uninitialized" -> "offline"
Jun 12 09:07:55 raspberrypi wpantund[416]: NCP is running "OPENTHREAD/20170716-00532-gec59d7e-dirty; NRF52840; Apr  6 2018 23:55:28"
Jun 12 09:07:55 raspberrypi wpantund[416]: Driver is running "0.08.00d (0.07.01-209-ga88988a; May 14 2018 17:06:58)"
Jun 12 09:07:55 raspberrypi wpantund[416]: Network is not joinable
Jun 12 09:07:55 raspberrypi wpantund[416]: State change: "offline" -> "offline:commissioned"
Jun 12 09:07:55 raspberrypi wpantund[416]: Finished initializing NCP
Jun 12 09:07:55 raspberrypi wpantund[416]: AutoResume is enabled. Trying to resume.
Jun 12 09:07:55 raspberrypi wpantund[416]: NCP is commissioned. Resuming...
Jun 12 09:07:55 raspberrypi wpantund[416]: State change: "offline:commissioned" -> "associating"
Jun 12 09:07:59 raspberrypi wpantund[416]: State change: "associating" -> "associated"
Jun 12 09:07:59 raspberrypi wpantund[416]: Node type change: "unknown" -> "leader"

Device 2 (Faulty):

Jun 12 09:07:57 raspberrypi wpantund[423]: Starting wpantund 0.08.00d (May 14 2018 17:06:58) . . .
Jun 12 09:07:57 raspberrypi wpantund[423]: #011SOURCE_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:57 raspberrypi wpantund[423]: #011BUILD_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:57 raspberrypi wpantund[423]: Configuration file "/etc/wpantund.conf" read.
Jun 12 09:07:57 raspberrypi wpantund[423]: Ready. Using DBUS bus ":1.3"
Jun 12 09:07:57 raspberrypi wpantund[423]: Running as root without dropping privileges!
Jun 12 09:07:57 raspberrypi wpantund[423]: [-NCP-]: NCP was reset (STATUS_RESET_EXTERNAL, 113)
Jun 12 09:07:57 raspberrypi wpantund[423]: State change: "uninitialized" -> "offline"
Jun 12 09:07:57 raspberrypi wpantund[423]: NCP is running "OPENTHREAD/20170716-00532-gec59d7e-dirty; NRF52840; Apr  6 2018 23:55:28"
Jun 12 09:07:57 raspberrypi wpantund[423]: Driver is running "0.08.00d (0.07.01-209-ga88988a; May 14 2018 17:06:58)"
Jun 12 09:07:57 raspberrypi wpantund[423]: HARDWARE ADDRESS IS INVALID, MULTICAST BIT IS SET!
Jun 12 09:07:57 raspberrypi wpantund[423]: Network is not joinable
Jun 12 09:07:57 raspberrypi wpantund[423]: State change: "offline" -> "offline:commissioned"
Jun 12 09:07:57 raspberrypi wpantund[423]: Finished initializing NCP
Jun 12 09:07:57 raspberrypi wpantund[423]: AutoResume is enabled. Trying to resume.
Jun 12 09:07:57 raspberrypi wpantund[423]: NCP is commissioned. Resuming...
Jun 12 09:07:57 raspberrypi wpantund[423]: State change: "offline:commissioned" -> "associating"
Jun 12 09:08:02 raspberrypi wpantund[423]: State change: "associating" -> "associated"
Jun 12 09:08:02 raspberrypi wpantund[423]: Node type change: "unknown" -> "end-device"
Jun 12 09:08:03 raspberrypi wpantund[423]: Node type change: "end-device" -> "router"
 < 2 hours later >
Jun 12 11:07:01 raspberrypi wpantund[423]: Node type change: "router" -> "leader"

Device 3 (OK):

Jun 12 09:07:58 raspberrypi wpantund[426]: Starting wpantund 0.08.00d (May 14 2018 17:06:55) . . .
Jun 12 09:07:58 raspberrypi wpantund[426]: #011SOURCE_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:58 raspberrypi wpantund[426]: #011BUILD_VERSION = 0.07.01-209-ga88988a
Jun 12 09:07:58 raspberrypi wpantund[426]: Configuration file "/etc/wpantund.conf" read.
Jun 12 09:07:58 raspberrypi wpantund[426]: Ready. Using DBUS bus ":1.3"
Jun 12 09:07:58 raspberrypi wpantund[426]: Running as root without dropping privileges!
Jun 12 09:07:58 raspberrypi wpantund[426]: [-NCP-]: NCP was reset (STATUS_RESET_EXTERNAL, 113)
Jun 12 09:07:58 raspberrypi wpantund[426]: State change: "uninitialized" -> "offline"
Jun 12 09:07:58 raspberrypi wpantund[426]: NCP is running "OPENTHREAD/20170716-00532-gec59d7e-dirty; NRF52840; Apr  6 2018 23:55:28"
Jun 12 09:07:58 raspberrypi wpantund[426]: Driver is running "0.08.00d (0.07.01-209-ga88988a; May 14 2018 17:06:55)"
Jun 12 09:07:58 raspberrypi wpantund[426]: Network is not joinable
Jun 12 09:07:58 raspberrypi wpantund[426]: State change: "offline" -> "offline:commissioned"
Jun 12 09:07:58 raspberrypi wpantund[426]: Finished initializing NCP
Jun 12 09:07:58 raspberrypi wpantund[426]: AutoResume is enabled. Trying to resume.
Jun 12 09:07:58 raspberrypi wpantund[426]: NCP is commissioned. Resuming...
Jun 12 09:07:58 raspberrypi wpantund[426]: State change: "offline:commissioned" -> "associating"
Jun 12 09:08:01 raspberrypi wpantund[426]: State change: "associating" -> "associated"
Jun 12 09:08:01 raspberrypi wpantund[426]: Node type change: "unknown" -> "end-device"
Jun 12 09:08:04 raspberrypi wpantund[426]: Node type change: "end-device" -> "router"

Looking at my application network log i can see that the last message received by it is at 11:04:37, so 2:30 before becoming leader.

The other devices receive their last message from Device 2 at 11:04:25 for both Device 1 and 3, despite Device 2 trying to send several messages after 11:04:25 (even after becoming the "new" leader).

Do you ever have encountered such a issue ? To me it looks like one board stopped receiving messages for any reason, and after a while decided to start it's own network.

After a reboot it all restart normally and the devices are in the same network and communicating fine again.

I am using the nRF52840 Hex from the "nRF5_SDK_for_Thread_and_Zigbee_v1.0.0" (example/thread/ncp/usb/hex/nrf52840_xxaa.hex) and wpantund commit a88988a (28 april 2018). It is the first and only time this bug occurred, despite several run longer than 2 hours.

My application is running on the raspberry and just using the network interface wpan0 provided by the wpantund driver, without needing/using any further knowledge.

Thank you,

Virgile

Edit: Same issue appeared again (from device 3 this time), it was not happening with only 2 boards, that's why I said it was the first time it appeared in the network (I only tested with 2 devices until now). This time I didn't kill the process immediatly, so I run a wpanctl status, and I get Device 1 and 3 both output Network:NodeType = Leader, with identical Network Name, PANID, XPANID, channel and mesh local prefix, so basically the same thread network. I also did some network capture (wpanctl pcap + wireshark) from all three devices, and there are not packets from the new fake leader visible on any of the 3 nodes, even the fake leader. The devices other than this new fake leader still communicate (device 1 and 2) correctly.

This time again the incorrect devices can't send or receive about 2 min before decided to become a new leader.

This is not a range issue, as the 3 boards are on the same table at about 5/10 cm from each other

Parents
  • Hi,

    Our developers have been able to reproduce this behavior and are investigating this issue.

    Could you run the following commands on your border router(s), and post the output in this thread?

    wpanctl getprop
    wpanctl getprop OpenThread:MsgBufferCounters:AsString

    Best regards,
    Jørgen

  • You can provide the output of the commands regardless of if the issue have ocurred or not, it should give static information.

    Yes, the buffer size is of interest to consider memory leaks, etc. If you could provide sniffer logs, please attach it and I will forward it to the developers in case it could be useful

  • experiment.tar.gz

    Here is an archive of the latest experiment that failed, this time it is the device 01 that failed (it was a router). There is the output of the commands for each devices and a capture performed using pyspinel on another PC. Device01 address is 0xcc00 and its last message is at packet 319922 (15:53:52.525670). Hope it helps.

    It is the sending that fails first,  the buggy devices still receives messages until 15:53:57.677357 (that is around packet 320101), and the syslog indicate that it went from router to leader at 15:56:45 :

    Jun 13 15:56:45 raspberrypi wpantund[422]: Node type change: "router" -> "leader"
    Jun 13 16:08:47 raspberrypi wpantund[422]: SpinelNCPTaskSendCommand.cpp:322: Requirement Failed ((mRetVal) == 0)
    Jun 13 16:08:47 raspberrypi wpantund[422]: SendCommand task encountered an error: 16 (0x00000010)

Reply
  • experiment.tar.gz

    Here is an archive of the latest experiment that failed, this time it is the device 01 that failed (it was a router). There is the output of the commands for each devices and a capture performed using pyspinel on another PC. Device01 address is 0xcc00 and its last message is at packet 319922 (15:53:52.525670). Hope it helps.

    It is the sending that fails first,  the buggy devices still receives messages until 15:53:57.677357 (that is around packet 320101), and the syslog indicate that it went from router to leader at 15:56:45 :

    Jun 13 15:56:45 raspberrypi wpantund[422]: Node type change: "router" -> "leader"
    Jun 13 16:08:47 raspberrypi wpantund[422]: SpinelNCPTaskSendCommand.cpp:322: Requirement Failed ((mRetVal) == 0)
    Jun 13 16:08:47 raspberrypi wpantund[422]: SendCommand task encountered an error: 16 (0x00000010)

Children
No Data
Related