Zigbee traces use the wrong log level

I am using the light_bulb sample from NCS v2.0.0 with an nRF52840 DK board.  I changed prj.conf to enable Warning traces from ZBOSS:

+# Enable traces on UART_1 (P1.02) @ 115200bps
+CONFIG_ZIGBEE_ENABLE_TRACES=y
+CONFIG_ZBOSS_TRACE_MASK=0x0003
+CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN=y
+CONFIG_ZBOSS_TRACE_BINARY_LOGGING=y
+CONFIG_ZBOSS_TRACE_LOGGER_DEVICE_NAME="UART_1"

I added test code to the app to verify that the correct loglevel setting was being applied and that the filters are working as intended:

 extern int zb_trace_check(zb_uint_t level, zb_uint_t mask);
LOG_INF("TRACE: %d: %d %d %d %d %d", g_trace_level,
zb_trace_check(0, 1),
zb_trace_check(1, 1),
zb_trace_check(2, 1),
zb_trace_check(3, 1),
zb_trace_check(4, 1));
LOG_INF("TRACE: %d: %d %d %d %d %d", g_trace_level,
zb_trace_check(0, 0x100),
zb_trace_check(1, 0x100),
zb_trace_check(2, 0x100),
zb_trace_check(3, 0x100),
zb_trace_check(4, 0x100));

The output was:

I: TRACE: 2: 1 1 1 0 0
I: TRACE: 2: 0 0 0 0 0

Unfortunately this still generates a massive amount of binary log data on UART1.  All of these look to be extremely mundane operations that hardly merit the Warning loglevel:

2022-06-09 16:23:23,115 RAW [de,ad,0e,02,9a,00,3d,07,18,00,77,02,06,00,00,00]
2022-06-09 16:23:23,115 ts=009a m=0x0002 lev=1 zb_buf_begin_func:631 data=[06,00,00,00]
2022-06-09 16:23:23,115 RAW [de,ad,0e,02,9a,00,3e,07,18,00,17,05,06,00,00,00]
2022-06-09 16:23:23,115 ts=009a m=0x0002 lev=1 zb_buf_get_status_func:1303 data=[06,00,00,00]
2022-06-09 16:23:23,115 RAW [de,ad,0e,02,9a,00,3f,07,18,00,1e,05,00,00,00,00]
2022-06-09 16:23:23,115 ts=009a m=0x0002 lev=1 zb_buf_get_status_func:1310 data=[00,00,00,00]
2022-06-09 16:23:23,120 RAW [de,ad,0e,02,9a,00,40,07,18,00,a1,02,06,00,00,00]
2022-06-09 16:23:23,120 ts=009a m=0x0002 lev=1 zb_buf_len_func:673 data=[06,00,00,00]
2022-06-09 16:23:23,120 RAW [de,ad,0e,02,9a,00,41,07,18,00,a1,02,06,00,00,00]
2022-06-09 16:23:23,120 ts=009a m=0x0002 lev=1 zb_buf_len_func:673 data=[06,00,00,00]
2022-06-09 16:23:23,121 RAW [de,ad,0e,02,9a,00,42,07,18,00,77,02,06,00,00,00]
2022-06-09 16:23:23,121 ts=009a m=0x0002 lev=1 zb_buf_begin_func:631 data=[06,00,00,00]
2022-06-09 16:23:23,121 RAW [de,ad,0e,02,9a,00,43,07,18,00,17,05,06,00,00,00]
2022-06-09 16:23:23,121 ts=009a m=0x0002 lev=1 zb_buf_get_status_func:1303 data=[06,00,00,00]
2022-06-09 16:23:23,136 RAW [de,ad,0e,02,9a,00,44,07,18,00,1e,05,00,00,00,00]

For instance, zb_buf_get_tail_func() is logging at the Error level even when no errors are occurring.

The reason why I worry about this is because I'm trying to capture information about an actual ZBOSS failure, and there is so much data coming out of this UART that it's dropping/corrupting frames.

I see similar issues when using other trace masks too.

Parents
  • Hello,

    Sorry for the basic questions, but what is the logging supposed to output?

    I: TRACE: 2: 1 1 1 0 0
    I: TRACE: 2: 0 0 0 0 0

    The zboss trace log is meant to log a bit, but unless they changed anything, the logs are encrypted, and they need to be sent to the company who provided the zboss stack (via us). Is there somethign particular you want to debug? 

    Best regards,

    Edvin

  • what is the logging supposed to output?

    I'm calling the function that ZBOSS uses to figure out whether to log a given message to the trace output:

    extern int zb_trace_check(zb_uint_t level, zb_uint_t mask);

    My test shows that the log mask and log level set in Kconfig are being respected by ZBOSS.  The problem is that the callers to zb_trace_msg_port() are specifying unreasonably high log level parameters, like logging routine buffer allocation events as errors.

    The zboss trace log is meant to log a bit, but unless they changed anything, the logs are encrypted, and they need to be sent to the company who provided the zboss stack (via us).

    They are not encrypted.  The binary format is trivial to decode; I wrote a couple of simple python scripts to work with them.

    Is there somethign particular you want to debug? 

    Yes, I'd like to reduce the trace log output volume because it's generating output faster than the serial/CDC output can handle it.  I'm troubleshooting a couple of different ZBOSS issues right now and this bug is potentially an impediment.

    Could you please ask DSR to fix the logging levels so that configuring e.g. CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN=y only outputs warnings rather than spewing debug traces into the binary trace log?

Reply
  • what is the logging supposed to output?

    I'm calling the function that ZBOSS uses to figure out whether to log a given message to the trace output:

    extern int zb_trace_check(zb_uint_t level, zb_uint_t mask);

    My test shows that the log mask and log level set in Kconfig are being respected by ZBOSS.  The problem is that the callers to zb_trace_msg_port() are specifying unreasonably high log level parameters, like logging routine buffer allocation events as errors.

    The zboss trace log is meant to log a bit, but unless they changed anything, the logs are encrypted, and they need to be sent to the company who provided the zboss stack (via us).

    They are not encrypted.  The binary format is trivial to decode; I wrote a couple of simple python scripts to work with them.

    Is there somethign particular you want to debug? 

    Yes, I'd like to reduce the trace log output volume because it's generating output faster than the serial/CDC output can handle it.  I'm troubleshooting a couple of different ZBOSS issues right now and this bug is potentially an impediment.

    Could you please ask DSR to fix the logging levels so that configuring e.g. CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN=y only outputs warnings rather than spewing debug traces into the binary trace log?

Children
  • I see. I will forward this to our Zigbee team. I will keep you posted.

    Best regards,

    Edvin

  • Hello,

    This is the reply I got from our Zigbee team:

    ---------------------------

    I don't really know what the customer wants to achieve. They have mentioned capturing ZBOSS errors - ZBOSS asserts are used for that by the stack.
    I am guessing that this ticket was created because of the network "stability issues" they previously had.

    I can't verify whether or not these logs does in fact are all about "extremely mundane operations that hardly merit the Warning loglevel" if they can collect these logs and provide them to us.

    When it comes to improving logging the ZBOSS traces, they could speed up UART by changing baudrate can be raised to 1M and/or increase internal ring buffer that stores zboss traces binary data.

    Also, you can inform customer that without access to the ZBOSS stack sources they wouldn't be able to decode the ZBOSS trace messages.

    Could you please ask DSR to fix the logging levels so that configuring e.g. CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN=y only outputs warnings rather than spewing debug traces into the binary trace log?

    There are no DBG level trace logs in the trace variant of ZBOSS libraries as they are not compiled in. Only logs up to Warning log level are compiled in.

    ----------------------------

    I am not really sure what you are trying to report, and I see that you have a lot of questions/cases here on DevZone. Are they all related? If so, perhaps we can try to boil it down to one ticket. The reason I am saying this is that our Zigbee team has a limited amount of support resources, and now they are seeing a lot of unrelated questions, which makes it take even longer to get a reply. Also, it doesn't help that it is summer holidays in Norway, unfortunately. 

    Best regards,

    Edvin

  • I can't verify whether or not these logs does in fact are all about "extremely mundane operations that hardly merit the Warning loglevel" if they can collect these logs and provide them to us.

    These were posted in text format in the original request.  I'll attach the corresponding binary trace here:

    log.bin

    As I mentioned earlier, I get these rapid-fire debug logs when I enable CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN=y.  This is quite obviously incorrect, because enabling the log option for "warnings+errors only" shouldn't be generating a flurry of log data on mundane buffer allocation events that aren't experiencing problems.  It should only be logging actual warnings.

    I would like to be able to enable warnings+errors (only) from all ZBOSS modules so I can spot problems, but because of this bug, enabling warnings generates logging overruns and massive amounts of irrelevant data.

    Also, you can inform customer that without access to the ZBOSS stack sources they wouldn't be able to decode the ZBOSS trace messages.

    I wrote my own parser to decode the trace messages, which is able to display many of the fields without source access.  It isn't able to decode the free-form parameters at the end, although sometimes I can figure out what those represent.  I'm using this parser to describe the log entries that I have questions about, as I did in this ticket.

    our Zigbee team has a limited amount of support resources, and now they are seeing a lot of unrelated questions, which makes it take even longer to get a reply

    Unfortunately I am finding a lot of Zigbee-related problems in the course of developing this product.

    When possible I try to fix the bug myself, and I've sent you a couple of pull requests on Github.  But my ability to troubleshoot the closed source libraries is somewhat limited so regrettably many of the bugs do need to be escalated to the Zigbee team.  I appreciate your support in getting to the bottom of things.

  • Hello,

    I discussed this again with our Zigbee team. These are some points that they said I could forward to you:

    The custome's log.bin contains ZBOSS traces from the buffer management subsystem, and these traces does in fact have the Error level assigned to them. These trace logs are compiled into the current certified set of Zigbee stack libraries. These can't be changed because they are on binary form, and changing anything would require them to be re-certified. I'd suggest not focusing on this right now.

    When it comes to the customer's ZBOSS trace decoder, I'd suggest that the ZBOSS traces are collected and provided to the Zigbee team when a problem occurs. This could help us understand the problem a little better and hopefully won't cause additional confusion.

    Let's continue nailing down the issues the customer is facing, one by one - as seen before, some issues may have a common root cause, so I see this as the best approach.

    Both I and the representant from the Zigbee team that have worked on this ticket are out of office until August 1st. I am sorry for the inconveniences this may cause.

    Best regards,

    Edvin

  • The custome's log.bin contains ZBOSS traces from the buffer management subsystem, and these traces does in fact have the Error level assigned to them. These trace logs are compiled into the current certified set of Zigbee stack libraries. These can't be changed because they are on binary form, and changing anything would require them to be re-certified. I'd suggest not focusing on this right now.

    I understand that this can't be fixed immediately without breaking certification.

    Can we please ask DSR to correct the log levels in the next ZBOSS release, so that when we request CONFIG_ZBOSS_TRACE_LOG_LEVEL_WRN we only get warnings+errors?

    When it comes to the customer's ZBOSS trace decoder, I'd suggest that the ZBOSS traces are collected and provided to the Zigbee team when a problem occurs. This could help us understand the problem a little better and hopefully won't cause additional confusion.

    Yes, I have been providing both parsed text logs and raw unparsed binary logs in most of my tickets.  The parsed logs are useful in that they have timestamps that let us correlate externally observed events, like ZC warnings and sniffed packets, with specific internal ZBOSS log entries.  Especially when the logs are large and span several hours of activity.

    I didn't have the raw binary logs when I first opened this ticket, because I had not yet written a program that collects them.

    Let's continue nailing down the issues the customer is facing, one by one - as seen before, some issues may have a common root cause, so I see this as the best approach.

    OK.  With respect to the other issues:

    • 291057, 290428, 290990, 286245 all seem to be caused by inadequate ZBOSS memory buffer allocation in the SDK.  i.e. the definitions shipped in the SDK, even the _max case, are inadequate for even a relatively small Zigbee setup with 5-10 other nodes.  I believe that the Zigbee team has acknowledged this as a problem.  Can we have the default SDK memory configurations updated to more reasonable values, so that we can get those tickets closed out?
    • The most annoying (and user-visible) bug I'm seeing right now is  zb_address_short_by_ref() inexplicably returns 0xffff and relatedly,  Is ZB_APS_ADDR_MODE_BIND_TBL_ID actually supported? . Both in the lab and in customer installations, this makes it unnecessarily troublesome to deploy one of our products.  I would prioritize that one right now.
    • For 290908, 289424, 286748 I am still seeing anomalies in the logs.  In my testing I haven't been able to figure out whether these anomalies will eventually manifest themselves as user-visible behavior.  I'd like to get some more feedback from the Zigbee team as to what sorts of failure stats (random rejoins, packet retries, etc.) they see when they set up a moderately complex Zigbee network in their lab, with a mix of ZBOSS ED, ZBOSS ZR, and third party nodes.  Presumably somebody has done this sort of testing so it's just a matter of seeing whether the results in my test setup match the results from a reference setup.
    • I'm also in the process of trying to figure out whether the Zigbee modem performance on the nRF52840 Dongle design is sub-par compared to other boards.  This is important because we have been assuming that the Dongle hardware design is suitable for our application and can be copied as-is; however we see lots of TX retries and suspect that RF performance is compromised in some way.  If you have any information on this, it would save me the hassle of deploying another test setup.
    Both I and the representant from the Zigbee team that have worked on this ticket are out of office until August 1st. I am sorry for the inconveniences this may cause.

    Understood.  Enjoy the time off Sunglasses

Related