Testing and extensive debugging of Matter-based application

Hello
I am currently developing a custom Matter device (dimmer, to be exact) with UI (buttons and buzzer) and using Matter Template app as a foundation for my code.
Right now, I have a working device with required functionality, but during testing with Matter protocol I noticed unwanted behavior. 
This behavior consists of main application random freezing (sometimes even for >1sec) during execution of Matter Cluster commands like 'Move To Level' with non-zero duration.
This is what my current findings are:
1) This issue is not related to my own custom code. This was checked by me by commenting all custom code, leaving only AppTask Event queue scheduler running and deleted portions of code responsible for button and LED handling by example app 
2) Disabling or enabling logging and other debugging functions does not change this behavior. With default logging enabled, I can see that during these freezes logging and everything that outputs logs (cluster functionality, main app etc.) also stop.
Although I am quite new to NRF Connect SDK and Zephyr OS functionality, I can deduct that this issue is somehow connected with some critical thread starving and taking CPU cycles or not releasing CPU for other threads. However, I still need help in understanding how to run an extensive debugging process and how to pinpoint the source of my problems.
First of all, I cannot use nrf debug effectively because any breakpoint or pause in code execution will crash the application due to ASSERT error thrown either by SoftDevice or MPSL. Is there any way to see current status of the threads without pausing code execution?
Or is there any way to use logging functionality without J-link debugging to pinpoint the source of freezes?

  • Hi,

    I have started looking into your issue and will get back to you tomorrow.

    Best regards,

    Marte

  • Hello,
    Thank you for your attention. I tried to resolve this issue myself and tried to use different options available for debugging, although with no success.
    Here I will publish current findings.
    I added additional clusters and supplemented code to my Matter app using guide 'Adding clusters to Matter application (developer.nordicsemi.com/.../ug_matter_gs_adding_clusters.html) as a reference. The code I used to push changed attribute values to my application is below:

    #include "app_task.h"
    
    #include <app-common/zap-generated/ids/Attributes.h>
    #include <app-common/zap-generated/ids/Clusters.h>
    #include <app/ConcreteAttributePath.h>
    #include <app-common/zap-generated/attributes/Accessors.h>
    
    
    
    using namespace ::chip;
    using namespace ::chip::app::Clusters;
    
    
    
    void MatterPostAttributeChangeCallback(
        const chip::app::ConcreteAttributePath & attributePath,
        uint8_t type, uint16_t size, uint8_t * value)
    {
        AppEvent event;
        ClusterId clusterId = attributePath.mClusterId;
    	AttributeId attributeId = attributePath.mAttributeId;
        if (clusterId == OnOff::Id && attributeId == OnOff::Attributes::OnOff::Id)
        {
    
            event.Type = AppEventType::DimmerOnOff;
            event.DimmerEvent.State = *value;
            event.Handler = AppTask::DimmerCHIPControl;
            AppTask::Instance().PostEvent(event);
        }
        else if (clusterId == LevelControl::Id && attributeId == LevelControl::Attributes::CurrentLevel::Id)
        {
            bool onoff;
            OnOff::Attributes::OnOff::Get(attributePath.mEndpointId, &onoff);
            if (onoff)
            {
                event.Type = AppEventType::DimmerLevelControl;
                event.DimmerEvent.NewLevel = *value;
                event.Handler = AppTask::DimmerCHIPControl;
                AppTask::Instance().PostEvent(event);
            }
    
        }
    }


    My application handler to Matter attribute change:
    void AppTask::DimmerCHIPControl(const AppEvent &event)
    {
    	LOG_INF("current uptime %lld", k_uptime_get());
    	uint8_t buzzer_duration;
    	if (event.Type == AppEventType::DimmerOnOff)
    	{
    
    		sCh0IsOn = event.DimmerEvent.State;
    		buzzer_duration = 100U;
    	}
    	if (event.Type == AppEventType::DimmerLevelControl)
    	{
    		buzzer_duration = 50U;
    		sCh0Brightness = event.DimmerEvent.NewLevel;
    		LOG_INF("setting new dimmer brightness 0..254 %d", sCh0Brightness);
    	}
    	buzzer_squeak(BUZZER_STANDARD_VOLUME, buzzer_duration);
    	leds[LED_CH0].channelState = sCh0IsOn;
    	dimmer_brightness_set(sCh0Brightness*4U + 1U);
    	dimmer_state_set(sCh0IsOn);
    }


    My test setup consists of precompiled chip-tool app on Ubuntu 20.04 VM and OTBR in Docker on the same VM, which works fine. This reported issue appears during pushing commands like 'move-to-level' (level control cluster) with non-zero transition time:
    ./chip-tool-release levelcontrol move-to-level 20 2 0 0 222 1


    Currently, I'm outputting logs into my serial terminal (115200 baud rate), using CONFIG_MATTER_LOG_LEVEL_DEBUG.
    In my logs, that I attached below, I can see that I have ~1.5sec delay between consecutive attribute change callbacks:
    D: 1007718 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529ae
    I: 1007725 [ZCL]Event: move from 173
    I: 1007728 [ZCL] to 172 
    I: 1007730 [ZCL](diff -1)
    D: 1007750 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529af
    I: current uptime 1007756
    I: setting new dimmer brightness 0..254 172
    D: 1007764 [DMG]Endpoint 1, Cluster 0x0000_0005 update version to 108217c8
    D: 1007771 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b0
    I: 1007777 [ZCL]Event: move from 172
    I: 1007781 [ZCL] to 171 
    I: 1007783 [ZCL](diff -1)
    D: 1007802 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b1
    I: current uptime 1007809
    I: setting new dimmer brightness 0..254 171
    D: 1007816 [DMG]Endpoint 1, Cluster 0x0000_0005 update version to 108217c9
    D: 1007823 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b2
    I: 1007830 [ZCL]Event: move from 171
    I: 1007833 [ZCL] to 170 
    I: 1007835 [ZCL](diff -1)
    D: 1009510 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b3
    I: current uptime 1009517
    I: setting new dimmer brightness 0..254 170
    D: 1009524 [DMG]Endpoint 1, Cluster 0x0000_0005 update version to 108217ca
    D: 1009531 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b4
    I: 1009538 [ZCL]Event: move from 170
    I: 1009541 [ZCL] to 169 
    I: 1009544 [ZCL](diff -1)
    D: 1009562 [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 445529b5
    I: current uptime 1009569
    I: setting new dimmer brightness 0..254 169
    D: 1009577 [DMG]Endpoint 1, Cluster 0x0000_0005 update version to 108217cb

    I tried to use different methods to debug and see what is going on during these 1.5 seconds of app 'freeze':
    - standard debugging via VS Code debugging is not viable because any breakpoint or CPU halt will crash my application without any useful data.
    - I tried to implement SEGGER SystemView (as Monitor Mode Debugging is not supported by NCS), but connection with my target crashes every time my device receives Matter command (J-Link SystemView overflow).

    Dimmer driver uses 1 software interrupt called by pin change every 10ms, timing functions showed execution time ~50us. It uses NRFX drivers to configure hardware peripherals. Resources allocated: PPI channel 15, GPIOTE, TIMER0 and EGU0.

    Buzzer driver uses one k_timer instance for timing the start and the end of PWM generation for buzzer. PWM generation uses standard PWM driver.

    Buttons, LEDs etc. do not affect this issue, as I already tested my app without initialised peripheral handling and still observed this behavior.

    I'm waiting for your response.


    Best wishes,
    Denys 

  • Hi Denys,

    Thank you for the explanation on how you have tested. I am not able to test this today myself, but I will try to reproduce it and test tomorrow.

    How are you checking/verifying that the application freezes? Is this based on the timing from the logs (such as with "current uptime %lld", k_uptime_get()), or are you seeing it from other behavior as well? Unless you have configured otherwise, logs will be buffered and processed at a later time. You can test by making the application log immediately in the context of the call by setting CONFIG_LOG_MODE_IMMEDIATE=y.

    Best regards,

    Marte

  • Hello,

    I should have said this earlier, but I am noticing this behavior only sometimes, less frequently with dimmer algorithm off. To replicate it, I am sending Move To Level command each time the last one successfully finishes its execution. The number of consecutive Move To Level commands is random: in general there are 2-10 commands between freezes. Also, I specifically used different time intervals between commands (up to a minute between commands)

    My application handler, each time it is called by MatterPostAttributeChangeCallback, calls:
    - start of PWM generation for buzzer (PWM generation is stopped through k_timer timeout callback); after PWM generation stop k_timer instance starts again for the same period for blanking period, when buzzer cannot be controlled.
    - configuring parameters for my dimmer algorithm, which uses beforementioned peripherals. PPI channels used are allocated through nrfx_ppi_channel_alloc function;
    - my application also uses address LEDs, I2C devices and UART for logging, peripherals used for this functionality are UART0, SPI2 and TWI1 (specifically assigned so that there is no shared registed addresses between them)

    During these application freezes every possible app functionality is broken:
    - buzzer call function is not called
    - dimmer algorithm, that uses EGU0 IRQ handler with priority 2 for synchronization with input signal, loses synchronization

    I tested again with  CONFIG_LOG_MODE_IMMEDIATE=y and noticed the same behaviour (I am using Termite as a COM port terminal, so there is no colors or font changes).

    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:13.637,969] [1B][0m<dbg> chip: LogV: [DMG]Endpoint 1, Cluster 0x0000_0005 update version to bad3b254[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:13.649,902] [1B][0m<dbg> chip: LogV: [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 5012241e[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:13.661,865] [1B][0m<inf> chip: [ZCL]Event: move from 146[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:13.669,952] [1B][0m<inf> chip: [ZCL] to 145 [1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:13.676,879] [1B][0m<inf> chip: [ZCL](diff -1)[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.436,401] [1B][0m<dbg> chip: LogV: [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 5012241f[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.448,211] [1B][0m<inf> app: current uptime 435448[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.455,718] [1B][0m<inf> app: setting new dimmer brightness 0..254 145[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.464,843] [1B][0m<inf> buzzer: called? 0[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.471,893] [1B][0m<dbg> chip: LogV: [DMG]Endpoint 1, Cluster 0x0000_0005 update version to bad3b255[1B][0m
    [1B][1;32muart:~$ [1B][m[1B][8D[1B][J[00:07:15.483,825] [1B][0m<dbg> chip: LogV: [DMG]Endpoint 1, Cluster 0x0000_0008 update version to 50122420[1B][0m

  • Hi,

    I have not been able to reproduce this issue yet.

    I asked the developers, and the only issue they are aware of is the following, where Matter's MoveToLevel implementation is broken due to assuming that a single tick takes a very short time, which is usually not the case on constrained devices:

    https://github.com/project-chip/connectedhomeip/issues/24193

    This issue might be causing the MoveToLevel command to execute for much longer than specified in the argument.

    Can you test with the light bulb sample and see if you are able to reproduce the issue there?

    Best regards,

    Marte

Related