LED Strip - SPI failures at ~250 LEDs?

Hi all.

I have built a board around a module containing the nRF52833. It is essentially a set of switching regulators that drive multiple strings of LEDs. I have been testing the board with relatively short strings of SK6812 RGBW LEDs (up to 120 or so) and everything has worked fine. Today I received a longer strip, and I'm having trouble driving it.

The code is relatively simple. I generate pixels and send them to the strip using the ws2812_spi driver:

ret = led_strip_update_rgb(strips[i]->device, strips[i]->buffer, offset); // Update physical


An example of my config (there's one for each of the four SPI buses):

&spi2 {
    compatible="nordic,nrf-spim";
    status = "okay";

    pinctrl-0 = <&spi2_default>;
    pinctrl-1 = <&spi2_sleep>;
    pinctrl-names = "default", "sleep";

    led_strip2: ws2812@2 {
        compatible = "worldsemi,ws2812-spi";

        /* SPI */
        reg = <2>; /* ignored, but necessary for SPI bindings */
        spi-max-frequency = <SPI_FREQ>;

        /* WS2812 */
        chain-length = <300>; /* arbitrary; change at will */
        color-mapping = <LED_COLOR_ID_GREEN
                 LED_COLOR_ID_RED
                 LED_COLOR_ID_BLUE
                 LED_COLOR_ID_WHITE
                 >;
        spi-one-frame = <ONE_FRAME>;
        spi-zero-frame = <ZERO_FRAME>;
    };
};

The issue appears when I try to update a longer strip: up to around 240 pixels works fine, but somewhere around 250 it fails, with something locking on a mutex or otherwise dying. Note that the strip itself is correctly updated to the number of pixels I requested, including my maximum string length of 300, but the board/OS is then locked up.
If I debug the code, wait for it to fail, and then halt (no fatal error happens "naturally"), the main thread call stack is stuck here:

arch_swap(unsigned int key) (c:\ncs\v2.6.0\zephyr\arch\arm\core\cortex_m\swap.c:48)
spi_context_wait_for_completion(struct spi_context * ctx) (c:\ncs\v2.6.0\zephyr\drivers\spi\spi_context.h:169)
transceive(const struct device * dev, const struct spi_config * spi_cfg, const struct spi_buf_set * tx_bufs, const struct spi_buf_set * rx_bufs, _Bool asynchronous, spi_callback_t cb, void * userdata) (c:\ncs\v2.6.0\zephyr\drivers\spi\spi_nrfx_spim.c:446)
spi_nrfx_transceive(const struct device * dev, const struct spi_config * spi_cfg, const struct spi_buf_set * tx_bufs, const struct spi_buf_set * rx_bufs) (c:\ncs\v2.6.0\zephyr\drivers\spi\spi_nrfx_spim.c:485)
spi_transceive(const struct spi_buf_set * rx_bufs, const struct spi_buf_set * tx_bufs, const struct spi_config * config, const struct device * dev) (c:\V\led_board_testing\build\zephyr\include\generated\syscalls\spi.h:38)
spi_write(const struct spi_buf_set * tx_bufs, const struct spi_config * config, const struct device * dev) (c:\ncs\v2.6.0\zephyr\include\zephyr\drivers\spi.h:837)
spi_write_dt(const struct spi_dt_spec * spec, const struct spi_buf_set * tx_bufs) (c:\ncs\v2.6.0\zephyr\include\zephyr\drivers\spi.h:855)
ws2812_strip_update_rgb(const struct device * dev, struct led_rgb * pixels, size_t num_pixels) (c:\ncs\v2.6.0\zephyr\drivers\led_strip\ws2812_spi.c:152)
led_strip_update_rgb(size_t num_pixels, struct led_rgb * pixels, const struct device * dev) (c:\ncs\v2.6.0\zephyr\include\zephyr\drivers\led_strip.h:105)
For clarity, the exact line it is blocked on is:
if (k_sem_take(&ctx->sync, timeout)) {

I only get an exception when I halt at a breakpoint, and it is raised on the logging thread (which seems arbitrary: if I remove logging entirely, the exception moves to the gpio_handler thread):
k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t * esf) (c:\ncs\v2.6.0\zephyr\kernel\fatal.c:41)
<signal handler called> (Unknown Source:0)
<signal handler called> (Unknown Source:0)
mpsc_pbuf_free(struct mpsc_pbuf_buffer * buffer, const union mpsc_pbuf_generic * item) (c:\ncs\v2.6.0\zephyr\lib\os\mpsc_pbuf.c:577)
msg_free(struct mpsc_pbuf_buffer * buffer, const union log_msg_generic * msg) (c:\ncs\v2.6.0\zephyr\subsys\logging\log_core.c:739)
log_process() (c:\V\led_board_testing\build\zephyr\include\generated\syscalls\log_ctrl.h:57)
log_process_thread_func(void * dummy1, void * dummy2, void * dummy3) (c:\ncs\v2.6.0\zephyr\subsys\logging\log_core.c:908)

Also note that I have trivially modified a few SPI files to add logging, and changed the ws2812_spi.c driver to set the buffer length from num_pixels instead of the length preconfigured in the config (i.e. buf.len = cfg->num_colors * 8 * num_pixels;), so line references in the call stacks may not line up.
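To make that buffer-length override concrete, here is a minimal, standalone sketch of the arithmetic with a clamp against the configured chain length. The helper name ws2812_tx_len and the macro are hypothetical and not part of the Zephyr driver; the only thing taken from the post is the formula num_colors * 8 * num_pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption from the formula above: each colour bit is sent as one SPI byte. */
#define WS2812_BYTES_PER_COLOR 8U

/*
 * Hypothetical helper (not a Zephyr API): returns the number of SPI bytes
 * needed for num_pixels, or 0 if the request would exceed a buffer that was
 * statically sized for chain_length pixels.
 */
static size_t ws2812_tx_len(size_t num_colors, size_t num_pixels,
                            size_t chain_length)
{
    assert(num_colors > 0U);

    if (num_pixels > chain_length) {
        return 0U; /* caller asked for more pixels than the buffer holds */
    }
    return num_colors * WS2812_BYTES_PER_COLOR * num_pixels;
}
```

With four colour channels (RGBW), 250 pixels work out to 8000 bytes and the full 300-pixel chain to 9600 bytes, so a request above chain-length would run past the driver's buffer unless it is rejected.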
Any hints for debugging/diagnosing or potential hardware limitations I'm hitting are welcome (I don't believe this to be a power issue as all the LEDs are correctly updated and lit despite SPI not returning, and power draw is ~1 amp with the regulator rated at 5 and my PSU rated at 10).
Versions:
NCS 2.6.0; custom dts based on this:
#include <nordic/nrf52833_qiaa.dtsi>
#include <zephyr/dt-bindings/led/led.h>
Thanks for reading.
  • pilux said:
    Good news on a Friday night: I have found a buffer overrun in my own (LED pixel generation) code and early indications are that this fixes my core issue.

    Great job!
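
For anyone hitting a similar overrun in pixel-generation code, one possible guard is to route every write through a bounds-checked setter. This is only a sketch: struct pixel is a stand-in for Zephyr's struct led_rgb, and pixel_buf / pixel_buf_set are hypothetical names, not the poster's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Zephyr's struct led_rgb so the sketch compiles on its own. */
struct pixel {
    uint8_t r, g, b;
};

/* Hypothetical wrapper pairing a pixel array with its capacity. */
struct pixel_buf {
    struct pixel *data;
    size_t capacity; /* number of pixels, not bytes */
};

/* Writes one pixel; returns false instead of writing out of bounds. */
static bool pixel_buf_set(struct pixel_buf *pb, size_t index, struct pixel px)
{
    assert(pb != NULL && pb->data != NULL);

    if (index >= pb->capacity) {
        return false;
    }
    pb->data[index] = px;
    return true;
}
```

A rejected write can then be logged or counted, which turns silent memory corruption into a visible error.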

    pilux said:
    Is it possible that I trashed something in memory and it survived a debug restart

    Yes, see Reset behavior. Soft resets do not reset RAM.

    pilux said:
    I'm concerned there's still something wrong and it's going to bite me randomly.

    Ah, the joy of programming: you never know 100% for sure whether that will happen.

    Consider adding a watchdog that resets your device if it hangs; that will usually catch the worst issues.
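
A watchdog only reacts once the system has already hung. As a complementary check, a guard word placed directly after the pixel data can reveal an overrun before it causes a lockup. The sketch below is an assumption-laden illustration: the struct layout, names, and magic value are all hypothetical, and it packs each pixel into one 32-bit word purely for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PIXELS  300U         /* matches chain-length in the devicetree above */
#define GUARD_MAGIC 0xDEADBEEFU  /* arbitrary marker value */

/* Hypothetical layout: the guard word sits directly after the pixel data. */
struct guarded_frame {
    uint32_t pixels[NUM_PIXELS]; /* one packed word per pixel (assumption) */
    uint32_t guard;
};

/* Call once at boot; RAM contents may survive a soft reset. */
static void guarded_frame_init(struct guarded_frame *f)
{
    f->guard = GUARD_MAGIC;
}

/* Returns true while the guard word is intact. */
static bool guarded_frame_ok(const struct guarded_frame *f)
{
    return f->guard == GUARD_MAGIC;
}
```

Checking guarded_frame_ok() after each frame is generated, before handing the buffer to the LED driver, would flag a write that ran one or more words past the end of the pixel array.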

    pilux said:
    I wasn't turning the device off and on, just modify code, build, debug

    It just goes to show that the age-old advice "have you tried turning it off and on again" is not yet stale.
