Help with streaming BLE audio with device as source

Hi there!

I am trying to make an application that plays an audio sample held in memory to a BLE speaker (another nRF5340 running as a headset). I have based my code on the nRF5340 BLE audio example. During development, I simply include the audio buffer from a header file (generated from a hex dump using xxd), so my raw PCM data (16-bit, 48 kHz) is ready to use.

My first approach, which gave promising results, was to "hijack" the encoder thread in audio_system.c of the BLE gateway example, next to where the test tone gets added. I added a callback there that receives a pointer to the buffer about to be encoded, and overwrote its contents with data from my sample using a memcpy() call. This very simple approach seemed to work, but the processor could not keep up: the audio was very choppy and I frequently got "I2S RX overrun" messages. The audio was discernibly my sample, though.

I investigated ways of being less intrusive to the example, so I rewrote the code to instead have my callback right at the I2S driver buffer switch (in the audio_datapath_i2s_blk_complete() function). Even though I feel I am shoving in my data right where the normal I2S driver is putting its PCM data, I got the same choppy audio. I did make sure that I kept passing the I2S driver a fresh "dummy" buffer to keep it satisfied while I passed the real rx_buf (from the FIFO) to my callback to put my data in.

In both of these approaches I did my best to minimize the code changes to the drivers (a handful of lines at most), but I continue to be unsuccessful.

I am starting to think my next approach should be to get rid of the I2S side of things completely (the I2S chip is not even on my custom PCB for the gateway). That is, remove the entire data path and audio system setup, and call sw_codec_encode() and streamctrl_encoded_data_send() directly from a timer (knowing my sample rate and chunk size, this should be pretty doable, I imagine). But before I go down this path, I want to see if there is any advice on alternative approaches or on getting my earlier attempts to work.
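As a rough sketch of the timer-driven idea above, the following shows only the bookkeeping for pulling fixed-size frames out of a PCM sample held in memory, wrapping around at the end. The `pcm_source` struct and `pcm_next_frame()` are hypothetical names, and the 10 ms frame and mono layout are assumptions. The calls to sw_codec_encode() and streamctrl_encoded_data_send() are deliberately left out because their signatures are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAMPLE_RATE_HZ   48000u
#define BYTES_PER_SAMPLE 2u   /* 16-bit PCM, mono assumed */
#define FRAME_MS         10u  /* assumed frame duration */
#define FRAME_BYTES      (SAMPLE_RATE_HZ / 1000u * FRAME_MS * BYTES_PER_SAMPLE)

/* Hypothetical read cursor over a PCM sample stored in memory. */
struct pcm_source {
    const uint8_t *data; /* start of raw PCM data */
    size_t len;          /* total length in bytes */
    size_t pos;          /* current read offset in bytes */
};

/* Copy the next n bytes into out, looping back to the start of the
 * sample whenever the end is reached. Intended to be called once per
 * timer tick, just before the frame is handed to the encoder. */
static void pcm_next_frame(struct pcm_source *src, uint8_t *out, size_t n)
{
    assert(src->len > 0);

    size_t done = 0;
    while (done < n) {
        size_t chunk = src->len - src->pos;
        if (chunk > n - done) {
            chunk = n - done;
        }
        memcpy(out + done, src->data + src->pos, chunk);
        done += chunk;
        src->pos = (src->pos + chunk) % src->len;
    }
}
```

With these assumptions a frame is 960 bytes, so a timer firing every 10 ms would call `pcm_next_frame()` once per tick and pass the result on for encoding.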

I know this application is not really what the BLE audio example was built for, but I feel this is a commonly desired use for it so hopefully I can get some help. Thanks!

  • Hi 

    Have you measured the runtime of the various steps in the mix_tone(..) function, to see if one or more of them could be running slower than you expect? 

    Double math should generally be avoided unless absolutely necessary. The Cortex-M33 is a 32-bit architecture with no hardware dedicated to 64-bit floating-point (double) processing, so double calculations run very slowly and require a lot of CPU number crunching. 32-bit floats, on the other hand, can take advantage of hardware acceleration, in addition to being less compute- and memory-intensive to start with.
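A common way for doubles to sneak in is through unsuffixed literals such as `0.5` or `32767.0`, which are of type double in C and promote the whole expression. The sketch below keeps a two-input mix in single precision; `mix_f32()` is a hypothetical helper for illustration and not one of the functions from the example application.

```c
#include <assert.h>
#include <stdint.h>

/* Mix two 16-bit samples with per-input gains, in single precision only.
 * Every literal carries the f suffix; writing 32767.0 instead of 32767.0f
 * would silently promote the comparison to double. */
static int16_t mix_f32(int16_t a, int16_t b, float gain_a, float gain_b)
{
    assert(gain_a >= 0.0f && gain_b >= 0.0f);

    float mixed = (float)a * gain_a + (float)b * gain_b;

    /* Saturate instead of wrapping on overflow. */
    if (mixed > 32767.0f) {
        mixed = 32767.0f;
    }
    if (mixed < -32768.0f) {
        mixed = -32768.0f;
    }
    return (int16_t)mixed;
}
```

The same applies to math library calls: `sinf()` and `powf()` stay in float, while `sin()` and `pow()` work in double. GCC's `-Wdouble-promotion` warning can help find implicit promotions.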

    For the note table, I would recommend just storing a static table in RAM and filling it with values at the start of your program.

    Do you know if any of the tone_gen(..), contin_array_create(..) or pcm_mix(..) functions use double?
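What the note table holds is not shown in the thread, so the following is only an illustration of the fill-once pattern. It assumes, hypothetically, that the table stores one frequency per semitone, and it computes the values once at startup so nothing has to be recalculated in the audio path. `note_freq_hz` and `note_table_init()` are made-up names.

```c
#include <assert.h>
#include <stddef.h>

#define NOTE_COUNT 13 /* one octave, both end notes included (assumed) */

/* Static table in RAM, filled once at startup. */
static float note_freq_hz[NOTE_COUNT];

/* Fill the table with equal-tempered semitones starting at base_hz. */
static void note_table_init(float base_hz)
{
    assert(base_hz > 0.0f);

    const float semitone = 1.0594631f; /* approximately 2^(1/12) */
    float f = base_hz;

    for (size_t i = 0; i < NOTE_COUNT; i++) {
        note_freq_hz[i] = f;
        f *= semitone;
    }
}
```

After calling `note_table_init(440.0f)` once during initialization, the mixing code only needs to index `note_freq_hz[]`.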

    Best regards
    Torbjørn

  • Hi Torbjorn!

    Thanks for the info and sorry for the slow reply. I switched to 32-bit floats, but that seems to have had no noticeable impact. I ran through the entire datapath and verified there are no other instances of doubles.

    Currently, my note table is stored like so in a file piano.h:


    static unsigned char piano_wav[] = {
      0x52, 0x49, 0x46, 0x46, 0xa4, 0xb2, 0x00, 0x00, 0x57, 0x41, 0x56, 0x45,
      0x66, 0x6d, 0x74, 0x20, 0x10, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00,
      0x80, 0xbb, 0x00, 0x00, 0x00, 0x77, 0x01, 0x00, 0x02, 0x00, 0x10, 0x00,
      0x64, 0x61, 0x74, 0x61, 0x80, 0xb2, 0x00, 0x00, 0x0a, 0x00, 0x08, 0x00,
      0x01, 0x00, 0xfe, 0xff, 0xfb, 0xff, 0xff, 0xff, 0x01, 0x00, 0x0b, 0x00,
      0x09, 0x00, 0x0d, 0x00, 0x09, 0x00, 0x07, 0x00, 0x00, 0x00, 0x02, 0x00,
      0xfc, 0xff, 0xfe, 0xff, 0xfb, 0xff, 0xf9, 0xff, 0xfb, 0xff, 0xfc, 0xff,
      0x00, 0x00, 0x05, 0x00, 0x06, 0x00, 0x0a, 0x00, 0x08, 0x00, 0x0f, 0x00,
      0x09, 0x00, 0x13, 0x00, 0x04, 0x00, 0x0f, 0x00, 0xfe, 0xff, 0x03, 0x00,
      0xf6, 0xff, 0xf5, 0xff, 0xe9, 0xff, 0xe5, 0xff, 0xd8, 0xff, ...

    I then include this in my main app.h and index the array as needed. The array is very long (roughly 45 kB). Zephyr reports that I have 448 kB of RAM, but I am wondering whether there are cache layers to this. Is there a performance penalty for accessing sections of this array compared to other variables?

    I will perform some in depth speed testing to get to the bottom of which operations are hurting me the most.

    Thanks for your continued support!

    EDIT: If accessing the array is in fact coming with a penalty, would it make sense to increase the packet size? I'm perfectly fine with less responsive audio if it means smoother playback. So instead of these really small chunks that need to be sent frequently, can we spread them out a bit, giving my mixing logic more time to execute before the sending logic needs the CPU again? Fewer context switches generally means faster execution.

    From what I can tell, frame and block size seem to be configurable, though I'm not sure how, nor how big I can make them before the Bluetooth driver starts to complain. Can you give me some tips for adjusting these? For my purposes, even a response time of around 50 ms is fine.

  • Hi Ben

    benwefers said:
    Zephyr reports that I have 448 kB of RAM, but I am wondering whether there are cache layers to this. Is there a performance penalty for accessing sections of this array compared to other variables?

    No, there is no cache for the RAM. It essentially runs at the same clock speed as the CPU, but there could be slowdowns if multiple peripherals want to access the same RAM block at the same time. Still, I doubt this is the reason you experience slowdowns here.

    Normally you would use const data (flash) for large arrays like this, and then there will be a cache involved, but again I doubt this would slow you down significantly. 

    One way to make memory transfers more efficient is to treat your data as 32-bit values rather than 8-bit values, if possible, since the RAM bus can transfer 32 bits in a single cycle.
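A minimal sketch of the word-wise idea: the copy below moves one 32-bit word per iteration instead of one byte, which assumes that both buffers are 4-byte aligned and that the length is a multiple of 4. `copy_words()` is a hypothetical helper. A library memcpy() typically performs this optimization on its own when alignment allows, so this matters mostly for hand-written loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n_bytes from src to dst, one 32-bit word at a time.
 * Both pointers must be 4-byte aligned and n_bytes a multiple of 4. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t n_bytes)
{
    assert(((uintptr_t)dst % 4u) == 0u);
    assert(((uintptr_t)src % 4u) == 0u);
    assert((n_bytes % 4u) == 0u);

    for (size_t i = 0; i < n_bytes / 4u; i++) {
        dst[i] = src[i];
    }
}
```

Declaring the audio buffers as `uint32_t` arrays (or with an alignment attribute) is one way to guarantee the alignment requirement. A 192-byte block then takes 48 word copies rather than 192 byte copies.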

    benwefers said:
    I will perform some in depth speed testing to get to the bottom of which operations are hurting me the most.

    Please do. As I mentioned earlier, simple pin toggling is often the most direct way to measure the speed of low-level functions like these.

    benwefers said:
    EDIT: If accessing the array is in fact coming with a penalty, would it make sense to increase the packet size?

    Generally, yes. The larger you make your buffers, the more efficient the processing will be, but the returns diminish as the buffers get very large. Do you know what size your buffers are currently?

    Best regards
    Torbjørn

  • Hi Torbjorn!

    Thanks for the quick reply and again for all the info. I will do more testing and hopefully have some more details later today. I am currently running everything with the default configuration, which from what I can tell is 192-byte blocks and 10 blocks per frame.

    In my program, I am manually triggering the audio_usb.c data_received() callback every 1000 us with 192 bytes. From what I can tell, the audio_system.c task just waits and collects 10 of these blocks before encoding and sending them. So I think that if I can drop the number of blocks down to 1 and pass in all 1920 bytes at once, that would give me a lot more processing time where I don't have to worry about yielding the CPU to the audio_system.c task. Would that make sense?
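The block and frame sizes mentioned above follow from simple arithmetic. The sketch below assumes 48 kHz, 16-bit, two-channel PCM, which is consistent with 192 bytes per 1000 us but is not confirmed in the thread; `pcm_format` and `pcm_bytes_for_us()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a raw PCM stream. */
struct pcm_format {
    uint32_t sample_rate_hz;
    uint32_t bytes_per_sample;
    uint32_t channels;
};

/* Number of PCM bytes covering duration_us microseconds of audio. */
static uint32_t pcm_bytes_for_us(const struct pcm_format *fmt, uint32_t duration_us)
{
    assert(fmt->sample_rate_hz > 0u);

    /* 64-bit intermediate avoids overflow for longer durations. */
    uint64_t bytes_per_sec = (uint64_t)fmt->sample_rate_hz *
                             fmt->bytes_per_sample * fmt->channels;

    return (uint32_t)(bytes_per_sec * duration_us / 1000000u);
}
```

Under these assumptions a 1000 us block is 192 bytes and a 10 ms frame is 1920 bytes, matching the defaults described above. A 50 ms budget would correspond to 9600 bytes of PCM, though whether the codec and Bluetooth stack accept frames that long is a separate question.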

    Thanks

    Ben

    EDIT: I just tried this in a simplistic way by changing my app logic to the following:

        // Generate the next whole frame ahead of its deadline
        if (empty) {
            get_next_block(frame, FRAME_SIZE_BYTES);
            empty = false;
        }

        // Send data block
        if (timer < usTime()) {
            if (timer < usTime() - 2500) {
                LOG_ERR("Missed deadline by %d us", usTime() - timer);
            }
            // Queue the frame one block at a time
            for (int i = 0; i < CONFIG_FIFO_FRAME_SPLIT_NUM; i++) {
                audio_manual_send(&frame[BLOCK_SIZE_BYTES * i], BLOCK_SIZE_BYTES);
            }
            timer += 10000;
            empty = true;
        }

    Now instead of generating each block in 192-byte chunks, I generate it in 1920-byte chunks (the whole frame), and then queue up each subsection. This dramatically improved the audio quality (it is almost passable now)!

    The only issue is that if I mix lots of tones (3 or more), it starts to get choppy again. I think I can improve performance further by increasing the frame size, and potentially by bypassing the audio_system altogether and calling sw_codec_encode() and streamctrl_encoded_data_send() directly. Thoughts?

  • Hi Ben

    Sorry for the slow response. I don't have any immediate clever ideas, but have you tried to use the thread analyzer to figure out how much CPU time you are using, and which threads are using the most? 

    I guess most of the heavy number crunching happens within the same thread, so you won't be able to figure out exactly which operations are causing it, but at least it would give you an impression of how much CPU is used overall, and how much is used by your algorithms, compared to the other parts of the application. 
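For reference, a prj.conf fragment along these lines is one way to enable Zephyr's thread analyzer with periodic output. The options are standard Zephyr Kconfig symbols, but the 5-second interval and printk output are arbitrary choices for illustration, and the audio application may need additional settings.

```
# Enable the thread analyzer and print a report automatically
CONFIG_THREAD_ANALYZER=y
CONFIG_THREAD_ANALYZER_USE_PRINTK=y
CONFIG_THREAD_ANALYZER_AUTO=y
CONFIG_THREAD_ANALYZER_AUTO_INTERVAL=5

# Show thread names and per-thread CPU utilization in the report
CONFIG_THREAD_NAME=y
CONFIG_THREAD_RUNTIME_STATS=y
```

The report lists stack usage and, with runtime stats enabled, CPU utilization per thread, which should indicate how much time the encoder thread takes relative to the application's own mixing logic.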

    Best regards
    Torbjørn
