This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

connectivity_bridge malfunctions when BLE client disconnects and reconnects

I noticed that when a BLE client disconnects and then reconnects to the, instead of receiving all data arriving over the UART from nRF9160 to nRF52840, there is a delay of several seconds after which a huge block of 2048 bytes is sent over to the BLE client. This size matches the BRIDGE_BUF_SIZE configuration. The issue is observed on both Thingy:91 using the sample provided as part of the precompiled firmware, and also on nRF9160 DK - I ported the sample to work on it, so I observe the same behaviour on both boards. I tried changing BRIDGE_BUF_SIZE, and since I observed that internally 182 bytes are used for BLE transmission, I used that value. Now parts of the NMEA data sent from the GPS sample is randomly scrambled, resulting in 40% or NMEA packets that fail basic validation and CRC.

This attached log has been recorded from nRF9160dk, using the modified connectivity_bridge, which I adapted to run on the DK.
The GPS sample has also been modified, so send a copy of the NMEA strings to its second UART, which on the modified board links to the nRF52840 chip and is forwarded to BLE (application) and USB (this log file).

The following changes have been made:
nrf/samples/nrf9160/gps/src/main.c
static void print_nmea_data(void)
{
for (int i = 0; i < nmea_string_cnt; ++i) {
printk("%s", nmea_strings[i]);
+
+ #if CONFIG_TRACING_TEST
+ TRACING_STRING("%s", nmea_strings[i]);
+ #endif
+
+ #if CONFIG_UART_BLE
+ int uart_ble_send(char * msg, uint16_t length);
+ uart_ble_send(nmea_strings[i], strlen(nmea_strings[i]));
+ #endif
}
}

nrf/applications/connectivity_bridge.nrf9160dk/src/modules/Kconfig
BRIDGE_BUF_SIZE=182

nrf/applications/connectivity_bridge.nrf9160dk/prj.conf
CONFIG_USB_DEVICE_PRODUCT="nrf9160dk UART"

Send a copy of the data being sent to BLE over the debug UART of nRF52840. The USB on nRF9160dk connects to an OpenWRT router which acts as a mobile UART to TCP bridge and is then recorded on a computer using netcat.

nrf/applications/connectivity_bridge_g/src/modules/ble_handler.c
static void bt_send_work_handler(struct k_work *work)
{
uint16_t len;
uint8_t *buf;
int err;
bool notif_disabled = false;

do {
len = ring_buf_get_claim(&ble_tx_ring_buf, &buf, nus_max_send_len);

err = bt_gatt_nus_send(current_conn, buf, len);
if (err == -EINVAL) {
notif_disabled = true;
len = 0;
} else if (err) {
len = 0;
}

+ if (len)
+ {
+ char msg[BLE_TX_BUF_SIZE];
+ memcpy(msg, buf, len);
+ msg[len] = '\0';
+ printk("%s", msg);
+ }


err = ring_buf_get_finish(&ble_tx_ring_buf, len);
if (err) {
LOG_ERR("ring_buf_get_finish: %d", err);
break;
}
} while (len != 0 && !ring_buf_is_empty(&ble_tx_ring_buf));

if (notif_disabled) {
/* Peer has not enabled notifications: don't accumulate data */
ring_buf_reset(&ble_tx_ring_buf);
}
}

The following line is not valid:
$GPGLL,4.31375,N,02445.01620,E,1,04,100.00,312.72,M,0,,*26

Both of the following lines have a valid checksum:
* the first line contains invalid data
* the second line is valid
$GPGGA,085500.23,4208.30949,N,02444.97N,02444.9766.16,217.38,M,0,,*23
$GPGGA,085500.23,4208.30949,N,02444.97660,E,1,10,1.16,217.38,M,0,,*23

Notice how a part of the packet is missing or replaced with data, from
another packet. 40% of the packets are discarded, because they:
* do not begin with $
* do not have * in position line_length - 3
* do not have a valid checksum

From the the remaining 60% of the packets which pass the above validation,
a few are discarded because they:
* have values out of range
* have missing or more than one . character in a float value
* have less or more characters per line of value than expected

In rare events, there are still packets that pass all validation, but
contain slightly invalid data, e.g. there could be a long spike on the map
when our application is drawing the path.

If BRIDGE_BUF_SIZE is reverted to its default value of 2048,
Once the BLE client disconnects and then reconnects, the connectivity bridge
starts working incorrectly and will send large blocks of 2048 bytes containg
multiple NMEA packets.

ble_handler.test.c is my attempt to add tracing code and investigate what
is wrong with the connectivity_bridge. It seems that the ble_handler is holding
all buffers, so the uart_handler has no space to store arriving data.
At some point the buffers are freed by BLE, but then UART sends another 2048 bytes
of data, which arrave at bt_send_work_handler(). It starts sending blocks of
nus_max_send_len=182 until the BLE driver chokes. At this point BLE transmission
stalls for a few seconds, and UART RX will drop some data. Next the process repeats.
I am sorry for not being able to provid more details, and this parragraph may not be
fully accurate, because the code is extremely hard to follow.

I need to make one very important note towards the Nordic SDK developers:
Please make things simple and reliable!
I cannot say if the above issue is due to the existing design patterns in Zephyr OS,
but I see that all design patters are extremely complex and hard to follow, even
for taks that are very simple to implement, such as read/write from UART or BLE,
I see that you write 400 lines of code for each!
Hence it comes to no surprise that such a complex code has bugs which are hard to
diagnose or resolve. For example, as part of my testing I needed to change the
gps sample for nRF9160dk, so that in addition to its normal output, it sends a copy
with only the NMEA strings to its second UART, which I routed to the nRF52840 chip to
forward over BLE. My expectation was that I can simply write the data to UART.
But instead of a one line solution, I was forced to dedicate 330 lines of code:
zephyr/boards/arm/nrf9160dk_nrf9160_g/uart_ble.c
with a lot of handling routines and thigs that should normally be handled by
the driver, and any decent operating system provides a simple to use interface.

Using Nordic SDK master

nrf
commit ecf5d334f1577cc7fca11ae7b5a7fb86f0ea757a

zephyr
commit 7d20f2ebf25991b2897b91275939f8d16d38513a

 4572.projects.7z

  • Could you open a new ticket about the last question. In this way we can keep stuff more organized and only focus on the connect/disconnect issue.

    I tested out commit 38b7dd13ce1a8e9b8b84808d9ad and was not able to reproduce the bug. If I use ecf5d334f1577cc7fca11ae, then I am able to reproduce it.

    Try the following:

    • cd <ncs location>/nrf
    • git checkout 38b7dd13ce1a8e9b8b84808d9ad56360
    • west update
    • Add the sample below (gps_thingy) to nrf/samples and build and flash it for thingy91_nrf9160ns 

    4278.gps_thingy.zip

    • cd into nrf/applications/connectivity_bridge  and build and flash it for thingy91_nrf52840
    • Connect to the app
      • You may have to attach the Thingy91 to the computer via USB and open This PC --> ZEPHYR_USB (E:) --> config.txt and set BLE_ENABLED=1 before ejecting/removing the USB
    • Then you should be able to disconnect/connect the BLE connection without things going haywire

    If you're still seeing the issue, or if you see it happen in some other circumstance, please get back to me and tell me how to reproduce it.

    Regarding the USB-issue, I'll look into this the next couple of days

    Best regards,

    Simon


    In hindsight I can see that we have been talking past each other the whole time. I didn't reallly understand what your issue was about until I actual reproduced it. Before that I though it the problem was that you lost all the data that was sent while being disconnected, which is obvious. My apologies for that.

  • Hello Simon! I am already at the commit you requested. It has a tag: tag: v1.4.99-dev1, and yes I am able to reproduce the original issue, as well as the new one, for which you requested me to create a new case. I'm glad you finally reproduced and understand what the issue is about! Note that guessing isn't going to lead us to success. We can't just jump from one commit to another, adapt and retest the entire project, every time you make a guess, only to discover that it makes no difference. You need to understand the root cause, and be certain that you have a working fix.

    I was already here:

    commit 38b7dd13ce1a8e9b8b84808d9ad56360dd35e0a0 (HEAD, tag: v1.4.99-dev1)
    Author: Ruth Fuchss <…@nordicsemi.no>
    Date: Wed Dec 2 15:58:06 2020 +0100

    doc: update production statement

    Update the statement on the start page to make it clear that
    dev tags are not supported for production.

  • BLE-UART issue

    Im not able to reproduce the UART-BLE issue on 38b7dd13ce1a8e9b8b848

    I am able to reproduce the UART-BLE issue on ecf5d334f1577cc7fca.

    It is strange that you see the issue on 38b7dd13ce1a8e9b8b848. Can you explain how to reproduce it? Do you let it stay on for a long time? Disconnect/connect many times? I will try some more tomorrow and see if I can reproduce it on 38b7dd13ce1a8e9b8b848.

    Explanation

    Sorry for not including any explanations in the last reply.

    I have not done in-depth research on why it works on 38b7dd13ce1a8e9b8b848, but I think I have an idea. When using ecf5d334f1577cc7fca, I would see that 2048 bytes was received immediately after BLE was turned on that I think messed things up.  After using 38b7dd13ce1a8e9b8b848, this should get fixed by https://github.com/zephyrproject-rtos/zephyr/pull/28672. I will do more research tomorrow to get more insight into this. 

  • When the old version is used, it is sufficient to disconnect and reconnect the BLE client to reproduce. With 38b7dd13ce1a8e9b8b84808d9ad56360dd35e0a0, however the same bug appears on its own while both BLE and USB are connected. It could take a few minutes up to a few hours. Then both BLE and UART receive packets of 2 KB with a pause in between. The bug persists for around 2-4 minutes, and after that the proper behaviour is restored.

    The video below is using 38b7dd13ce1a8e9b8b84808d9ad56360dd35e0a0. You can see how a message is cut in the middle, and then there is a long pause. Then a huge packet of messages suddenly appears over the USB-UART. Followed by another pause. This repeats a few times. At 3:44 normal operation is restored, and messages are forwarded as they arrive.

    www.youtube.com/watch

  • Thanks for the description on how to reproduce it on 38b7dd13ce1a8e9b8b84808d9ad56360dd35e0a0. I will test i out tomorrow and see if I can figure out the core of the issue.

Related