10% failure rate of individual DM subsystem ranging calculations

Summary
I'm finding the DM subsystem's individual ranging calculations fail around 10% of the time (in line-of-sight conditions, around one meter apart). Is this expected?

Test Setup
I have two nRF5340's running a (slightly) modified version of the "nrf_dm" sample with all of the default Kconfig options and Bluetooth scan/advertising settings, except for the following:

CONFIG_DM_TIMESLOT_QUEUE_COUNT_SAME_PEER=1
CONFIG_LOG_MODE_DEFERRED=y

I'm doing one-directional ranging and using MAC address filters to know when to start ranging.
Environment is a typical office space with 10-20 nearby laptops and a couple of Wi-Fi routers
The nRF5340 on the left is the "reflector" and is only advertising.
The nRF5340 on the right is the "initiator" and is only scanning.
Most importantly: The application logic ensures only one ranging calculation happens every 3 seconds:
- the "reflector" turns on advertising (timeout = 1000ms) and starts a 3 second timer
- the "reflector" stops advertising after the first call to dm_request_add()
- when the timer expires, the "reflector" starts advertising again and the process repeats

Test Results

I get slightly different results depending on which device's log output I analyze. You can see from the data below that I tracked the failure types to help isolate ranging failures and ignore synchronization failures (e.g. "reflector" advertisement was never scanned by the "initiator").

From the "initiator" perspective:

Total Failures: 111

Ranging Failures: 58 i.e. data_ready() never called

CRC Failures: 15 i.e. Quality == "crc fail"

Scanning Failures: 38 i.e. scan_filter_match()/data_cb() never called

Ranging Requests: 699 i.e. dm_request_add() was called

Ranging Successes: 626 i.e. Quality == "ok"

Looking at just (Ranging Failures) / (Ranging Requests): 8.3% failure rate
Looking at (Ranging Failures + CRC Failures) / (Ranging Requests): 10.4% failure rate

From the "reflector" perspective:

Total Failures: 110

Ranging Failures: 74 i.e. data_ready() never called

CRC Failures: 11 i.e. Quality == "crc fail"

Scanning Failures: 25 i.e. adv_scanned_cb() never called

Ranging Requests: 712 i.e. dm_request_add() was called

Ranging Successes: 627 i.e. Quality == "ok"

Looking at just (Ranging Failures) / (Ranging Requests): 10.4% failure rate
Looking at (Ranging Failures + CRC Failures) / (Ranging Requests): 11.9% failure rate

Logic Analyzer

I'm seeing two main types of "ranging" failures in the logic analyzer. Note that I added a "success" GPIO that is toggled when the data_ready() callback indicates a ranging event completed.

Success

For comparison purposes, you can see both "reflector" (blue) and "initiator" (red) get the dm request, start ranging, and get a result.

Failure Mode #1 - Short Ranging Window

The time spent "ranging" on both the "reflector" (blue) and "initiator" (red) is truncated in this failure mode. You can see the ranging window is within a few milliseconds, and neither device gets results.

Failure Mode #2 - Long Ranging Window

The time spent "ranging" on both the "reflector" (blue) and "initiator" (red) is NOT truncated in this failure mode. You can see the ranging window is within a few dozen milliseconds, and neither device gets results.

Parents

0 tesc over 1 year ago

Hi,

Some of the issues that you see are in the synchronization (BLE) stage. Successful synchronization requires the devices to make contact (advertising advertising at the same time as the scanner is scanning) and successfully transfer three packets (advertisement, scan request, and scan response.)

The nrf_dm sample uses the following default settings:

Scanner (static struct bt_le_scan_param scan_param):
Scan interval: 60 ms. (BT_GAP_SCAN_FAST_INTERVAL, value of 0x60, unit is 0.625 ms.)
Scan window: 30 ms. (BT_GAP_SCAN_FAST_WINDOW, value of 0x30, unit is 0.625 ms.)

This provides a scan duty cycle of 50 %. Increasing the window, e.g. to 60 ms (same as the interval) would give 100 % scan dyty cycle, which should significantly increase reception of advertising packets.

Advertiser (struct bt_le_adv_param adv_param_nonconn):
Minimum advertising interval: 100 ms. (BT_GAP_ADV_FAST_INT_MIN_2, value of 0xA0, unit is 0.625 ms.)
Maximum advertising interval: 150 ms. (BT_GAP_ADV_FAST_INT_MAX_2, value of 0xF0, unit is 0.625 ms.)

A lower advertising interval (e.g. BT_GAP_ADV_FAST_INT_MIN_1 and corresponding ..MAX..) leads to more advertisements sent within the timeout, which should increase the chance of getting through to the scanner.

Increasing scan duty cycle and advertising frequency should be fine, given that the devices do not both advertise and scan. (If they use both roles, then duty cycle / advertising interval needs to be relaxed enough that there is enough time for scheduling both activities, hence the settings used in the sample. With only one role used for each device, it should be safe to increase the settings for that role.)

You could also consider increasing the advertising timeout value, for providing more time for synchronization. However depending on the time required for the ranging, this could increase the risk of ranging not getting finished before the next synchronization starts, resulting in more ranging failures.

Regards,
Terje
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 cal.rubbo over 1 year ago in reply to tesc

Thanks Terje, this is helpful for the "synchronization" failures like you said.

However, my purpose making this ticket was to discuss the "ranging" failures I'm seeing. I described how I isolated the failure types, and my percent failure calculations (plus logic analyzer samples) demonstrate failures purely in the "ranging" stage (i.e. both devices have synchronized and submitted a dm_request_add(), ranging was performed, but results weren't received or the quality was 'crc failure').

Because ranging events are so power-consuming, a failure rate as high as 10% becomes quite significant! Especially considering the tests I performed were in such low-stress scenarios (i.e. one isolated ranging event at a time between one pair of devices).

Should we expect individual "ranging" events to fail this much? Are there ways to improve the success rates of individual "ranging" events?
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 tesc over 1 year ago in reply to cal.rubbo

Hi,

What intrigues me the most is the type where data_ready() never gets called. In general I would expect the two to three seconds to be more than enough, from ranging starts until initializing the next synchronization. I have reached out internally to see if there are any known issues and/or what the expectation should be, and I hope to have some answers regarding this tomorrow.

Logs from the devices could reveal more details, regarding where and when the ranging fails, so that might be useful for further analysis.

Regards,
Terje
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 cal.rubbo over 1 year ago in reply to tesc
Here you go. In the zip file are my two projects "initiator" and "reflector" as well as log files and a log processing bash script.

04-09-24_hadm-ranging-failures.zip

04-09-24_hadm-ranging-failures$ ./proc.sh log_reflector.txt log_initiator.txt initiator_failure_scan: 32 initiator_failure_range: 96 initiator_failure_crc: 23 initiator_success: 807 initiator_requests: 926 reflector_failure_scan: 25 reflector_failure_range: 110 reflector_failure_crc: 16 reflector_success: 801 reflector_requests: 927

Also in the log file is a DSView log capture of 20 minutes of the above projects. You must use nRF5340's for both ends, because both projects depend on device tree overlays for a new GPIO p0.06 that is toggled when ranging results are ready.

Here's a screenshot, but you can also download DSView and open the capture file

note: initiator == G5, reflector == A7

Thanks for your help!
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

0 tesc over 1 year ago in reply to cal.rubbo

Hi,

Doing a short test on my desk, I have a significantly better success rate:

	Your test		My test
initiator
failure_scan	32	3.46 %	64	14.68 %
failure_range	96	10.37 %	3	0.69 %
failure_crc	23	2.48 %	2	0.46 %
success	807	87.15 %	431	98.85 %
requests	926	100.00 %	436	100.00 %

reflector
failure_scan	25	2.70 %	0	0.00 %
failure_range	110	11.87 %	5	1.44 %
failure_crc	16	1.73 %	5	1.44 %
success	801	86.41 %	338	97.13 %
requests	927	100.00 %	348	100.00 %

Please disregard my initiator_failure_scan number as that seems to mostly stem from my initator log starting at sequence 63. The real number is therefore 0 or 1.

My numbers for range and crc failures are both low (ballpark number around one percent, but a much longer test is needed for accurate numbers with so few occurences.) They also seem to be of similar size (but again small numbers mean this could be by chance.) From your logs I see you experience range failures of higher than 10 %, and range failures are 4-7 times as frequent as crc failures.

Since the range failures are missing data_ready(), they may stem from the following lines in dm.c:

                                } else {
                                        LOG_DBG("Ranging failed (nrf_dm status: %d)",
                                                                          dm_context.nrf_dm_status);
                                }

You should then get a log message printing a status code describing the type of error:

typedef enum {
        NRF_DM_STATUS_SUCCESS = 0,
        NRF_DM_STATUS_FAIL_SYNC = 1,
        NRF_DM_STATUS_EVENT_FAIL_TIMEOUT = 2,
        NRF_DM_STATUS_CONF_INVALID = 3
} nrf_dm_status_t;

(https://github.com/nrfconnect/sdk-nrfxlib/blob/39c4313f419d49a2f4d7fd14ad34da2b7ed8888e/nrf_dm/include/nrf_dm.h#L61C1-L66C19)

In order to investigate further, it would be great to know which of these (if any) are responsible for the failures. For that, you must enable logging on the debug level for the distance measurement module, and provide the log (or aggregated count after processing the log.)

Regards,
Terje

0 cal.rubbo over 1 year ago in reply to tesc

Unfortunately, I've moved on to other tasks so I can't spend time looking into this any more. However, your 1% failure rate is quite surprising to me, considering I repeated this test at my house and saw failure rates closer to 5%

Here's my results from that test (which I ran overnight), although unfortunately I don't have any logs like you're asking - just the statistics below.

initiator_failure: 634
initiator_failure_range: 476
initiator_failure_crc: 61
initiator_failure_scan: 97
initiator_success: 9449
initiator_request: 9986
initiator_total: 10083

Looking at just (Ranging Failures) / (Ranging Requests): 4.7% failure rate
Looking at (Ranging Failures + CRC Failures) / (Ranging Requests): 5.4% failure rate

reflector_failure: 967
reflector_failure_range: 736
reflector_failure_crc: 226
reflector_failure_scan: 5
reflector_success: 9116
reflector_request: 10078
reflector_total: 10083

Looking at just (Ranging Failures) / (Ranging Requests): 7.3% failure rate
Looking at (Ranging Failures + CRC Failures) / (Ranging Requests): 9.5% failure rate

Thanks for your help,

Cal
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel
0 tesc over 1 year ago in reply to cal.rubbo

Hi,

I see that you still have a lot of range failures, but also crc and scan.

I am not quite able to read the version numbers and production date from the stickers on the boards. What are they, for the two boards? I can potentially check against known issues with SoC and/or DK versions.

I see on the photograph of your setup, that one DK rests on top of a breadboard. Antennas are affected by conductive material in their near field. It may be as simple as the copper in the breadboard changing the electrical surroundings of the antenna enough to affect RF performance. The antenna is at the edge of the DK, under the Nordic Semiconductor logo and text. Rotating the DK by 90 degrees, so that the antenna end sticks out by at least 1-2 cm (half an inch or so) from the breadboard below, should be enough to eliminate that potential error source.

I am afraid for further debugging you should either get better logs or do a debug session. Figuring out where exactly it fails is key to understanding why you see the bad performance, and consequently to solve it.

A BLE sniffer trace could also provide some information, but that would be for the BLE part only. (E.g. show timing of advertising, scan request and scan response, as well as CRC failures for those as seen by the sniffer.)

Regards,
Terje
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Reply

0 tesc over 1 year ago in reply to cal.rubbo

Hi,

I see that you still have a lot of range failures, but also crc and scan.

I am not quite able to read the version numbers and production date from the stickers on the boards. What are they, for the two boards? I can potentially check against known issues with SoC and/or DK versions.

I see on the photograph of your setup, that one DK rests on top of a breadboard. Antennas are affected by conductive material in their near field. It may be as simple as the copper in the breadboard changing the electrical surroundings of the antenna enough to affect RF performance. The antenna is at the edge of the DK, under the Nordic Semiconductor logo and text. Rotating the DK by 90 degrees, so that the antenna end sticks out by at least 1-2 cm (half an inch or so) from the breadboard below, should be enough to eliminate that potential error source.

I am afraid for further debugging you should either get better logs or do a debug session. Figuring out where exactly it fails is key to understanding why you see the bad performance, and consequently to solve it.

A BLE sniffer trace could also provide some information, but that would be for the BLE part only. (E.g. show timing of advertising, scan request and scan response, as well as CRC failures for those as seen by the sniffer.)

Regards,
Terje
Cancel
Vote Up 0 Vote Down

Sign in to reply

Verify Answer

Cancel

Children

No Data