inaccurate gcc nrf_delay_us

Question

There are a couple of problems with the SDK 7.1.0 implementation of nrf_delay_us for GCC. In my experiments the provided version generates delays 40-50% too long, as measured by before/after captures of TIMER0 running from undivided HFCLK.
 First, the "static inline" technique does not guarantee inlining with GCC. Inlining is critical for the intended delay to be exact. You need to force GNU inline semantics and add an attribute that makes GCC inline even when not optimizing. (Below I've done that in a way that works with -std=c99.)
 Second, implementing the loop control in C instead of assembly also makes the timing dependent on optimization levels. 
 Third, there are two too many NOPs in the loop body compared to the other assembly variants.
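 The NOP count can be checked with a little arithmetic. This is only a sketch of the cycle budget, not anything from the SDK: it assumes Cortex-M0 timings (1 cycle for SUB, 1 per NOP, 3 for a taken BNE) and a 16 MHz HFCLK, so one microsecond is 16 cycles. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Cortex-M0 cycle costs and an assumed 16 MHz core clock. */
enum {
    CYCLES_SUB       = 1,   /* SUB Rd, Rd, #1 */
    CYCLES_NOP       = 1,
    CYCLES_BNE_TAKEN = 3,   /* a taken branch refills the pipeline */
    CYCLES_PER_US    = 16   /* 16 MHz HFCLK, undivided */
};

/* Cycles for one pass through a SUB / n x NOP / BNE loop (hypothetical helper). */
static uint32_t loop_cycles(uint32_t nops)
{
    return CYCLES_SUB + nops * CYCLES_NOP + CYCLES_BNE_TAKEN;
}

/* Per-iteration error in tenths of a percent; positive means too long. */
static int32_t loop_error_permille(uint32_t nops)
{
    int32_t excess = (int32_t)loop_cycles(nops) - CYCLES_PER_US;
    return excess * 1000 / CYCLES_PER_US;
}
```

 Under those assumptions, loop_cycles(12) gives 16 cycles, exactly one microsecond, while a 14-NOP body (12 plus the two extra) gives 18 cycles, so each iteration runs 12.5% long from the NOPs alone. The rest of the measured error would have to come from the other two problems.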
 The code below generates exact delays for me with gcc-arm-none-eabi-4_9-2014q4 for power-of-two delays (1..2048). There is a constant 7-clock overhead, which probably includes triggering a capture.
extern void inline
__attribute__((__gnu_inline__,__always_inline__))
nrf_delay_us(uint32_t volatile number_of_us)
{
    /* Each pass: SUB (1) + 12 NOPs (12) + taken BNE (3) = 16 cycles. */
    __ASM volatile (
        "1:\tSUB %0, %0, #1\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "NOP\n\t"
        "BNE 1b\n\t"
        : "+r"(number_of_us)
    );
}
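
 The inlining attributes used above can be tried on a host compiler without any nRF headers. The following is a minimal sketch, assuming GCC or a compatible compiler; scale_by_16 and cycles_for are hypothetical stand-ins, not SDK functions. It should build and link even at -O0 with -std=c99, because always_inline forces the body into each caller and gnu_inline with extern means no out-of-line copy is emitted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * extern + gnu_inline: this definition is used only for inlining, and
 * no standalone copy of the function is compiled, even under -std=c99.
 * always_inline: inline even when optimization is off.
 * scale_by_16 is a hypothetical stand-in for nrf_delay_us.
 */
extern inline __attribute__((__gnu_inline__, __always_inline__))
uint32_t scale_by_16(uint32_t number_of_us)
{
    return number_of_us * 16u;
}

/* The body is expanded at this call site, so there is no call/return overhead. */
uint32_t cycles_for(uint32_t number_of_us)
{
    return scale_by_16(number_of_us);
}
```

 One consequence of this combination is that taking the function's address, or any call the compiler cannot inline, fails at compile or link time instead of silently producing an out-of-line call with different timing.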

pabigot · Accepted Answer

This wasn't a question, but the devzone API apparently lacks a category for "feedback to Nordic, please fix this sometime." 
 SDK 8.0.0 has a reworked nrf_delay_us that combines elements of both solutions in this question. I haven't tested it, but it looks plausible.