This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Call to mqtt_publish blocking?

I have a hard-to-reproduce bug in my MQTT application that I have investigated enough to realize that my MQTT thread is stuck in a call to mqtt_publish().  I've verified that no other threads make any calls into MQTT API functions or reference the MQTT library's fd handle. 

Are there situations where a call to mqtt_publish are *expected* to block?  I read into the library a bit and the only thing I see obviously is the mqtt_mutex_lock which shouldn't be my problem since all MQTT library calls are coming from one thread.

In my case, I do have quite a lot of unsolicited modem status messages being handled in a different thread via it's own fd handle that was opened like this:

at_socket_fd = socket(AF_LTE, 0, NPROTO_AT);

Do I need to treat the MQTT and AT handles as "one" for some purposes?  I had assumed not, but I might have missed something and I have a hunch that my lockup might be happening when both threads are accessing their handles at the same time.  I haven't proven this yet, I'm just suspicious since the lockup started happening vaguely around the same time I enabled all the unsolicited AT reporting.

My setup is using modem firmware mfw_nrf9160_1.0.0 and NRF v0.4.0.  I was waiting for NRF v1.0.0 before jumping up since I'm going to have to rewrite my FOTA code to the new library.  I'll likely be doing that sometime this week.

Mildly related: the MQTT library logging is way too quiet on LOG_INF and way too noisy (and buggy) on LOG_DBG.  By "buggy" I mean that it makes a lot of assumptions about terminated strings that are wrong.

  • Hi,

     

    Are there situations where a call to mqtt_publish are *expected* to block?  I read into the library a bit and the only thing I see obviously is the mqtt_mutex_lock which shouldn't be my problem since all MQTT library calls are coming from one thread.

    Yes, there are certain scenarios where it will block, but only if there's a bug in the mutex handling.

    If you look into the function:

    int mqtt_publish(struct mqtt_client *client,
    		 const struct mqtt_publish_param *param)
    {
    	int err_code;
    	const u8_t *packet;
    	u32_t packetlen;
    
    	NULL_PARAM_CHECK(client);
    	NULL_PARAM_CHECK(param);
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: >> Topic size 0x%08x, "
    		 "Data size 0x%08x", client, client->state,
    		 param->message.topic.topic.size,
    		 param->message.payload.len);
    
    	mqtt_mutex_lock();
    
    	err_code = verify_tx_state(client);
    	if (err_code == 0) {
    		err_code = publish_encode(client, param, &packet, &packetlen);
    
    		if (err_code == 0) {
    			err_code = client_write(client, packet, packetlen);
    		}
    	}
    
    	mqtt_mutex_unlock();
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: << result 0x%08x",
    			 client, client->state, err_code);
    
    	return err_code;
    }

     

    The two interesting calls here are mqtt_mutex_lock and _unlock, which expand to:

    /**@brief Acquire lock on the module specific mutex, if any.
     *
     * @details This is assumed to be a blocking method until the acquisition
     *          of the mutex succeeds.
     */
    static inline void mqtt_mutex_lock(void)
    {
    	(void)k_mutex_lock(&mqtt_mutex, K_FOREVER);
    }
    
    /**@brief Release the lock on the module specific mutex, if any.
     */
    static inline void mqtt_mutex_unlock(void)
    {
    	k_mutex_unlock(&mqtt_mutex);
    }

     

    Locking a mutex that is already locked will make you hang tight until it's unlocked:

    https://docs.zephyrproject.org/1.9.0/kernel/synchronization/mutexes.html#locking-a-mutex

    I do not see any scenarios where you're potentially deadlocked (but cannot rule it out).

    Do you know how you end up in this scenario?

    Using v0.4.0 together with modem fw 1.0.0 does not sound like a good match. You should atleast update bsdlib to the matching revision, as modem and bsdlib is very tightly coupled.

     

    Kind regards,

    Håkon

  • Still not sure how I end up in this scenario.  It's just a routine publish like any other.  Our application generates them maybe every couple minutes, and there's nothing special about the data in this one.  I just leave the system running long enough and eventually one of them blocks.

    I will upgrade to v1.0.0 and try to reproduce this again sometime this week, but I'm reasonably sure that I was having the same issue while on v0.4.0 and the prior modem firmware as well.

    I added some extra LOG_INF through mqtt_publish so that a clean run looks like this:

    [00:01:40.986,450] <inf> net_mqtt: [CID 0x20020300]:[State 0x06]: >> Topic size 0x00000023, Data size 0x000001fe
    [00:01:40.986,480] <inf> net_mqtt: Got lock!
    [00:01:40.986,511] <inf> net_mqtt: publish_encode=0
    [00:01:40.987,243] <inf> net_mqtt: client_write=0
    [00:01:40.988,342] <inf> net_mqtt: client_write=0
    [00:01:40.988,372] <inf> net_mqtt: [CID 0x20020300]:[State 0x06]: << result 0x00000000

    I was mostly trying to discover which step it was blocking on.

    I got it to die again overnight, and it was not at the mutex.  Even more curious, it apparently did eventually return from mqtt_publish about an hour later.  Usually when I'm at the bench I only give it a few minutes before I power cycle.  Note the time stamps in this log:

    [05:46:12.394,348] <inf> net_mqtt: [CID 0x20020300]:[State 0x06]: >> Topic size 0x00000028, Data size 0x00000015
    [05:46:12.394,378] <inf> net_mqtt: Got lock!
    [05:46:12.394,409] <inf> net_mqtt: publish_encode=0
    [... all of my debug code triggered here for the stuck thread ...]
    [05:54:15.280,029] <inf> net_mqtt: client_write=0
    [06:51:55.478,759] <inf> net_mqtt: client_write=0
    [06:51:55.478,759] <inf> net_mqtt: [CID 0x20020300]:[State 0x06]: << result 0x00000000

    Let me bring everything up to the latest BSD libraries and see if that changes anything.

  • That's a large delay.

    Looks like you're stuck in the bsd call to send (mqtt_transport_write). The transport sends in no flags to the send() function, meaning that it would block (unless setsockopt() with non-blocking has been enabled), and it looks like it then uses just shy of 1 hour to complete. Let me know if you still see issues after upgrading bsdlib.

     

    Kind regards,

    Håkon

  • I haven't had the issue at all since I upgraded the NRFX BSD library to 1.0.  It was likely due to either a mis-match between the BSD library and modem firmware, or simply a bug that was fixed in the newer BSD library.

  • Håkan, how do you enable MQTT_TRC so you get printouts of that kind?

Related