This post is older than 2 years and might not be relevant anymore
More Info: Consider searching for newer posts

Call to mqtt_publish blocking?

I have a hard-to-reproduce bug in my MQTT application that I have investigated enough to realize that my MQTT thread is stuck in a call to mqtt_publish().  I've verified that no other threads make any calls into MQTT API functions or reference the MQTT library's fd handle. 

Are there situations where a call to mqtt_publish are *expected* to block?  I read into the library a bit and the only thing I see obviously is the mqtt_mutex_lock which shouldn't be my problem since all MQTT library calls are coming from one thread.

In my case, I do have quite a lot of unsolicited modem status messages being handled in a different thread via it's own fd handle that was opened like this:

at_socket_fd = socket(AF_LTE, 0, NPROTO_AT);

Do I need to treat the MQTT and AT handles as "one" for some purposes?  I had assumed not, but I might have missed something and I have a hunch that my lockup might be happening when both threads are accessing their handles at the same time.  I haven't proven this yet, I'm just suspicious since the lockup started happening vaguely around the same time I enabled all the unsolicited AT reporting.

My setup is using modem firmware mfw_nrf9160_1.0.0 and NRF v0.4.0.  I was waiting for NRF v1.0.0 before jumping up since I'm going to have to rewrite my FOTA code to the new library.  I'll likely be doing that sometime this week.

Mildly related: the MQTT library logging is way too quiet on LOG_INF and way too noisy (and buggy) on LOG_DBG.  By "buggy" I mean that it makes a lot of assumptions about terminated strings that are wrong.

Parents
  • Hi,

     

    Are there situations where a call to mqtt_publish are *expected* to block?  I read into the library a bit and the only thing I see obviously is the mqtt_mutex_lock which shouldn't be my problem since all MQTT library calls are coming from one thread.

    Yes, there are certain scenarios where it will block, but only if there's a bug in the mutex handling.

    If you look into the function:

    int mqtt_publish(struct mqtt_client *client,
    		 const struct mqtt_publish_param *param)
    {
    	int err_code;
    	const u8_t *packet;
    	u32_t packetlen;
    
    	NULL_PARAM_CHECK(client);
    	NULL_PARAM_CHECK(param);
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: >> Topic size 0x%08x, "
    		 "Data size 0x%08x", client, client->state,
    		 param->message.topic.topic.size,
    		 param->message.payload.len);
    
    	mqtt_mutex_lock();
    
    	err_code = verify_tx_state(client);
    	if (err_code == 0) {
    		err_code = publish_encode(client, param, &packet, &packetlen);
    
    		if (err_code == 0) {
    			err_code = client_write(client, packet, packetlen);
    		}
    	}
    
    	mqtt_mutex_unlock();
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: << result 0x%08x",
    			 client, client->state, err_code);
    
    	return err_code;
    }

     

    The two interesting calls here are mqtt_mutex_lock and _unlock, which expand to:

    /**@brief Acquire lock on the module specific mutex, if any.
     *
     * @details This is assumed to be a blocking method until the acquisition
     *          of the mutex succeeds.
     */
    static inline void mqtt_mutex_lock(void)
    {
    	(void)k_mutex_lock(&mqtt_mutex, K_FOREVER);
    }
    
    /**@brief Release the lock on the module specific mutex, if any.
     */
    static inline void mqtt_mutex_unlock(void)
    {
    	k_mutex_unlock(&mqtt_mutex);
    }

     

    Locking a mutex that is already locked will make you hang tight until it's unlocked:

    https://docs.zephyrproject.org/1.9.0/kernel/synchronization/mutexes.html#locking-a-mutex

    I do not see any scenarios where you're potentially deadlocked (but cannot rule it out).

    Do you know how you end up in this scenario?

    Using v0.4.0 together with modem fw 1.0.0 does not sound like a good match. You should atleast update bsdlib to the matching revision, as modem and bsdlib is very tightly coupled.

     

    Kind regards,

    Håkon

Reply
  • Hi,

     

    Are there situations where a call to mqtt_publish are *expected* to block?  I read into the library a bit and the only thing I see obviously is the mqtt_mutex_lock which shouldn't be my problem since all MQTT library calls are coming from one thread.

    Yes, there are certain scenarios where it will block, but only if there's a bug in the mutex handling.

    If you look into the function:

    int mqtt_publish(struct mqtt_client *client,
    		 const struct mqtt_publish_param *param)
    {
    	int err_code;
    	const u8_t *packet;
    	u32_t packetlen;
    
    	NULL_PARAM_CHECK(client);
    	NULL_PARAM_CHECK(param);
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: >> Topic size 0x%08x, "
    		 "Data size 0x%08x", client, client->state,
    		 param->message.topic.topic.size,
    		 param->message.payload.len);
    
    	mqtt_mutex_lock();
    
    	err_code = verify_tx_state(client);
    	if (err_code == 0) {
    		err_code = publish_encode(client, param, &packet, &packetlen);
    
    		if (err_code == 0) {
    			err_code = client_write(client, packet, packetlen);
    		}
    	}
    
    	mqtt_mutex_unlock();
    
    	MQTT_TRC("[CID %p]:[State 0x%02x]: << result 0x%08x",
    			 client, client->state, err_code);
    
    	return err_code;
    }

     

    The two interesting calls here are mqtt_mutex_lock and _unlock, which expand to:

    /**@brief Acquire lock on the module specific mutex, if any.
     *
     * @details This is assumed to be a blocking method until the acquisition
     *          of the mutex succeeds.
     */
    static inline void mqtt_mutex_lock(void)
    {
    	(void)k_mutex_lock(&mqtt_mutex, K_FOREVER);
    }
    
    /**@brief Release the lock on the module specific mutex, if any.
     */
    static inline void mqtt_mutex_unlock(void)
    {
    	k_mutex_unlock(&mqtt_mutex);
    }

     

    Locking a mutex that is already locked will make you hang tight until it's unlocked:

    https://docs.zephyrproject.org/1.9.0/kernel/synchronization/mutexes.html#locking-a-mutex

    I do not see any scenarios where you're potentially deadlocked (but cannot rule it out).

    Do you know how you end up in this scenario?

    Using v0.4.0 together with modem fw 1.0.0 does not sound like a good match. You should atleast update bsdlib to the matching revision, as modem and bsdlib is very tightly coupled.

     

    Kind regards,

    Håkon

Children
No Data
Related