TWI/I2C bus / driver gets stuck occasionally

I'm working with an nRF52840 DK connected via SCL, SDA, GND and VDD to my custom PCB. I'm using nRF5_SDK_17.1.0_ddde560.

My apologies in advance if I missed anything obvious, as I am new to development with nRF.

I'm trying to communicate with the ADS1114 through a TCA9517A bus level shifter, as the ADS1114 has a supply voltage of 5V. The I2C communication often works, and once it works it seems to keep working, but sometimes the bus seems to get stuck. The project is based on the pca10056 twi_sensor example project; I mainly changed the registers that are written.

I've looked at the other posts related to TWI problems, but I did not find a solution to this issue there, unfortunately.

I am using external pullups (~3k) and have disabled the internal pull-ups.

I analysed the I2C bus with a logic analyser and oscilloscope, but I feel like I lack the experience to figure out what exactly is going on, so I would really appreciate any insights into the problem or what steps I could try to fix it. 

Failed communication example

With the scope images of the final CLKs before communication fails (blue is SDA, purple is SCL):

Successful transaction example

The error doesn't always occur at this spot, but the cases look like this.

When this error occurs, the code gets stuck at the 'while (m_xfer_done == false);' that follows the transaction.

To me the biggest difference seems to be the pull down of the SCL signal to a voltage level lower than the other pull-down levels. There's also what appears to be a small 'glitch' in the SDA before this happens, but I am not sure if this is related. I looked at the ADS1114 datasheet, and it says the device does not implement clock stretching nor does it drive the SCL pin. It is almost as if the master just stops driving the CLK.

I get the impression that if the initial transfer (configuration write, address pointer write, first few reads) is successful, it continues to work indefinitely. Therefore I might be able to just keep retrying until it succeeds by implementing a recovery system. Do you have any pointers on how I could implement this?

Source code

For some more context, here is the code of the main transactions (note, LM75B_ADDR is the ADS1114 address):

/* Mode for ADS1114. */
#define OPERATING_MODE_BYTE1 0b10000100 //0x84
#define OPERATING_MODE_BYTE2 0b10000011 //0x83

void ADS1114_CONFIG(void)
{
    ret_code_t err_code;

    /* Writing to LM75B_REG_CONF "0" set temperature sensor in NORMAL mode. */
    uint8_t reg[3] = {LM75B_REG_CONF, OPERATING_MODE_BYTE1, OPERATING_MODE_BYTE2};
    err_code = nrf_drv_twi_tx(&m_twi, LM75B_ADDR, reg, sizeof(reg), false);
    APP_ERROR_CHECK(err_code);
    while (m_xfer_done == false);

    /* Writing to LM75B_REG_CONF "0" set temperature sensor in NORMAL mode. */
    uint8_t reg2[1] = {0b00000011};
    err_code = nrf_drv_twi_tx(&m_twi, LM75B_ADDR, reg2, sizeof(reg2), false);
    APP_ERROR_CHECK(err_code);
    while (m_xfer_done == false);
}

void twi_handler(nrf_drv_twi_evt_t const * p_event, void * p_context)
{
    switch (p_event->type)
    {
        case NRF_DRV_TWI_EVT_DONE:
            //if (p_event->xfer_desc.type == NRF_DRV_TWI_XFER_RX)
            //{
            //data_handler(m_sample);
            //}
            m_xfer_done = true;
            break;
        default:
            break;
    }
}

void twi_init (void)
{
    ret_code_t err_code;

    const nrf_drv_twi_config_t twi_lm75b_config = {
        .scl = ARDUINO_SCL_PIN,
        .sda = ARDUINO_SDA_PIN,
        .frequency = NRF_DRV_TWI_FREQ_100K,
        .interrupt_priority = APP_IRQ_PRIORITY_HIGH,
        .clear_bus_init = true
    };

    err_code = nrf_drv_twi_init(&m_twi, &twi_lm75b_config, twi_handler, NULL);
    APP_ERROR_CHECK(err_code);

    nrf_drv_twi_enable(&m_twi);
}

static void read_sensor_data()
{
    m_xfer_done = false;

    /* Read 1 byte from the specified address - skip 3 bits dedicated for fractional part of temperature. */
    //ret_code_t err_code = nrf_drv_twi_rx(&m_twi, LM75B_ADDR, &m_sample, sizeof(m_sample));
    ret_code_t err_code = nrf_drv_twi_rx(&m_twi, LM75B_ADDR, samples, sizeof(samples)/sizeof(samples[0]));

    APP_ERROR_CHECK(err_code);
}

Sidenotes

- Occasionally, the first write fails entirely and this could be seen on the scope, although I am not 100% sure if this is related to the issue (here, purple was a custom output pin set high at start):

- Sometimes, if I keep the development kit running for a while, I get a random J-Link error. This might be because it is stuck waiting for the transaction to finish.

Some more context:

- I am using SEGGER Embedded Studio for ARM V7.30 on Windows 10

Any advice and help is appreciated. If more information is required, please let me know.

Kind regards,

Frederik

  • Hi,

     

    At the moment, you're only resetting the variable if the TWI transaction gets an ACK:

    void twi_handler(nrf_drv_twi_evt_t const * p_event, void * p_context)
    {
        switch (p_event->type)
        {
            case NRF_DRV_TWI_EVT_DONE:
                //if (p_event->xfer_desc.type == NRF_DRV_TWI_XFER_RX)
                //{
                //data_handler(m_sample);
                //}
                m_xfer_done = true;
                break;
            default:
                break;
        }
    }

    If you do not reset the flag on other error events, you will hang if those occur.
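A minimal sketch of a handler that sets the flag on every event and records whether the transfer failed, so the waiting loop can exit and the caller can react. The enum and struct below are hypothetical stand-ins for the SDK's nrf_drv_twi_evt_type_t and nrf_drv_twi_evt_t (whose values are NRF_DRV_TWI_EVT_DONE, NRF_DRV_TWI_EVT_ADDRESS_NACK and NRF_DRV_TWI_EVT_DATA_NACK), and m_xfer_failed is an assumed extra flag that is not part of the example project.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the SDK event types so the sketch compiles
 * on its own; a real project would use the nrf_drv_twi definitions. */
typedef enum {
    TWI_EVT_DONE,          /* mirrors NRF_DRV_TWI_EVT_DONE         */
    TWI_EVT_ADDRESS_NACK,  /* mirrors NRF_DRV_TWI_EVT_ADDRESS_NACK */
    TWI_EVT_DATA_NACK      /* mirrors NRF_DRV_TWI_EVT_DATA_NACK    */
} twi_evt_type_t;

typedef struct {
    twi_evt_type_t type;
} twi_evt_t;

static volatile bool m_xfer_done   = false;
static volatile bool m_xfer_failed = false;  /* assumed extra flag */

void twi_handler_sketch(twi_evt_t const * p_event, void * p_context)
{
    (void)p_context;

    switch (p_event->type)
    {
        case TWI_EVT_DONE:
            m_xfer_failed = false;
            break;
        case TWI_EVT_ADDRESS_NACK:
        case TWI_EVT_DATA_NACK:
        default:
            /* Any event other than DONE is treated as a failed transfer. */
            m_xfer_failed = true;
            break;
    }

    /* Set the flag for every event so the waiting loop always exits. */
    m_xfer_done = true;
}
```

After the wait loop, the caller would check m_xfer_failed and decide whether to retry. Note that this only helps when the driver actually delivers an event; it cannot release a wait if no event is ever generated.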

     

    Q1: How often do you get this NACK from the sensor?

    Q2: Have you tried reducing the TWI speed to see if this has an effect on the problem?

     

    Kind regards,

    Håkon

  • Dear Håkon, thank you for your swift reply.

    Ah, that makes sense. When I was debugging initially and stepping into the handler, it seemed that it would step into the m_xfer_done = true line and still remain stuck. That gave me the impression that it was handled successfully. I will add a reset of the flag to the other events and see if it stops hanging.

    Q1: The example did not handle the other events, so is it correct that the example basically assumed the transaction would never fail?

    As a sidenote, this is what every periodic read from the sensor, when everything works, looks like:

    I must admit that I am not sure whether the NAK is positive or negative, but since it 'works' I assumed it was positive.

    Every successful write ends in an ACK.

    A1: I would say that at every new debugging attempt, the failure rate (NACK) is probably close to 50%. I now realise this is more often than the 'occasionally' mentioned in the title. I must add that this debugging reset happens without disconnecting the PCB with the ADC, so I might need to look into whether the ADC needs a proper reset before trying to reconfigure it.

    A2: Currently the speed is set to 100 kHz, the only other defined options were 250 kHz and 400 kHz. Can I lower the frequency by replacing the definition with a lower frequency?

    Kind regards,

    Frederik

  • As a quick test, I added the following to the twi_handler:

    default:
    m_xfer_done = true;
    break;
    }

    Unfortunately, the problem still occurs:

    And it remains stuck on the while (m_xfer_done == false); 

    edit: I don't think it's the lack of an ACK that is causing the issue. The last byte contains 8 CLK pulses while the others contain 9. I think something goes wrong before the handler is called.

  • Update 2.

    Some progress:

    I implemented an I2C general call with the reset command, as described in the ADS1114 datasheet, and now I can repeatedly finish the first write transaction successfully. I must admit that I am not 100% sure why.

    The new issue is that, right after the first byte write, the program hits an NRF_BREAKPOINT_COND; breakpoint with the following log error:

    <error> app: ERROR 17 [NRF_ERROR_BUSY] at C:\nRF5_SDK_17.1.0_ddde560\examples\peripheral\twi_sensor_edited\main.c:144

    for context, this is the block around line 144:

    I am assuming the error indicates that the TWI line is busy, but for now I haven't been able to find the exact definition of error code 17. 

    This strikes me as odd, as the while loop in line 137 should make sure the previous transfer is finished.
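One possible explanation, offered as an assumption: the wait loop only blocks if the flag was cleared before the transfer was started. In ADS1114_CONFIG as posted in the question, m_xfer_done is not cleared before nrf_drv_twi_tx(), so once any transfer has completed, later waits can fall through immediately and the next call can find the driver still busy. The sketch below clears the flag first and bounds the wait. The start_xfer_fn hook is a hypothetical stand-in for a call such as nrf_drv_twi_tx(), and the two fake_start_* functions exist only to exercise the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool m_xfer_done = false;

/* Hypothetical hook standing in for a call such as nrf_drv_twi_tx();
 * returns true if the transfer was started. */
typedef bool (*start_xfer_fn)(void);

typedef enum { XFER_OK, XFER_START_FAILED, XFER_TIMEOUT } xfer_result_t;

xfer_result_t run_transfer(start_xfer_fn start, uint32_t max_polls)
{
    /* Clear the flag BEFORE starting, so a stale 'true' left over from
     * an earlier transfer cannot satisfy the wait below. */
    m_xfer_done = false;

    if (!start())
    {
        return XFER_START_FAILED;
    }

    /* Bounded wait instead of an endless while loop. */
    while (!m_xfer_done)
    {
        if (max_polls-- == 0)
        {
            return XFER_TIMEOUT;
        }
    }
    return XFER_OK;
}

/* Test doubles: one completes immediately (as the handler would),
 * one never signals completion. */
static bool fake_start_completes(void) { m_xfer_done = true; return true; }
static bool fake_start_hangs(void)     { return true; }
```

With this shape, a transfer that never completes ends in XFER_TIMEOUT rather than hanging, which gives the application a place to attempt recovery.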

    This is the current code in the TWI handler, I made sure to remove the m_xfer_done = true in the default case:

  • Hi,

     

    As you have external pull-ups, I would assume that this is not an assert/software-reset that happens?

    Your clock line is held low, and if the i2c sensor is the source of this; the sensor is performing clock stretching.

    Does it ever recover from this low scenario on the SCL pin?

     

    *edit* Just saw your "update 2" post.

    Is there any timing requirements on this external sensor that might be violated? Does the issue occur randomly, or consistently after x amount of seconds?

     

    Kind regards,

    Håkon

  • Hi Håkon,

    After some more debugging, I am now at the following:

    1. I can successfully read the 'high threshold' register from the ADC (2 bytes) every 500 ms

    2. I can only read the 'measurement' register (2 bytes) twice before the master pulls down the SDA line and the slave (sensor PCB) pulls down the SCL. This occurs after the NAK and before a STOP condition.

    Example:

    I placed a series resistor on both lines, with pull-ups on both sides, to measure which side was pulling down which line.

    When I read from the same register, but only one byte instead of the full two bytes, I can once again successfully read every 500 ms.

    To me, the above indicates that the TWI settings and the bus capacitance are not the issue. My current theory is that the sensor pulls down the SCL line, which causes the TWI driver to get stuck / pull down the SDA. Do you know more about the behaviour of the TWI driver when this situation occurs?

    The odd thing is that the ADS1114 datasheet clearly states that it does not perform clock stretching, so I am currently trying to figure out why reading one register is fine while with other registers it is not, including why it works with 1 byte and not the full two bytes.

    Do you perhaps know of other scenarios where this behaviour occurred?

    Kind regards,

    Frederik

    EDIT: I managed to make it work by resetting the ADS1114 with the 'general call' reset mentioned in the datasheet after every full two-byte read:

    1. General call + reset

    2. Write to ADS1114 configuration register with desired configuration

    3. Point ADS1114 pointer to the measurement register (by writing 0x00 to the IC)

    4. Read two bytes from the ADS1114

    Repeat.
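The four steps above can be sketched as the byte frames sent in each transaction, plus a helper that combines the two bytes read back. The general call address (0x00), the reset command (0x06) and the register pointer values are assumptions taken from the ADS111x datasheet and should be checked against it; the configuration bytes are the ones defined in the question. All names are hypothetical.

```c
#include <stdint.h>

/* Assumed values from the ADS111x datasheet; verify before use. */
#define I2C_GENERAL_CALL_ADDR   0x00  /* address used for step 1        */
#define ADS1114_CMD_RESET       0x06  /* general call reset command     */
#define ADS1114_REG_CONVERSION  0x00  /* measurement (conversion) reg.  */
#define ADS1114_REG_CONFIG      0x01  /* configuration register         */

/* Step 1: sent to I2C_GENERAL_CALL_ADDR. */
const uint8_t ads1114_reset_frame[]   = { ADS1114_CMD_RESET };

/* Step 2: sent to the ADS1114 address; register pointer plus the two
 * configuration bytes from the question (0x84, 0x83). */
const uint8_t ads1114_config_frame[]  = { ADS1114_REG_CONFIG, 0x84, 0x83 };

/* Step 3: sent to the ADS1114 address; selects the register to read. */
const uint8_t ads1114_pointer_frame[] = { ADS1114_REG_CONVERSION };

/* Step 4: combine the two bytes read back (most significant byte first)
 * into a signed 16-bit result. The final cast assumes a two's complement
 * target, which holds for the Cortex-M4 in the nRF52840. */
int16_t ads1114_raw_from_bytes(uint8_t const rx[2])
{
    uint16_t raw = (uint16_t)(((uint16_t)rx[0] << 8) | rx[1]);
    return (int16_t)raw;
}
```

Each frame would be passed to nrf_drv_twi_tx() with the matching address, and the two bytes from nrf_drv_twi_rx() passed to ads1114_raw_from_bytes().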

    I am not satisfied with this approach, but I am glad that it actually works now. 

    As this is not related to the nRF52840 anymore, I will contact the ADC manufacturer. I will report back if I find a better solution.

    Thank you for your time and assistance, Håkon.

  • Hi,

     

    FK42 said:

    I am not satisfied with this approach, but I am glad that it actually works now. 

    I fully understand that. The fact that the slave pulls SCL low indicates that something is going wrong on the sensor side, i.e. a clock stretch occurring for some reason.

    Another approach can be to try to clear the bus from the master side, by running nrf_drv_twi_uninit(), then set nrf_drv_twi_config_t::clear_bus_init when re-initializing the twi peripheral.
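For illustration, a bus clear of this kind typically clocks SCL up to nine times while SDA is held low by a slave, then generates a STOP condition. The sketch below shows that general procedure under stated assumptions; it is not the SDK's implementation. The i2c_pins_t hooks are hypothetical, and the sim_* functions model a slave that releases SDA after a set number of clock pulses, purely to exercise the routine.

```c
#include <stdbool.h>

/* Hypothetical pin-access hooks; on real hardware these would drive
 * open-drain GPIOs and wait roughly half a bit period. */
typedef struct {
    void (*set_scl)(bool high);
    void (*set_sda)(bool high);
    bool (*read_sda)(void);
    void (*delay_half_bit)(void);
} i2c_pins_t;

/* Returns true if SDA reads high (bus released) afterwards. */
bool i2c_bus_clear(i2c_pins_t const * pins)
{
    /* Release both lines first. */
    pins->set_sda(true);
    pins->set_scl(true);
    pins->delay_half_bit();

    /* Clock SCL up to nine times while a slave still holds SDA low. */
    for (int i = 0; i < 9 && !pins->read_sda(); i++)
    {
        pins->set_scl(false);
        pins->delay_half_bit();
        pins->set_scl(true);
        pins->delay_half_bit();
    }

    /* STOP condition: SDA rises while SCL is high. */
    pins->set_sda(false);
    pins->delay_half_bit();
    pins->set_sda(true);
    pins->delay_half_bit();

    return pins->read_sda();
}

/* Simulated bus: the slave holds SDA low until it has seen
 * sim_pulses_needed rising edges on SCL. */
static int  sim_pulses_needed = 3;
static bool sim_scl           = true;
static bool sim_master_sda    = true;

static void sim_set_scl(bool high)
{
    if (high && !sim_scl && sim_pulses_needed > 0)
    {
        sim_pulses_needed--;  /* count one rising edge */
    }
    sim_scl = high;
}
static void sim_set_sda(bool high) { sim_master_sda = high; }
static bool sim_read_sda(void)     { return sim_master_sda && sim_pulses_needed == 0; }
static void sim_delay(void)        { }

const i2c_pins_t sim_pins = { sim_set_scl, sim_set_sda, sim_read_sda, sim_delay };
```

This procedure targets a slave that is holding SDA low. If a device holds SCL low, as observed in this thread, the master cannot clock it out this way, so a false return value would point to a problem that needs a different fix.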

     

    According to the datasheet of the sensor, it should not do clock stretching, so there's something strange happening.

     

    Kind regards,

    Håkon

  • Dear Håkon,

    I found the issue, and it has nothing to do with any of the ICs. It was a plain human error on my part:

    I had two other ICs connected to the I2C bus; unfortunately, these two ICs were connected with SCL and SDA swapped. I can only assume that it worked fine for the first few bytes (by chance), but after the second measurement one of the other ICs started pulling the bus low.

    I fixed those connections, and now everything works perfectly, although the PCB did not get any prettier in the process.

    Thank you once again for your time and consideration, Håkon.

    Kind regards,
