ESP mbedTLS not detecting data drop

Amtac Quentin
Posts: 1
Joined: Thu Feb 07, 2019 7:29 am

Postby Amtac Quentin » Thu Feb 07, 2019 7:40 am

Hello all,

I am having trouble getting mbedTLS in ESP-IDF v3.1.2 to detect when the data connection is lost. I am setting up an SSL connection to our company's server and communicating by sending JSON packets. I have set the connection to non-blocking, and I call mbedtls_ssl_read and mbedtls_ssl_write from timers. With an active connection there is no issue: messages are sent and received fine. But once I turn off data on our remote router to simulate a 4G dropout, mbedtls_ssl_read only ever returns MBEDTLS_ERR_SSL_WANT_READ, and mbedtls_ssl_write keeps accepting messages and reports that they were sent successfully.

I believe the issue has something to do with setting the connection as KEEP_ALIVE; however, I have not found much support for this in mbedTLS. Has anybody had a similar experience and been able to overcome it?

Thank you in advance for any assistance you may provide,
Quentin

ESP_Angus
Posts: 2344
Joined: Sun May 08, 2016 4:11 am

Re: ESP mbedTLS not detecting data drop

Postby ESP_Angus » Fri Feb 08, 2019 1:39 am

Hi Quentin,

You're right that without some indication that the connection is closed (such as: the local wifi network interface goes down, an RST packet is received, or an ICMP port unreachable packet is received) then the LWIP TCP/IP stack will by default keep waiting for the connection to come back.

TCP/IP is designed to be resilient on lossy networks, and it can take a long time for the default settings to tell the difference between a lossy network and one which is gone entirely.

(I'm going to talk about TCP send() and recv() in this reply. This is what mbedTLS is using at the lower network layer, and permanent failures here will be translated into permanent failures at the mbedTLS layer also.)

For send(), this means the packet is queued in the TCP/IP layer (so send() returns success), and the TCP/IP layer then keeps retransmitting that packet until either an ACK is received or the retransmission retry limit is hit. At that point the connection is abandoned and calls to send() or recv() fail with errno 113 (ECONNABORTED). Because each retry backs off exponentially (the retry time doubles), at the default setting of 12 retries this takes over an hour to fail. You can reduce the number of retries in menuconfig under LWIP settings (see link above), but this is probably not the ideal place to set a timeout.
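As a rough illustration of why doubling retries add up so quickly, here is a minimal sketch that sums a pure doubling backoff. The 1-second initial timeout and the helper name total_backoff_seconds are assumptions made for illustration only; the real LWIP stack derives its retransmission timeout from the measured round-trip time and may cap the backoff, so actual figures will differ.

```c
#include <stdint.h>

// Total seconds spent waiting across `retries` attempts when the wait
// starts at `initial_s` and doubles after every attempt (no cap applied).
// Hypothetical helper for illustration; it is not part of LWIP.
uint64_t total_backoff_seconds(uint32_t initial_s, unsigned retries)
{
    uint64_t total = 0;
    uint64_t wait = initial_s;

    for (unsigned i = 0; i < retries; i++) {
        total += wait;  // time waited for this attempt
        wait *= 2;      // exponential backoff: next wait is twice as long
    }
    return total;
}
```

Under these assumed values, total_backoff_seconds(1, 12) comes to 4095 seconds, a little over 68 minutes, which is in line with the "over an hour" figure.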

For recv(), the connection will simply keep waiting. The exception is when a send is also being retried in the background: in that case recv() fails as well, as soon as the retransmit retries run out.

As you mention, one option here is to enable TCP socket keepalives. To enable keepalives call setsockopt() as follows:

Code:

int enable = 1;
int err = setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable));

However, the default keepalive settings are very conservative: the socket waits 2 hours before sending the first keepalive probe (TCP_KEEPIDLE), then sends another probe every 75 seconds (TCP_KEEPINTVL), and finally times out if nothing is received after 9 probes (TCP_KEEPCNT).
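To put a number on that, the worst-case time for keepalives to notice a dead peer is roughly the idle time plus one probe interval per allowed probe. The sketch below just does that arithmetic; keepalive_detect_seconds is a hypothetical helper name, and the formula is an approximation of the stack's behaviour rather than an exact guarantee.

```c
#include <stdint.h>

// Approximate seconds from the last traffic on the socket until keepalives
// declare the connection dead: the idle period, then `count` probes spaced
// `intvl_s` seconds apart. Hypothetical helper for illustration only.
uint32_t keepalive_detect_seconds(uint32_t idle_s, uint32_t intvl_s, uint32_t count)
{
    return idle_s + intvl_s * count;
}
```

With the defaults described above, keepalive_detect_seconds(7200, 75, 9) gives 7875 seconds, or about 2 hours and 11 minutes before the dropout is noticed.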

To make this more aggressive you can set additional socket options:

Code:

int keepidle = 10; // seconds of idleness before the first keepalive probe
err = setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &keepidle, sizeof(keepidle));

int keepintvl = 5; // seconds between subsequent keepalive probes
err = setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &keepintvl, sizeof(keepintvl));

int keepcount = 4; // number of unanswered keepalive probes before the connection is dropped
err = setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &keepcount, sizeof(keepcount));

However, keep in mind that the TCP layer may not be the best place to time out. After all, for TLS purposes a TCP socket that responds to keepalives but never sends any TLS messages is not really valid. This can happen if for some reason the server keeps the socket open, but nothing is alive to talk to it. It may be better for you to implement an application-level timeout ("time out if the other side doesn't send the expected application response to a particular application request within the expected number of seconds") and then close both the TLS connection and the underlying socket if this happens. After all, your specific application knows what kind of response times are required for whatever functionality it implements, and what a valid response will look like.
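One way to sketch such an application-level timeout is a small watchdog that records when the last valid application response arrived and reports when too much time has passed. Everything here (the app_watchdog_t type, the function names, the millisecond tick source) is hypothetical and only meant to show the shape of the idea; what counts as a "valid response" and how you close the TLS session are up to your application.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical application-level watchdog. Times are in milliseconds from
// any monotonic tick source the application already has.
typedef struct {
    uint32_t last_rx_ms;  // when the last valid application response arrived
    uint32_t timeout_ms;  // how long the application tolerates silence
} app_watchdog_t;

// Start (or restart) the watchdog at time `now_ms`.
void app_watchdog_init(app_watchdog_t *w, uint32_t now_ms, uint32_t timeout_ms)
{
    w->last_rx_ms = now_ms;
    w->timeout_ms = timeout_ms;
}

// Call whenever a valid application response has been received.
void app_watchdog_feed(app_watchdog_t *w, uint32_t now_ms)
{
    w->last_rx_ms = now_ms;
}

// True once `timeout_ms` or more has elapsed since the last valid response.
// Unsigned subtraction keeps this correct across tick counter wraparound.
bool app_watchdog_expired(const app_watchdog_t *w, uint32_t now_ms)
{
    return (uint32_t)(now_ms - w->last_rx_ms) >= w->timeout_ms;
}
```

In the timer that calls mbedtls_ssl_read, you would feed the watchdog each time a complete, valid response is parsed, and tear down the TLS connection and the socket once app_watchdog_expired() returns true.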

(Note that everything mentioned above is independent of the socket "non-blocking mode" and any "read timeout" set on the socket via mbedTLS. Both of these only control how long a particular API call waits before returning to the caller, although they may be useful when implementing an application-level timeout.)
