Improving HA Failures with TCP Timeouts

Mon 02 March 2015 by John Eckersberg Tags openstack fedora

Most people probably don't give too much thought to our old friend, TCP. For the most part, it's just a transparent piece of infrastructure that sits between clients and servers, and we focus most of our time paying attention to the endpoints, and rightly so. Almost always it does the right thing, and we don't care. Life is good.

However, sometimes you do need to care. This is about one of those times.

Background

This is what a typical HA deployment of Red Hat Enterprise Linux OpenStack Platform looks like, focusing specifically on the RabbitMQ service:

HA deployment

There are three controller nodes that are part of a pacemaker cluster. There is a virtual IP (VIP) for AMQP traffic that floats between the three controllers and is controlled by pacemaker. HAProxy is bound to listen on the VIP on all three controller nodes. At any point in time, only one controller will have the VIP, and only this node will be responsible for directing AMQP to the backend RabbitMQ servers via HAProxy. This HAProxy instance balances the traffic across all three instances of rabbitmq-server, one on each controller.

VIP Failover

At this point, everything is happily chugging along. There's a bunch of clients (or in AMQP parlance, consumers and producers) connected through the VIP, through HAProxy, ultimately to RabbitMQ on the backend. What happens if the VIP fails over to another node?

For the clients, it's pretty straightforward. The client will try to write to the server on the existing connection. (This is either a "normal" write, or via a TCP keepalive probe, which is a related topic that I won't go over here. Just assume it's there). Since the packet gets routed to a new host (whichever one newly picked up the VIP), and the existing TCP session is not recognized by the new host, the client will receive a TCP RST. The client cleans up the connection and reconnects to the VIP to establish a valid session on the new server.

However, the server behavior is more interesting. Consider:

There's a connection, through the VIP, through HAProxy, to RabbitMQ
The VIP fails over
RabbitMQ writes data to the connection. HAProxy receives the data from RabbitMQ and writes the data back to the client.

What happens?

The source address for the connection from HAProxy back to the client is the VIP address. However the VIP address is no longer present on the host. This means that the network (IP) layer deems the packet unroutable, and informs the transport (TCP) layer. TCP, however, is a reliable transport. It knows how to handle transient errors and will retry. And so it does.

TCP Retries

TCP generally holds on to hope for a long time. A ballpark estimate is somewhere on the order of tens of minutes (30 minutes is commonly referenced). During this time it will keep probing and trying to deliver the data.

See RFC1122, section 4.2.3.5 for details on the specification and the linux documentation for the related tunables in the kernel.

It's important to note that HAProxy has no idea that any of this is happening. As far as its process is concerned, it called write() with the data and the kernel returned success.

Work Queues / RPC

Let's take a step back now and consider what impact this has to the users of messaging. RPC calls over AMQP are a common use case. What happens (simplified) is:

The RPC client sends the request message into a work queue.
RabbitMQ delivers the message to one of the workers consuming messages from the work queue. How this gets scheduled is configuration-dependent.
The worker does the work to service the request and sends the reply back to the client.

If the VIP has failed over, #2 can trigger the problematic behavior described above. RabbitMQ tries to write the RPC request to one of the workers that was connected through the VIP on the old host. This request gets stuck in TCP retries. The end result is that the client never gets its answer for the request and throws an error somewhere.

The problem is that the TCP retries behavior allows "dead" consumers to hang around in RabbitMQ's view. It has no idea that they're dead, so it will continue to send messages until the consumer disconnects—when the TCP retries gives up some tens of minutes later.

TCP_USER_TIMEOUT

Since Linux 2.6.37, TCP sockets have an option called TCP_USER_TIMEOUT. Here's the original commit message, since it describes it better than I could:

commit dca43c75e7e545694a9dd6288553f55c53e2a3a3
Author: Jerry Chu <hkchu@google.com>
Date:   Fri Aug 27 19:13:28 2010 +0000

    tcp: Add TCP_USER_TIMEOUT socket option.

    This patch provides a "user timeout" support as described in RFC793. The
    socket option is also needed for the the local half of RFC5482 "TCP User
    Timeout Option".

    TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
    when > 0, to specify the maximum amount of time in ms that transmitted
    data may remain unacknowledged before TCP will forcefully close the
    corresponding connection and return ETIMEDOUT to the application. If
    0 is given, TCP will continue to use the system default.

    Increasing the user timeouts allows a TCP connection to survive extended
    periods without end-to-end connectivity. Decreasing the user timeouts
    allows applications to "fail fast" if so desired. Otherwise it may take
    upto 20 minutes with the current system defaults in a normal WAN
    environment.

    The socket option can be made during any state of a TCP connection, but
    is only effective during the synchronized states of a connection
    (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
    Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
    TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
    connection due to keepalive failure.

    The option does not change in anyway when TCP retransmits a packet, nor
    when a keepalive probe will be sent.

    This option, like many others, will be inherited by an acceptor from its
    listener.

    Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

This is exactly what we want. If HAProxy writes something back to the client and it gets stuck retrying, we want the kernel to timeout the connection and notify HAProxy as soon as possible.

Fixing HAProxy

I talked to Willy Tarreau over at HAProxy about this behavior, and he came up with this patch which adds a new tcp-ut option, allowing the user to configure the TCP_USER_TIMEOUT value per listening socket. This is available in the 1.6 development branch of HAProxy, and is currently being backported to Fedora and RHEL OpenStack Platform.

Fixing Linux

I told you a little white lie before. Although Linux gained TCP_USER_TIMEOUT in 2.6.37, it was broken for some cases until 3.18. Importantly, one of those cases is the case I've discussed here. It doesn't handle the situation where the TCP layer has put packets into the output queue, but the packets weren't transmitted. That's the case when IP can't route it. So you'll need either a kernel >= 3.18, or one which has this patch backported and applied:

commit b248230c34970a6c1c17c591d63b464e8d2cfc33
Author: Yuchung Cheng <ycheng@google.com>
Date:   Mon Sep 29 13:20:38 2014 -0700

    tcp: abort orphan sockets stalling on zero window probes

    Currently we have two different policies for orphan sockets
    that repeatedly stall on zero window ACKs. If a socket gets
    a zero window ACK when it is transmitting data, the RTO is
    used to probe the window. The socket is aborted after roughly
    tcp_orphan_retries() retries (as in tcp_write_timeout()).

    But if the socket was idle when it received the zero window ACK,
    and later wants to send more data, we use the probe timer to
    probe the window. If the receiver always returns zero window ACKs,
    icsk_probes keeps getting reset in tcp_ack() and the orphan socket
    can stall forever until the system reaches the orphan limit (as
    commented in tcp_probe_timer()). This opens up a simple attack
    to create lots of hanging orphan sockets to burn the memory
    and the CPU, as demonstrated in the recent netdev post "TCP
    connection will hang in FIN_WAIT1 after closing if zero window is
    advertised." http://www.spinics.net/lists/netdev/msg296539.html

    This patch follows the design in RTO-based probe: we abort an orphan
    socket stalling on zero window when the probe timer reaches both
    the maximum backoff and the maximum RTO. For example, an 100ms RTT
    connection will timeout after roughly 153 seconds (0.3 + 0.6 +
    .... + 76.8) if the receiver keeps the window shut. If the orphan
    socket passes this check, but the system already has too many orphans
    (as in tcp_out_of_resources()), we still abort it but we'll also
    send an RST packet as the connection may still be active.

    In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
    sockets stalled on zero-window probes. This changes the semantics
    of TCP_USER_TIMEOUT slightly because it previously only applies
    when the socket has pending transmission.

    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
    Signed-off-by: David S. Miller <davem@davemloft.net>

I've proposed this for RHEL 7.

The (Hopefully Very Near) Future

Most of this behavior can be detected and avoided by using application layer AMQP heartbeats. However at this time the feature is not available in oslo.messaging, although this review has been close for a while now. Even with heartbeats, having TCP timeouts as another knob to tweak is a good thing, and it has more general applicability. I've had success using the timeouts to better detect inter-cluster failures of RabbitMQ nodes as well, but that will be a future post.

Categories
tech