
Debugging and solving timeouts after receiving data from a specific host in Openstack due to missing FIN packets

On a daily basis our DevOps Consultants are presented with challenges that require the sharpest of minds to solve. This is one such case.

IP addresses and host names have been anonymized in the output below.

Our client reported a strange issue within their Openstack cluster. HTTP requests to a certain third party service would wait sixty seconds before finally completing successfully with the expected data. A request that stalls for sixty seconds will usually result in a timeout from one of the hosts involved, so returning successfully is very unusual. Even with a successful response, a sixty-second GET within the end user's request flow still caused a bad experience (either timing out the user's request, or at least making them wait a very long time).

Looking at the affected requests, we trimmed the problem down to a simple reproduction case:

<?php
$url = 'https://api.example.com/simplerequest.php';
$result = file_get_contents($url);
print($result);
?>

Running this simple snippet from the PHP CLI on a VM reproduced the exact same behavior 100% of the time. There's nothing special going on here: a trivial GET request to a third party service via PHP. Experimenting with the case, we quickly found some more details:

  • PHP GET requests to many other hosts worked without issue
  • The same request in cURL returned immediately
  • The same request via PHP on an Openstack host (rather than a VM) returned immediately

These facts revealed a specific combination of factors, and changing any one of them would make the request work fine. Only the combination of making a request to this host, via PHP, on a VM in our cluster caused the strange stalls to occur.

Digging In

Next we began investigating our case with strace. We saw the process stall on a poll syscall (after reading the HTTP response).

poll([{fd=3, events=POLLIN|POLLPRI}], 1, 60000)

So, we knew that PHP had received the response, but was still polling on the socket. Looking at strace for the same request made through cURL we saw cURL manually closing the connection after reading the end of the HTTP response. So it wasn’t that PHP was doing something wrong, but that cURL was looking for the end of the response and closing the connection on its own.
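The difference between the two read strategies can be simulated with plain sockets. Below is a sketch (not PHP's or cURL's actual implementation): a local stand-in server answers completely but never closes the connection, a length-aware reader returns immediately, and a read-until-EOF reader waits out its full timeout:

```python
import socket
import threading
import time

BODY = b"hello"
RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n" + BODY

def misbehaving_server(listener):
    # Serve each connection with a complete response, then hold the
    # socket open forever: the client never receives a FIN.
    held = []
    while True:
        conn, _ = listener.accept()
        conn.sendall(RESPONSE)
        held.append(conn)  # never closed

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(5)
port = listener.getsockname()[1]
threading.Thread(target=misbehaving_server, args=(listener,), daemon=True).start()

def fetch(read_to_eof, timeout):
    c = socket.create_connection(("127.0.0.1", port))
    c.settimeout(timeout)
    c.sendall(b"GET / HTTP/1.1\r\nHost: x\r\nConnection: close\r\n\r\n")
    data, start = b"", time.monotonic()
    try:
        while True:
            chunk = c.recv(4096)
            if not chunk:  # empty read = EOF = the remote's FIN arrived
                break
            data += chunk
            # curl-style reader: stop once the declared body is complete
            # (a real client parses Content-Length; matching the tail is
            # enough for this sketch)
            if not read_to_eof and data.endswith(BODY):
                break
    except socket.timeout:
        pass  # the EOF-style reader gives up here
    c.close()
    return time.monotonic() - start

fast = fetch(read_to_eof=False, timeout=3)  # returns almost immediately
slow = fetch(read_to_eof=True, timeout=1)   # stalls for the full timeout
print(f"length-based: {fast:.2f}s  eof-based: {slow:.2f}s")
```

The timeouts are shortened from PHP's sixty-second default so the simulation finishes quickly.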

Next we looked at tshark output from the request:

 10.0.1.8 → 111.11.11.111 TCP 74 46518 → 443 [SYN] Seq=0 Win=28040 Len=0 MSS=1402 SACK_PERM=1 TSval=2024886279 TSecr=0 WS=12
111.11.11.111 → 10.0.1.8     TCP 66 443 → 46518 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1400 SACK_PERM=1 WS=1024
    10.0.1.8 → 111.11.11.111 TCP 54 46518 → 443 [ACK] Seq=1 Ack=1 Win=28160 Len=0
    10.0.1.8 → 111.11.11.111 TLSv1 571 Client Hello
111.11.11.111 → 10.0.1.8     SSL 1128 [TCP Previous segment not captured] , Continuation Data
    10.0.1.8 → 111.11.11.111 TCP 66 [TCP Window Update] 46518 → 443 [ACK] Seq=518 Ack=1 Win=30208 Len=0 SLE=1403 SRE=2477
111.11.11.111 → 10.0.1.8     TCP 1456 [TCP Out-Of-Order] 443 → 46518 [ACK] Seq=1 Ack=518 Win=67584 Len=1402
    10.0.1.8 → 111.11.11.111 TCP 54 46518 → 443 [ACK] Seq=518 Ack=2477 Win=33024 Len=0
    10.0.1.8 → 111.11.11.111 TLSv1.2 134 Change Cipher Spec, Application Data
   10.0.1.8 → 111.11.11.111  TLSv1.2 282 Application Data, Application Data, Application Data, Application Data
111.11.11.111 → 10.0.1.8     TLSv1.2 1426 Application Data, Application Data
   10.0.1.8 → 111.11.11.111  TCP 54 46518 → 443 [ACK] Seq=826 Ack=3849 Win=35840 Len=0

10.0.1.8 is our host, and 111.11.11.111 is the remote. This is a bit of terminal soup, but a careful reading will reveal that we never see a remote FIN packet. Because of this, PHP waits until the poll syscall above times out, then resets the connection from our end.

We already knew that the same request worked fine outside of our VMs. Running tshark directly on the VM’s compute host showed that indeed the remote does send a FIN packet. We just don’t receive it in the VM.

111.11.11.111 -> 10.0.1.8     TCP 60 https > 39084 [FIN, ACK] Seq=3841 Ack=826 Win=68608 Len=0

A Solitary Experience

With a more proximate cause in hand, we looked for reports of similar failures, searching for any bugs against Openstack or its components that mentioned partial but repeatable packet loss on certain connections. While there are many results for total packet loss, or complete unidirectional packet loss, those all stem from outright misconfigurations of Openstack where no connections work properly at all. In our case only one specific combination of host and client was misbehaving, while the thousands of requests made to other services for package installation or the normal operation of the app worked without issue. After an exhaustive search of the bug trackers for Openstack, Neutron, and Open vSwitch, it seemed we were alone in this experience.

This was especially surprising, as this cluster had been built with an off-the-shelf installer (packstack) with entirely default config options on a supported operating system. We'd intentionally chosen this vanilla happy path to avoid exactly this type of one-off bug. There was nothing special about how we'd built the cluster, yet we seemed to be the only ones experiencing this behavior.

Often when troubleshooting infrastructure problems, we're able to find resolutions to strange issues through the work of others who've run into them before us. However, as a case becomes more and more specific, it becomes less likely that someone has had the exact same problem. While there are many people running web services on Openstack, it's very unlikely that anyone else was making requests to this same third party service with a client similarly sensitive to missing FIN packets.

Creative Alternative Solutions

As an aside, there are a few ways to potentially work around the issue. We noted that cURL doesn't suffer from the stalled request, and PHP helpfully ships with cURL bindings. Rewriting the request in our repro case to use those bindings instead of file_get_contents did mitigate the issue. However, the application made many calls to various services, and all of them would have to be rewritten against this separate API. Our client, understandably, preferred not to pursue this route, as it required extensive changes to very sensitive parts of the application.

Another approach could be to build a patched PHP which handled the missing FIN packet better by terminating the connection once the ending line feeds of the response came through. This would require familiarity with the PHP implementation, then maintaining our own patched PHP version and building it for any VMs in our infrastructure that needed it. And all of this would only obscure the underlying problem: we were in fact losing packets.

Under tighter time constraints, either of these could have been used as band-aids to buy time while investigating a solution. In this case, fortunately, we had time to find the root cause and fix the problem in a more general way.

Narrowing the Search

Knowing that our missing packets made it onto the compute hosts, but not onto the VM, limited the list of suspects. Further tshark investigation showed the packets were present on the Geneve interface (which joins compute nodes to the SDN) but missing on the VM's TAP interface (a virtual interface connected directly to QEMU). Somewhere between Geneve and the TAP, our packets were being dropped.

This also ruled out the other usual networking suspects. The path between these interfaces is entirely local, meaning this wasn't the result of network congestion, bad firewall or routing rules at our data center, or a faulty hardware NIC.

In fact, both of these interfaces are connected by a single component: the Open vSwitch software-defined bridge. In short, the expected path for a packet to reach a VM is:

  1. Networking host (Openstack master server)
  2. Compute host’s Geneve interface
  3. OpenFlow (within OpenVSwitch) routing
  4. TAP Interface
  5. Guest OS on the VM

We had narrowed the problem down to step 3: Open vSwitch and its OpenFlow routing. Open vSwitch has two components relevant here: a kernel module and a user-space daemon. The daemon exposes an API for managing the SDN topology, along with statistics about that topology.

Looking at the routing statistics, we could see that packets headed out of Open vSwitch to the TAP interface had been dropped. We already suspected as much, but the statistics confirmed it. Packets can be dropped by OpenFlow if they don't match an established connection, or if they're explicitly rejected by one of the flow rules. A misconfigured flow rule seemed unlikely, as it would surely have affected other users. After carefully reviewing the flow rules to be certain, we turned to connection tracking.

To see if packets were being dropped by conntrack (a kernel subsystem which keeps a record of ongoing connections) we enabled kernel debug messages in conntrack:

echo "255" > /proc/sys/net/netfilter/nf_conntrack_log_invalid

There are other, more modern ways of enabling debug output in the kernel (debugfs), but older mechanisms like conntrack still use procfs to control logging behavior.

Re-running our repro case yielded clear output in dmesg: packets were being dropped due to checksum failures.

The Fix and Root Cause

With a clear failure mode and a point of failure in the packet routing flow, we started looking for mitigations. Reading through the sysctl parameters related to conntrack, we found net.netfilter.nf_conntrack_checksum. The name is a bit of a misnomer: the parameter dictates whether conntrack will verify checksums, and thus whether control packets (like SYN and FIN) with invalid checksums are still used to determine the connection state.

We set the parameter to zero via sysctl, ran our repro case, and were met immediately with a successful call. We were thrilled to have fixed the reproduction case, especially with a sysctl parameter that wouldn’t require any production downtime to apply. Still, we wondered how our packets, specifically just the ACK and FIN packets, were given invalid checksums.

A bit of searching quickly found this Redhat KB page: https://access.redhat.com/solutions/4003091

Intermittent packet drops in OpenvSwitch with Conntrack ACL

Currently, in the OVS conntrack receive path, ovs_ct_execute() pulls the skb to the L3 header but does not trim it to the L3 length before calling nf_conntrack_in(NF_INET_PRE_ROUTING).

When nf_conntrack_proto_tcp encounters a packet with lower-layer padding, nf_ip_checksum() fails causing a nf_ct_tcp: bad TCP checksum log message.

In plain English, there was a bug in the version of the OVS kernel module we used that improperly checksummed packets shorter than 60 bytes. That's why SYN packets (with extended headers) and PSH packets (with hundreds of bytes of data) were unaffected, while bare control packets like ACK and FIN were dropped. Additionally, most modern hosts send packets with extended TCP headers, pushing all of their packets above the 60-byte threshold. This third party service, likely due to legacy systems, used shorter TCP headers, leaving some of its packets below that line.
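The arithmetic behind that failure can be sketched in a few lines of Python (an illustration, not the kernel's code; the header field values are made up to loosely match the capture above). The TCP checksum covers a pseudo-header that includes the segment length, so a verifier that takes its length from the padded frame instead of the IP header computes a different sum, even though the padding bytes are all zeros:

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip: bytes, dst_ip: bytes, tcp_len: int) -> bytes:
    # src addr, dst addr, zero byte, protocol 6 (TCP), TCP segment length
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, tcp_len)

def verify(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    # A valid segment sums to 0xFFFF with its checksum field included
    return ones_complement_sum(
        pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0xFFFF

src, dst = bytes([10, 0, 1, 8]), bytes([111, 11, 11, 111])

# A minimal 20-byte TCP header for a bare ACK (field values illustrative),
# with the checksum field zeroed so we can compute it
tcp = bytearray(struct.pack("!HHIIBBHHH", 46518, 443, 1, 1, 5 << 4, 0x10, 28160, 0, 0))
csum = (~ones_complement_sum(pseudo_header(src, dst, len(tcp)) + bytes(tcp))) & 0xFFFF
struct.pack_into("!H", tcp, 16, csum)  # write the checksum into the header

print(verify(src, dst, bytes(tcp)))  # True: valid at the true TCP length

# 14 (Ethernet) + 20 (IP) + 20 (TCP) = 54 bytes, padded to the 60-byte
# Ethernet minimum with six zero bytes. The zeros add nothing to the sum,
# but a verifier that takes the length from the padded frame puts the wrong
# TCP length in the pseudo-header, and the same packet looks corrupt.
print(verify(src, dst, bytes(tcp) + b"\x00" * 6))  # False: flagged as bad
```

Note that the FIN frame in the compute-host capture above is reported at 60 bytes, right at that Ethernet minimum.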

A fixed kernel wasn't available to us, as we were already running the latest for our OS version. So, as a low-impact solution, we applied the sysctl parameter permanently via sysctl.d, which has resolved all user reports of the issue.
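For reference, the runtime change was sysctl -w net.netfilter.nf_conntrack_checksum=0, and the permanent form is a one-line drop-in under /etc/sysctl.d (the filename below is our own choice):

```ini
# /etc/sysctl.d/98-conntrack-checksum.conf  (hypothetical filename)
# Don't let conntrack discard state-tracking packets over bad checksums;
# works around padding being included in OVS's checksum verification.
net.netfilter.nf_conntrack_checksum = 0
```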

Final Thoughts

Issues like this are pernicious, as they surface at the application level but stem from deep within the kernel. Tracing the problem all the way through requires an understanding of every part of the stack, from the application and its runtime to the hypervisor and the SDN, and even into the kernel itself. The tools needed to observe behavior at those layers are not something most engineers use daily, and applying them effectively requires substantial context and technical understanding of the domain. Understanding the links between the layers of the problem requires a gestalt view of the system. Without an engineering team to diagnose this issue, our client would have had to absorb the cost of another migration or a massive application rewrite, or leave users facing sixty-second stalls while interacting with their web service.

 

