An awkward interaction between lazy ByteStrings and a misbehaving (non-)transparent HTTP middlebox

Ben Clifford, benc@cqx.ltd.uk

London Haskell User Group, May 2015


Initial symptoms

On a particular network, apparently 100% reproducible:

$ cabal update
Downloading the latest package list from hackage.haskell.org
cabal: Codec.Compression.Zlib: premature end of compressed stream

This error is reported occasionally - a Google search turns up several instances - but there hasn't been a clear solution.

Cabal is the build tool that everyone here probably loves/hates.

zlib is a compression library, and what it is saying is that it was given some compressed data that stopped unexpectedly.

I dug into this, and what is going on - at least in this particular case - has nothing to do with cabal or zlib specifically.

The same problem can be reproduced without cabal:

Small test case that reproduces the problem

import Network.Browser
import Network.HTTP.Base
import Network.URI
import Data.Maybe
import qualified Data.ByteString.Lazy as B

main = do
  r <- browse $ request req
  print r
  putStrLn $ "ByteString reports length as " ++ (show $ B.length $ rspBody $ snd r)

req :: Request B.ByteString
req = mkRequest GET (fromJust $ parseURI "http://hackage.haskell.org/packages/index.tar.gz")

On most networks: ByteString reports length as 9680487 (9MB)

On broken network: ByteString reports length as 80000 (80kB) - varies non-deterministically

We don't even need to talk to hackage to see this: fetching any sufficiently large file demonstrates the problem.

So it looks like somewhere in the bowels of Network.HTTP, the response is being truncated.

Strict ByteStrings make this work

import Network.Browser
import Network.HTTP.Base
import Network.URI
import Data.Maybe
import qualified Data.ByteString {- .Lazy -} as B

main = do
  r <- browse $ request req
  print r
  putStrLn $ "ByteString reports length as " ++ (show $ B.length $ rspBody $ snd r)

req :: Request B.ByteString
req = mkRequest GET (fromJust $ parseURI "http://hackage.haskell.org/packages/index.tar.gz")

Changing from lazy bytestrings to strict ones makes this work...

Maybe there's a race condition here? In my mind, laziness in an I/O context is associated with those.

Let's put that to one side now and talk about...

The internet protocol stack (at least for this talk)

We've got several protocols in a stack. At the bottom is IP, the internet protocol; then TCP the transmission control protocol; and above that HTTP, the Hypertext Transfer Protocol.

Each layer provides services used by the layer above.

At the bottom, IP delivers datagrams, smallish packets of data, across the internet without making many guarantees about how they are going to be delivered.

Above that, TCP provides a reliable, ordered byte stream between two computers.

And HTTP deals with things like retrieving contents from URLs, over TCP.
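To make that concrete, here is roughly what an HTTP request looks like as bytes on the TCP stream - a sketch using the URL from the test case above:

```haskell
-- A sketch of the literal bytes an HTTP/1.1 GET puts on the TCP stream.
-- TCP just sees an opaque sequence of bytes; the structure (request
-- line, headers, blank line) is purely an HTTP-level convention.
httpRequestBytes :: String
httpRequestBytes =
  "GET /packages/index.tar.gz HTTP/1.1\r\n" ++
  "Host: hackage.haskell.org\r\n" ++
  "\r\n"   -- blank line marks the end of the headers
```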

What things should look like

The internet should just pass IP packets back and forth between the stacks, without knowing anything about the higher level TCP and HTTP protocols.

This is called the end-to-end principle.

What things really look like

On the troublesome network, there is a middlebox between the client and the internet. This middlebox intercepts all of the HTTP traffic by acting as an HTTP server, and making requests to the real target server on your behalf.

Reasons to do this include forcing caching, filtering malware, and censoring undesirable content.

So this is part of the explanation of why the behaviour is different on this specific network: the web server we are talking to is a different web server, and behaves differently.

Difference in HTTP response

From a packet dump:

HTTP/1.0 200 OK
Server: nginx/1.6.2
Content-Type: application/x-gzip
Cache-Control: public, no-transform, max-age=300
Content-MD5: 74e35e2d82cbc38feab6ef1486bb30d1
ETag: "74e35e2d82cbc38feab6ef1486bb30d1"
Last-Modified: Tue, 21 Apr 2015 12:55:42 GMT
Content-Length: 9364324
Accept-Ranges: bytes
Date: Tue, 21 Apr 2015 13:20:43 GMT
Age: 67
X-Served-By: cache-lhr6335-LHR
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1429622443.670970,VS0,VE43
X-Cache: MISS from localhost
Connection: close

The HTTP response from the middlebox looks a bit different from the HTTP response from the hackage server.

The relevant part here is this Connection: close header. It triggers a different code path in the Haskell HTTP client library.

Normally a TCP connection can be re-used for many HTTP requests in a row; Connection: close means that a TCP connection should only be used for one HTTP request/response and then closed; and new HTTP requests should happen on new TCP connections.
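For illustration, the decision a client has to make might look like this - a hypothetical helper, not the actual Network.HTTP code (header values are case-insensitive, so we normalise before comparing):

```haskell
import Data.Char (toLower)

-- Hypothetical sketch: given the value of a response's Connection
-- header (if any), decide whether the TCP connection must be closed
-- after this request/response rather than reused.
wantsClose :: Maybe String -> Bool
wantsClose (Just v) = map toLower v == "close"
wantsClose Nothing  = False   -- no header: HTTP/1.1 defaults to keep-alive
```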

This close option is pretty rare, so I wondered if there was a bug in Network.HTTP related to this.

So I configured my test web server to disable keep alives and return a Connection: close header, to see if I could reproduce this away from the misbehaving network. I couldn't, but this is still relevant.

So let's dig a bit deeper down the stack into TCP behaviour. HTTP uses TCP to provide a reliable stream of data between computers; and TCP does that by sending packets using IP.

So what does that look like?

TCP connection transferring HTTP (strict bytestring)

... but this looks different when we use the lazy bytestring implementation ...

TCP connection transferring HTTP (lazy bytestring)

There's a fairly subtle change here: the client to server FIN is sent right around the time the HTTP response is delivered. On most networks, this is fine - the client to server half of the connection is closed, but the server to client half of the connection is still open and our data still arrives.

But in the case of the misbehaving network, the HTTP session gets terminated pretty much as soon as this FIN arrives - the middlebox web server is (mis)interpreting the FIN to mean "close the connection right now, stop stop stop!!!". The connection closes and Network.HTTP assumes that is all the data.

This is, I think, a bug in the middlebox, and the only actual bug in all of this.
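The half-close behaviour that the middlebox gets wrong can be demonstrated with a socket pair: shutting down the sending side of one socket (on a TCP connection, this is what puts the FIN on the wire) should leave the other direction fully usable. A sketch using the network package:

```haskell
import Network.Socket
import qualified Network.Socket.ByteString as NB
import qualified Data.ByteString.Char8 as C

main :: IO ()
main = do
  (a, b) <- socketPair AF_UNIX Stream defaultProtocol
  -- 'a' closes its sending half; on a TCP socket this is what sends
  -- the FIN. A correct peer treats this as a half-close only.
  shutdown a ShutdownSend
  -- The other direction is still open: b can send, a can receive.
  _ <- NB.send b (C.pack "still open")
  msg <- NB.recv a 1024
  C.putStrLn msg
  close a
  close b
```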

There is one other bit of misbehaviour, this time on the part of Network.HTTP:

Ignoring Content-length header

HTTP/1.0 200 OK
Server: nginx/1.6.2
Content-Type: application/x-gzip
Cache-Control: public, no-transform, max-age=300
Content-MD5: 74e35e2d82cbc38feab6ef1486bb30d1
ETag: "74e35e2d82cbc38feab6ef1486bb30d1"
Last-Modified: Tue, 21 Apr 2015 12:55:42 GMT
Content-Length: 9364324
Accept-Ranges: bytes
Date: Tue, 21 Apr 2015 13:20:43 GMT
Age: 67
X-Served-By: cache-lhr6335-LHR
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1429622443.670970,VS0,VE43
X-Cache: MISS from localhost
Connection: close

We've got a Content-Length header telling us how long the response body is, in bytes. Network.HTTP could have recognised that this was not the same as the number of bytes it actually received, and thrown an error of some kind.

This wouldn't fix our high-level problem, but it might have given more useful clues for debugging.
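A sketch of the kind of sanity check that could have been made here - a hypothetical helper, not actual library code:

```haskell
-- Hypothetical sketch: compare a response's declared Content-Length
-- against the number of body bytes actually received, and complain
-- if the body appears truncated.
checkContentLength :: Maybe Integer -> Integer -> Either String ()
checkContentLength (Just declared) received
  | declared /= received =
      Left ("Content-Length says " ++ show declared ++
            " bytes, but received " ++ show received)
checkContentLength _ _ = Right ()   -- no header, or lengths agree
```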

confluence of factors

But why is this happening in the Haskell code with lazy ByteStrings? Let's dig into the source of Network.HTTP and Network.TCP.

Client pseudocode

sendHTTPRequest conn req
resp <- getHTTPResponse conn
when (connectionClose resp)
   $ close conn
return (body resp)

This is pseudocode for what happens in the client. Let's plot this code against the packet trace graphs, for strict bytestrings and for lazy bytestrings.

Packet trace vs pseudocode

So in the strict case, everything gets read before we start looking at headers and deciding to close.

But, in the lazy case, we only force as much to be read as we need: asking whether there is a Connection: close header forces enough of the headers to decide that, and then we close, leaving the rest of the body to be read lazily. And that evaluation order manifests as on-the-wire behaviour.
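The same shape of bug can be demonstrated with nothing but lazy I/O on a local file - an analogy, not the actual HTTP code: force a little of the lazily-read contents, close the handle, and the unread remainder is silently lost.

```haskell
import System.IO
import Control.Exception (evaluate)

main :: IO ()
main = do
  writeFile "body.tmp" (replicate 100000 'x')
  h <- openFile "body.tmp" ReadMode
  s <- hGetContents h                 -- lazy: nothing is read yet
  _ <- evaluate (length (take 5 s))   -- force only the first 5 characters
  hClose h                            -- close before the rest is forced
  print (length s)                    -- the unread tail is silently gone
```

With GHC this prints 5: only what was forced before the close survives, with no error - just like the truncated response body.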

solutions?

Related: