The curious case of statsd and netcat
Posted on 17 May 2013
At GDS we are using statsd, a great tool from Etsy for aggregating statistics and submitting them to Graphite. I was interrogating statsd about which counters it currently knew about by piping the output of an echo command into netcat, as suggested in the statsd management interface documentation:
$ echo counters | nc localhost 8126
This worked most of the time, giving output such as:
$ echo counters | nc localhost 8126
{ 'statsd.packets_received': 1,
'statsd.bad_lines_seen': 1,
foo: 0,
bar: 0,
baz: 1337 }
END
However, occasionally it would fail, giving no output at all:
$ echo counters | nc 10.0.0.1 8126
$
and moreover, whenever this happened, statsd had died:
events.js:72
throw er; // Unhandled 'error' event
^
Error: write EPIPE
at errnoException (net.js:883:11)
at Object.afterWrite (net.js:700:19)
What was going on? To find out, I had to go on a journey through the low-level internals of TCP.
TCP basics
TCP is a transport-layer protocol of the internet. It presents an abstraction which appears to be a bidirectional continuous stream of bytes sent between two nodes on a network. This link is known as a connection, and is uniquely identified by four things: the source address and port, and the destination address and port.
A TCP connection is bidirectional; many of the control aspects of TCP can be understood as applying to one side or the other of the connection. For example, consider the well-known "three-way handshake":
- Client sends SYN
- Server sends SYN/ACK
- Client sends ACK
This can be seen as two separate channels being created. First, the client sends a SYN to set up the client→server channel, which the server ACKs. The server then sends a SYN to set up the server→client channel, which is ACKed by the client. The server's ACK of the client→server channel can be sent at the same time as the SYN to create the server→client channel, shortening this from four steps to three. (This mental model, of two independently created channels, also works for the lesser-used "simultaneous open" mode: both peers send SYN packets simultaneously, and both respond with SYN/ACKs simultaneously. Since both sides of the connection have now been ACKed, the connection is established.)
Similarly, a TCP connection is torn down one side at a time. Alice sends a FIN packet to Bob to state that she will not send any more data along her side of the connection. However, Bob is free to continue to send data back to Alice indefinitely, and the connection does not need to close until Bob sends his FIN packet to terminate his side. It looks like this:
- Alice and Bob have an established connection
- Alice sends FIN to Bob, which Bob ACKs
- Bob continues to send data to Alice
- Bob finally sends FIN to Alice
- Alice receives Bob's FIN, sends ACK
- Bob receives Alice's ACK
- Connection is now closed (I'm ignoring TIME-WAIT for simplicity)
The important thing here is the possibility of a "half closed" connection: one where Alice has closed her side but Bob has not closed his side. Alice can no longer talk to Bob, but Bob can talk to Alice.
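As a rough illustration of a half-closed connection, here is a minimal Python sketch over the loopback interface. The function names `alice`, `bob` and `demo` are made up for this example; the only real API involved is the standard `socket` module, where `shutdown(socket.SHUT_WR)` closes the sending side while leaving the receiving side open.

```python
import socket
import threading


def bob(listener: socket.socket) -> None:
    """Bob: read until Alice's FIN (seen as EOF), then reply before closing."""
    conn, _ = listener.accept()
    with conn:
        received = b""
        while True:
            chunk = conn.recv(1024)
            if not chunk:  # recv() returns b"" once Alice's FIN has arrived
                break
            received += chunk
        # Alice's side is closed, but Bob's side is still open for writing.
        conn.sendall(b"bob got: " + received)
    # Leaving the with-block closes the socket, which sends Bob's FIN.


def alice(port: int) -> bytes:
    """Alice: send a request, half-close, then read Bob's reply until EOF."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(b"counters\n")
        sock.shutdown(socket.SHUT_WR)  # send FIN; the receive side stays open
        reply = b""
        while True:
            chunk = sock.recv(1024)
            if not chunk:  # Bob's FIN
                break
            reply += chunk
        return reply


def demo() -> bytes:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    thread = threading.Thread(target=bob, args=(listener,))
    thread.start()
    try:
        return alice(port)
    finally:
        thread.join()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

Bob only starts writing after he has seen Alice's FIN, so the reply can only arrive if the connection really is half-closed rather than fully closed.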
Incidentally, a TCP implementation MAY perform a "half-duplex" close, where Alice tears down her whole connection without waiting for Bob's FIN packet. If Bob sends more data, Alice will not receive it. To signal this, her TCP stack sends a RST packet to Bob, telling him that the data was lost. This is documented in RFC 1122, section 4.2.2.13.
Statsd's management port
This brings us to statsd's management port. Statsd normally receives UDP packets containing event data on port 8125, but it can expose a TCP management interface on port 8126 to issue queries and commands operating on statsd's internal state. The port is a simple TCP connection, which accepts multiple commands across the lifetime of a connection. Here is an example session:
$ nc 10.0.0.1 8126
counters
{ 'statsd.packets_received': 0,
'statsd.bad_lines_seen': 0,
foo: 0 }
END
delcounters foo
deleted: foo
END
counters
{ 'statsd.packets_received': 0, 'statsd.bad_lines_seen': 0 }
END
Here, I issued three commands: counters, delcounters foo, and counters once more to show the effect. Statsd responded with output to each of the commands in turn.
The fact that statsd accepts multiple commands during the life of one connection means that statsd does not automatically close the connection; it only closes the connection when the client closes it.
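A client that wants to issue several commands over one connection needs some way to tell where each reply stops. The sketch below assumes, based only on the example session above, that every reply finishes with a line containing just END. The helper names (`read_reply`, `run_commands`) and the `fake_statsd` stand-in server are hypothetical and exist only to make the example runnable; this is not statsd's own code.

```python
import socket
import threading


def read_reply(sock: socket.socket, terminator: bytes = b"END") -> str:
    """Read one reply, stopping once the last non-blank line is the terminator."""
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("connection closed before the terminator arrived")
        buf += chunk
        lines = buf.rstrip().split(b"\n")
        if lines[-1].strip() == terminator:
            # Drop the terminator line and any surrounding blank lines.
            return b"\n".join(lines[:-1]).decode().strip()


def run_commands(sock: socket.socket, commands: list) -> list:
    """Send each command on the same connection and collect the replies."""
    replies = []
    for command in commands:
        sock.sendall(command.encode() + b"\n")
        replies.append(read_reply(sock))
    return replies


def fake_statsd(conn: socket.socket) -> None:
    """Hypothetical stand-in server: echo each line back, then END; stop at EOF."""
    with conn, conn.makefile("rb") as lines:
        for line in lines:
            conn.sendall(b"you said: " + line + b"END\n\n")


def demo() -> list:
    client, server = socket.socketpair()
    thread = threading.Thread(target=fake_statsd, args=(server,))
    thread.start()
    with client:
        replies = run_commands(client, ["counters", "delcounters foo"])
    thread.join()  # the server loop ends once the client has closed
    return replies


if __name__ == "__main__":
    print(demo())
```

Note that the server side only stops when the client closes the connection, which mirrors the behaviour described above.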
The problem
Returning to the original problem: I issued a command to statsd using echo and netcat. I got no output from statsd, and statsd also crashed:
$ echo counters | nc 10.0.0.1 8126
$
It turns out the problem is that the netcat I was using was aggressively closing the connection in "half-duplex" fashion: it would send the "counters" packet, send a FIN to indicate it was done, then quit. By the time statsd had responded with its data, netcat wasn't listening anymore, and the OS responded to statsd with a RST. Statsd didn't handle this error, and bailed. (This was fixed in 324267c, in 0.6.0).
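The failure can be approximated without statsd or netcat. The following Python sketch plays both roles on the loopback interface: a client that sends its command and closes straight away, and a server that replies a little too late. The function names are invented for this example, the sleeps are arbitrary values chosen to let the RST arrive, and the exact error reported depends on the operating system and on timing.

```python
import errno
import socket
import threading
import time


def impatient_client(port: int) -> None:
    """Send the command, then close fully without waiting for any reply."""
    sock = socket.create_connection(("127.0.0.1", port))
    sock.sendall(b"counters\n")
    sock.close()  # FIN goes out and nothing is left to read the reply


def write_after_client_quit() -> str:
    """Return the name of the errno the late server write fails with."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    client = threading.Thread(target=impatient_client, args=(port,))
    client.start()
    conn, _ = listener.accept()
    try:
        conn.recv(1024)  # the "counters" command
        client.join()    # by now the client has closed its socket
        time.sleep(0.2)
        try:
            # The first write reaches a closed socket; the client's OS answers RST.
            conn.sendall(b"{ foo: 0 }\n")
            time.sleep(0.2)  # give the RST time to arrive
            # Writing again after the RST is what fails.
            conn.sendall(b"END\n")
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return "no error"
    finally:
        conn.close()
        listener.close()


if __name__ == "__main__":
    print(write_after_client_quit())
```

On Linux this would normally surface as EPIPE, the same error as in the statsd stack trace, although other systems may report a connection reset instead. A server that does not catch the error from a write like this will crash in much the same way statsd did.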
One way I thought I could make it work was using netcat's -q switch. This tells netcat to wait for a number of seconds before quitting. However, this also delays netcat sending its FIN packet, which means that the connection won't close until the end of the timeout. This means that if I set a high timeout, such as 5 seconds, the command will always take at least 5 seconds; on the other hand, if I set a low timeout, such as 1 second, I run the risk of netcat quitting before it receives the expected data. What I want, however, is for netcat to send the FIN as soon as it reaches EOF on stdin, but to quit after 5 seconds even if statsd hasn't closed its side of the connection. This way, it will close quickly if statsd responds quickly, but it will time out if statsd is too slow.
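The behaviour described here (FIN as soon as the input is exhausted, then wait for the reply with an upper bound) can be sketched in a few lines of Python. This is an illustration under assumptions rather than a replacement for netcat: `query` is a made-up helper, the 5 second default is just the value from the discussion above, and `fake_statsd` is a hypothetical stand-in that replies and then closes once it sees the client's FIN.

```python
import socket
import threading
import time


def query(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send one command, half-close, then read until EOF or until the deadline."""
    deadline = time.monotonic() + timeout
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode() + b"\n")
        sock.shutdown(socket.SHUT_WR)  # send FIN now, like reaching EOF on stdin
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # out of time: give up even though the server is still open
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                break
            if not chunk:
                break  # the server closed its side, so the reply is complete
            chunks.append(chunk)
    return b"".join(chunks).decode()


def fake_statsd(listener: socket.socket) -> None:
    """Hypothetical stand-in server: reply once, then close after the client's FIN."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)  # the command
        conn.sendall(b"{ foo: 0 }\nEND\n\n")
        while conn.recv(1024):
            pass  # wait for the client's FIN


def demo() -> str:
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    thread = threading.Thread(target=fake_statsd, args=(listener,))
    thread.start()
    try:
        return query("127.0.0.1", listener.getsockname()[1], "counters")
    finally:
        thread.join()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

With a responsive server the call returns as soon as the server closes its side; the timeout only matters when the server hangs.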
At this point, I started experimenting with different netcat implementations on different operating systems. It turns out different netcats actually behave differently in these circumstances.
Here is an evaluation of the systems I tried using echo counters | nc localhost 8126 on, both with and without -q 5. There is no consistency.
Netcat version | Without -q | With -q 5 |
---|---|---|
Ubuntu 10.04 (OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2)) | Sends contents of stdin + FIN; quits immediately, doesn't wait for response | Sends contents of stdin, waits 5 seconds, sends FIN and quits |
Debian 6.0 (nc [v1.10-38]) | Sends contents of stdin, waits for response. Never sends FIN (i.e. leaves connection open). | Sends contents of stdin + FIN; waits for either FIN from statsd, or 5 seconds, whichever comes first, and quits |
Mac OS X 10.6 (nc -h gives no version info) | Sends contents of stdin + FIN; waits for statsd to respond and close connection before quitting | Does not support the -q option. |
All three netcat implementations I tested behaved differently in some way. The OS X nc seems most convenient -- it always closes the outgoing connection fast, but waits for the incoming data rather than quitting immediately. However, the lack of a timeout is dangerous -- without one, if statsd hangs, you will hang too. The Ubuntu nc is most painful -- you need to guess a timeout, but since you will always wait for the full timeout, you are punished for allowing a safety margin. And the Debian nc is inconsistent: sometimes it closes the outgoing connection fast, and sometimes it doesn't, depending on whether you set the -q option.
Overall, the most convenient for screwing around is the OS X nc; but I would suggest the most robust usage is Debian nc with a timeout set.
The Ubuntu nc is singularly unfit for usage with echo in this way. What's amusing is that its man page even recommends this usage:
$ echo -n "GET / HTTP/1.0\r\n\r\n" | nc host.example.com 80
The man page example also omits the -e option that echo needs in order to interpret the \r\n escape sequences. Even with that corrected, if you try this with the Ubuntu nc against any but the fastest server, you won't get a response:
$ # ubuntu 10.04
$ echo -en "GET / HTTP/1.0\r\n\r\n" | nc www.google.com 80
$
As compared to the expected behaviour:
$ # debian 6.0
$ echo -en "GET / HTTP/1.0\r\n\r\n" | nc www.google.com 80
HTTP/1.0 302 Found
Location: http://www.google.co.uk/
Cache-Control: private
Content-Type: text/html; charset=UTF-8
#...etc...
Summary
There are many flavours of nc out there, each with slightly different treatments of how to close a TCP connection. If you're getting unexpected behaviour from piping echo into netcat, it may be due to odd connection teardown in your netcat.
Acknowledgements
Thanks to bob for helping dig some of this stuff out, and for suggesting I try different netcat implementations. Thanks to the Stanford networking course for teaching me solid foundations to draw on while investigating this.