Philip Potter

  • Refactoring systems

    Posted on 21 February 2014

    “Refactoring is the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure.” – Martin Fowler, in Refactoring

    As a developer who fell into operations work by accident, I often relate what I see in production systems to my experiences manipulating code in an editor. One thing I’ve been thinking about recently is refactoring. When I was working on Java day-to-day, I would routinely use IntelliJ IDEA’s refactoring support: extract method, extract class, rename method, move method. Refactoring was an intrinsic part of good design: you never hit the perfect design first try, so you rely on refactoring to gradually improve your code from the first slap-dash implementation.

    It actually took some time before I read the original refactoring book. If you haven’t read it, go read it now! Even 14 years after it was written, it’s still relevant. It describes how to do refactoring when you don’t have fancy refactoring tools: through a series of small, incremental changes, which at no point cause the code to break, you can evolve from a poor design towards a better design.

    If I have one criticism of the book, it’s that it doesn’t reach far enough. The book’s subtitle is “Improving the Design of Existing Code”; but I think it should have been “Improving the Design of Existing Systems”. The approach of taking small, incremental changes to an existing system, to improve the overall design while at no point breaking anything, is by no means limited to code. This is acknowledged by the 2007 book Refactoring Databases; but it also applies to other aspects of system design such as DNS naming, network topology and mapping of functions to particular applications.

    The importance of maintaining a working system throughout a refactoring is high when developing, but it’s critical when modifying a production system. After all, when you’re developing code, the worst that can happen is a failing test, but the consequences in production are much more severe. And you don’t get any of IntelliJ’s automagical refactoring tools to help you.

    Sometimes you will need to make a deployment to production in between steps of a refactoring: perhaps to ensure you have introduced a new service name before configuring any consumers of that service to use it. As a result, refactoring whole systems can be slower and more laborious than refactoring code. Nevertheless, it’s the only way to improve the design of an existing system, short of building an entirely new production system in parallel and switching traffic over to it.

    Here are some recent refactorings I have performed on production systems:

    Rename service

    Problem: the name of a service, as it appears in DNS, does not describe its purpose.

    Solution: introduce a new name for the service. Gradually find each of the service’s consumers, and reconfigure them to use the new name. When the old name is no longer needed, remove it.
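
    As a sketch of the first step, assuming BIND-style zone files and entirely hypothetical names, the new name can start life as a CNAME for the old one:

    ; hypothetical zone fragment for example.internal
    ; the new, purpose-describing name points at the old one; once all
    ; consumers have migrated, frontend can take the A record directly
    ; and legacy-app can be removed
    frontend    IN CNAME  legacy-app.example.internal.
    legacy-app  IN A      192.0.2.10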

    Rename entire domain

    Problem: the name of a domain does not describe its purpose.

    Solution: introduce a new domain as a copy of the old. Modify the search domain of resolvers in that domain to search the new domain as well as the old. Find any references in configuration to fully-qualified domain names mentioning the domain, and change them to use the new domain. When no more consumers reference the old domain name, remove it from the search path, and finally remove the old domain entirely.

    I recently performed this dance to rename an entire environment from “test” to “perf”, to give it a more intention-revealing name. Using shortnames as much as possible, rather than fully-qualified names, made the job much easier. (There are other reasons for using shortnames: by standardizing on particular shortnames for services, you reduce the amount of environment-specific configuration needed.)
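
    To make the transition state concrete: while both domains exist, a resolver inside the environment might have a search path like this (domain names invented for illustration), so that shortname lookups such as graphite resolve through either domain:

    $ cat /etc/resolv.conf
    search perf.example.internal test.example.internal
    nameserver 127.0.0.1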

    Change IP address of service

    Problem: you want to shift a service from one public IP address to another.

    Solution: introduce a route from the new public IP to the service. Test the service works at the new IP – the curl --resolve option is very useful for testing this without having to modify DNS or /etc/hosts entries. Ensure any firewall rules which applied to the old IP address are copied to apply to the new address. When certain that the new IP address works, change your public DNS record from the old IP address to the new. Wait for the TTL to expire before using the old IP address for any other purpose. Finally, remove any stale firewall rules referring to the old address.
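
    For example, to test a vhost at the new address (made up here) without touching DNS:

    $ curl --resolve www.example.com:443:203.0.113.7 https://www.example.com/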

    This might seem like an odd thing to want to do, so here’s the context: in one of our environments, we had three public IP addresses listening on port 443. I wanted to expose a new service on port 443, but restricted at the firewall to only certain source addresses, so it couldn’t be a shared virtual host on an existing IP. One of our IPs was used by a reverse proxy with several virtual hosts proxying various management services – graphite, sensu, kibana. Our CI server, however, was occupying a public IP all to itself. If I created a new virtual host on our reverse proxy to front the CI server, I could serve it from the same IP address as our other management services, freeing up a public IP for a new service with separate access control restrictions.

    Merge sensu servers

    Problem: you have two sensu servers monitoring separate groups of clients, when you would rather have one place to monitor everything.

    Solution: choose one of the servers to keep. Reconfigure the clients of the other server to point at the original server. Remove the now-unused sensu server.

    Note: this refactoring assumes that the two servers have equivalent configuration in terms of checks and handlers, which in our situation was true. If not, you may need to do some initial work bringing the two sensu servers into alignment, before reconfiguring the clients.
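
    With sensu’s RabbitMQ transport, repointing a client is a small configuration change; a sketch with hypothetical hostnames and credentials:

    $ cat /etc/sensu/conf.d/rabbitmq.json
    {
      "rabbitmq": {
        "host": "sensu-kept.example.internal",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "..."
      }
    }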

    Remove intermediate http proxy

    Problem: On your application servers, you are using nginx to terminate TLS and reverse proxy a jetty application server. You wish to remove this extra layer by having jetty terminate TLS itself.

    Solution: Add a second connector to jetty, listening for https connections. Reconfigure consumers to talk to jetty directly, rather than talking to nginx. Once nothing is configured to talk to nginx, remove it.

    In our specific situation, we were using DropWizard 0.7, which makes it easy to add a second https connector. DropWizard 0.6 assumes that you have exactly one application connector, and it’s either http or https, but not both. We have some apps that are running DropWizard 0.6; our refactoring plan for them involves first upgrading to DropWizard 0.7, followed by repeating the steps above.
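
    For reference, a minimal sketch of a 0.7-style configuration with both connectors; the keystore details are placeholders:

    server:
      applicationConnectors:
        - type: http
          port: 8080
        - type: https
          port: 8443
          keyStorePath: /etc/ssl/app.keystore
          keyStorePassword: changeit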

    It’s not a catalog, it’s a way of thinking

    The original refactoring book presented refactoring as a catalog of recipes for achieving certain code transformations. This is a fantastic pedagogical device: by showing you example after example, you can immediately see the real-world benefit. Then once you’ve seen a few examples of refactoring, it starts to become natural to come up with more. The same tricks turn up again and again: introduce new way of doing things, migrate consumers from old way to new way, remove old way. This general scheme applies to all sorts of refactoring from simple method renames to system-wide reconfiguration.

    Data makes everything harder

    The easiest part of the system to refactor is the application server. Given a good load balancer, a properly stateless application with a good healthcheck, and a good database, you can create a completely new design of application server, add it to the load balancer pool, test everything still works correctly, and remove the old design.
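
    A sketch of the verification step, with hypothetical hostnames: check the new server directly, then through the load balancer, before retiring the old design.

    $ curl -fsS http://app-new-1.example.internal:8080/healthcheck
    $ curl -fsS https://service.example.com/healthcheck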

    Refactoring the system at the data layer is much harder. Migrating data from one DBMS to another is painful. Splitting a single database server shared between two applications into two database servers is painful. For certain types of migration, the ability to put the system into read-only mode might be necessary (or at least incredibly helpful).

    You won’t get the design right first time

    The reason that refactoring is important is that you won’t get the design right first time. You will inevitably need to make changes to accommodate new information. And so when you wistfully imagine the system as you’d like it to be, you have to discover the small steps which will get you there from the system you currently have. The way to reach a truly great design is to start with an okay design and evolve it.

    I don’t claim any of this is new

    I’m sure that these strategies have been used by operations people for years, without calling it “refactoring”. My main point is that the activities of code refactoring and system refactoring are based on the same underlying principles, of identifying small changes which do not change external behaviour in order to improve internal structure.

  • Keeping a record

    Posted on 15 February 2014

    People who work with me soon get to know that I keep meticulous records of meetings, conferences, user groups, and so on. If you haven’t seen examples, here are my notes from FOSDEM 2013 and devopsdays Paris: as you can see, I write down a lot of material. I do exactly the same at meetings in my workplace, at user groups, and suchlike.

    I do this because, like many people, I think that meetings are an incredibly unproductive way to spend time. I therefore want to ensure that the meetings that we do have count for something. I have attended too many meetings at which a decision was made but not written down, only for the team to later forget what the decision was and have to call another meeting to discuss the whole issue again. I have also seen people attend meetings where they had nothing to contribute; they only wanted to listen to the discussion and understand the outcome.

    Taking comprehensive notes solves problems like these. Decisions made are recorded, action items are recorded along with who has taken responsibility for them, and to ensure everyone has a common understanding of what took place, I email a copy of my notes to all attendees. Those who are interested in the outcome rather than the discussion no longer have to attend the meeting; they can skim my notes afterwards, saving them time.

    Having said that, I rarely read the notes that I take. I find that the act of writing things down has benefits in itself: it forces me to engage more – I can recognise when I’m drifting off because I stop taking notes; it forces me to restate the ideas I’m hearing in an abridged form, which means I’m not just passively listening but actively encoding. The combination of these effects means that even if I were to save all my notes to /dev/null, taking them would still be a massively beneficial activity, increasing my understanding and recall of what was said.

    How I take notes

    When taking notes, speed is everything. You can’t ask those present to slow down so you can capture all the detail. It’s not possible (or not possible for me, anyway) to take a full transcript of who said what, so some editorial judgement is necessary. I’m constantly trying to digest what is being said into key points to write down, filtering out fluff, rhetoric, repetition, and extraneous detail.

    I am a touch-typist. I don’t think it’s possible to take comprehensive notes without this. At a conference, I will be looking at the speaker and the slides while typing; having to look at the keyboard to find keys would slow me down far too much. Sadly, I don’t think there’s any silver bullet here; learning to touch-type is a long and difficult process, but it’s necessary to be able to take comprehensive notes at a live event.

    I take notes using org-mode for emacs. I find org-mode a great fit for note-taking, because it is fast, intensely keyboard-focussed, and provides sufficient structure to be able to manage my notes. Here’s an example of the notes I might take in a meeting:

    * meeting to discuss design of widget processing
      [2014-02-15]
    * present
      - me
      - fred
      - jim
      - sheila
    * introduction by fred
    ** we need to perform widget processing
       - required by user journey x
       - enables users to deal with widgets but export as doohickeys
       - some prior art, but none seems to exactly match what we need
    * sheila
      - what about widgets.io? they do widget processing as a service
        - expensive, but probably cheaper than development effort to
          reimplement it
        - jim: not sure we can justify sending data about sensitive
          widgets to third parties
        - sheila: we could probably anonymise user data before processing
    * fred
      - I've previously used pydgets, a python widget-processing library
      - sheila: all our code is ruby/rails/sinatra at the moment
        - fred: we could create a separate python service for it
          - communicate using HTTP and json
          - sheila: I'm skeptical that our ruby folks would be happy
            writing python
          - me: we should investigate anyway, see if it's the right tool
            for the job
    * Actions
    ** spike: investigate pydgets and python RESTful services - fred
    ** spike: investigate widgets.io and anonymisation - sheila
    

    Org-mode is great because the source format is human-readable. I don’t need to tell the recipients that my notes are in org-mode; I just paste them into an email and send them verbatim.

    The headings and bullets are my bread-and-butter of note-taking. Org-mode provides shortcuts for easy and fast manipulation of headings and bullets: C-RET new heading, M-RET new bullet, M-up/M-down move heading or bullet up/down, M-left/right promote/demote heading, C-c - convert heading to bullet, C-c C-w refile heading under different toplevel heading. These manipulation functions mean that I don’t have to stick to taking notes in chronological order; I can easily move notes around to other parts of the file.

    The first heading I make is always a list of people present; the last heading is always a list of actions. It’s worth remembering that most meetings are called to make decisions about what actions to take; by taking notes on actions, I am focussed on ensuring that the meeting isn’t drifting into endless discussion and is actually making decisions. If someone says they will do something, I capture that as a new heading and refile at the bottom, keeping all actions together for easy review.

    Why I take notes

    I take notes primarily for my own benefit. By taking notes, I force myself to listen actively, not just hearing the words that are being spoken but grappling with the concepts and ideas being talked about, trying to reword them into a concise form by getting at the essence of what’s being said. This can confuse people: once, as I started taking notes, a speaker told me “you don’t need to take notes, I’ll send you my slides”; I responded “it’s just what I do”. The psychology literature talks about note-taking having the complementary functions of “encoding” and “storage” – I primarily use notes for encoding, and treat storage as secondary.

    I also take notes so that we have a record of decisions made. If there’s any confusion later on, I can return to my record and consult it. There should also be a record of the constraints that were considered when making the decision, so that we can later determine if they are still valid or if the decision should be revisited.

    Finally, I take notes and send them out to those present because my colleagues keep giving me good feedback about them. This feedback is invaluable because I almost never read my own notes, so the only sense I get of how useful they are to read is from other people’s reports.

    The way I take notes affects the way I participate in meetings. If I can see from my notes that we haven’t agreed on an action, I will push for a decision so that my “actions” heading starts to fill up. Sometimes I create an “agenda” heading near the top as a scratch space for notes I want to talk about but haven’t yet had the opportunity. My note-taking habit has got to the point that I can’t imagine not taking notes in a meeting anymore; it just has so many benefits that it would seem ludicrous not to.


    Incidentally, this post is the first in my blog written using org-mode. Previously I have been writing in markdown, because that’s the default for jekyll, but now that I’ve got org-mode working I’m thoroughly converted. Click on the source link to the left to see the source on github.

  • The git pickaxe

    Posted on 09 February 2014

    I care a lot about commit messages. I try to write them following Tim Pope's example, using a short summary line, followed by one or more paragraphs of explanation. It's not unusual for my commit message to be longer than the diff. Why do I do this? Is it just some form of OCD? After all, who really reads commit messages?

    The reason I care about commit messages is because I'm an avid user of the git pickaxe. If I'm ever confused about a line of code, and I want to know what was going through the mind of the developer when they were writing it, the pickaxe is the first tool I'll reach for. For example, let's say I was looking at this line from our puppet-graphite module:

    exec <%= @root_dir %>/bin/carbon-cache.py --debug start

    That --debug option looks suspect. I might think to myself: "Why are we running carbon-cache in --debug mode? Isn't that wasteful? Do we capture the output? Why was it added in the first place?" In order to answer these questions, I'd like to find the commit that added the switch. I could run git blame on the file, to find the last commit that touched the line. However, that leads to a totally unrelated commit that had nothing to do with my --debug flag issue.

    So I still want to find the commit that added that --debug switch, but git blame has got me nowhere. What next? It turns out there's an option to git log which will find any commit that adds or removes a given string anywhere in its diff:

    git log -p -S --debug

    This will show me every commit that either introduced or removed the string --debug. (It's a slightly confusing example, because --debug is not being used as a command-line switch to git, but as a string argument to the -S switch instead. Nevertheless, git does the right thing.) The -p switch shows the commit diff as well. There are in fact a few matches for this search, but the third commit that comes up is the winner:

    commit 5288d5804a3fc20dae4f3b2deeaa7f687595aff1
    Author: Philip Potter <philip.g.potter@gmail.com>
    Date:   Tue Dec 17 09:33:59 2013 +0000
    
        Re-add --debug option (reverts #11)
    
        The --debug option is somewhat badly named -- it *both* adds debug
        output, *and* causes carbon-cache to run in the foreground. Removing the
        option in #11 caused the upstart script to lose track of the process as
        carbon-cache started unexpectedly daemonizing.
    
        Ideally we want to have a way of running through upstart without the
        debug output, but this will fix the immediate problem.
    
    diff --git a/templates/upstart/carbon-cache.conf b/templates/upstart/carbon-cache.conf
    old mode 100644
    new mode 100755
    index 43a16ee..2322b2d
    --- a/templates/upstart/carbon-cache.conf
    +++ b/templates/upstart/carbon-cache.conf
    @@ -12,4 +12,4 @@ pre-start exec rm -f '<%= @root_dir %>'/storage/carbon-cache.pid
     chdir '<%= @root_dir %>'
     env GRAPHITE_STORAGE_DIR='<%= @root_dir %>/storage'
     env GRAPHITE_CONF_DIR='<%= @root_dir %>/conf'
    -exec python '<%= @root_dir %>/bin/carbon-cache.py' start
    +exec python '<%= @root_dir %>/bin/carbon-cache.py' --debug start
    

    Now I know exactly why --debug is there, and I know that I certainly don't want to remove it. But what if my commit message had just been "Re-add --debug option"? I'd be none the wiser. This is why I care so much about commit messages: because I have the tools to quickly get from a piece of code to the commit that introduced it, I spend much more time reading commit messages.

    This example is also interesting because it raises another question: should this explanation have been in a code comment instead? The --debug flag is inherently confusing, and a comment could have answered my questions even quicker by being right there in the file.

    However, a 6-line comment in the file would be quite a bit of noise whenever you weren't interested in the --debug switch, whereas a commit message can be as big as it needs to be to make the explanation clear. Comments and commit messages can be complementary: there could be a one-line comment saying that --debug causes carbon-cache to stay in the foreground, and a more detailed explanation in the commit message. In some ways I see commit messages as a type of expanded commenting system which is available at your fingertips whenever you need it but automatically hides when you just want to read the code.
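
    As a sketch, that complementary one-line comment in the upstart template might look like this:

    # --debug (confusingly) also keeps carbon-cache in the foreground, which
    # upstart needs in order to track the process; see the commit which
    # re-added it for the full story
    exec python '<%= @root_dir %>/bin/carbon-cache.py' --debug start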


    A couple of small postscripts: I could have even narrowed down my search further by adding a path filter to my log command:

    git log -p -S --debug templates/upstart/carbon-cache.conf

    This search finds the commit in question instantly: it's the first result. But unlike the original git log command, it is not resilient against the file being renamed in an intervening commit. I tend not to use path filters for pickaxe searches, because I can normally find what I want easily enough anyway.

    The -S switch takes a string match only. If you want to match a regex instead, you can add the --pickaxe-regex switch.
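
    For example, to find commits touching either of two related switches (the --verbose alternative here is invented for illustration):

    git log -p --pickaxe-regex -S '--(debug|verbose)'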

  • Automating dnsmasq and resolvconf

    Posted on 07 November 2013

    I've been working a lot with dnsmasq for DNS forwarding recently, and have hit enough problems that I thought it would be worth writing about them.

    On my current project, we're using Ubuntu 12.04, which uses dnsmasq as a local DNS cacher and forwarder, and resolvconf (the service as opposed to the resolv.conf file) to manage DNS server configuration.

    dnsmasq

    Dnsmasq is a simple DNS forwarder. It proxies multiple upstream DNS servers, adds caching, and can even serve up A records from an /etc/hosts-style configuration file.

    Dnsmasq is configured by giving it an /etc/resolv.conf-style file with a list of nameservers. It will regularly poll this file for changes, and change its forwarding behaviour accordingly.

    Dnsmasq can also be configured to direct requests for particular domains to particular servers; for example, if you want everything in mycompany.com to go to your internal office server, but everything else to go to public DNS servers, dnsmasq can do that for you.
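
    That configuration is a one-liner per domain in dnsmasq.conf; the server addresses here are invented:

    # queries for mycompany.com go to the internal office DNS server
    server=/mycompany.com/192.168.1.53
    # everything else goes to a public resolver
    server=8.8.8.8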

    Dnsmasq does NOT perform recursive DNS lookups; you will still need some form of recursive DNS server in order to achieve full DNS functionality.

    resolvconf

    resolvconf is part of the ubuntu-minimal install, which means that it's considered a pretty core part of the distribution these days. It's an evolution from the traditional /etc/resolv.conf file, which lists nameservers and search domains to use when resolving DNS names to IP addresses.

    You associate a nameserver with a particular network interface with a line such as:

    echo nameserver 192.0.2.6 | resolvconf -a IFACE.PROGNAME
    

    where IFACE is an interface, and PROGNAME is the name of an associated program. For example, dnsmasq registers itself with resolvconf under the lo.dnsmasq entry. You can remove entries with resolvconf -d. Generally, you don't call resolvconf directly; instead, it is called automatically as part of bringing up a network interface, or starting a DNS service, or similar.

    Each time an interface is added or removed, resolvconf updates associated configuration files by running scripts in the /etc/resolvconf/update.d directory; one of these, libc, updates the traditional /etc/resolv.conf file.

    The problem

    This is where I get to the problem I was facing. I was trying to install and configure dnsmasq in a puppet run. However, immediately after dnsmasq was installed, I would start getting name resolution errors, and the rest of the puppet run would fail. But by the time I had logged onto the box to investigate, name resolution was working again! What was going on?

    It turns out there's a bit of a race condition when starting dnsmasq, particularly for the first time. What happens is this:

    1. /etc/init.d/dnsmasq starts the dnsmasq daemon. Dnsmasq, in its default configuration on ubuntu, looks for upstream nameservers in /var/run/dnsmasq/resolv.conf. Dnsmasq checks for the file, finds it missing, and gives up for the moment. It will poll again later.
    2. Once dnsmasq has started and returned, the init.d script registers 127.0.0.1 with lo.dnsmasq in resolvconf.
    3. resolvconf runs its updates, generating configuration for dnsmasq in /var/run/dnsmasq/resolv.conf and also changing the standard libc resolver file /etc/resolv.conf to refer only to 127.0.0.1, the local dnsmasq process.
    4. At this point, the dnsmasq service is the sole DNS server that the local resolver can see, but dnsmasq itself hasn't yet seen any upstream nameservers. Therefore it can't give any useful answers. At this point, my puppet run starts failing.
    5. After a few seconds, dnsmasq polls the /var/run/dnsmasq/resolv.conf file again and finally finds the upstream nameservers left for it by resolvconf in step #3 above.
    6. I log into the machine, try to resolve a name, and everything works.
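
    You can watch this sequence happening by hand; the paths are as per the Ubuntu packaging:

    $ sudo service dnsmasq restart
    $ cat /etc/resolv.conf              # now lists only nameserver 127.0.0.1
    $ cat /var/run/dnsmasq/resolv.conf  # upstream servers appear here, but
    $ host www.example.com              # lookups fail until dnsmasq re-polls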

    I have filed a bug at launchpad to raise this issue.

  • The curious case of statsd and netcat

    Posted on 17 May 2013

    At GDS we are using statsd, a great tool from etsy for aggregating statistics and submitting them to graphite. I was interrogating statsd about which counters it currently knew about, by piping the output of an echo command into netcat, as suggested in the statsd management interface documentation:

    $ echo counters | nc localhost 8126
    

    This worked most of the time, giving output such as:

    $ echo counters | nc localhost 8126
    { 'statsd.packets_received': 1,
      'statsd.bad_lines_seen': 1,
      foo: 0,
      bar: 0,
      baz: 1337 }
    END
    

    However, occasionally it would fail, giving no output at all:

    $ echo counters | nc 10.0.0.1 8126
    $
    

    and moreover, whenever this happened, statsd had died:

    events.js:72
            throw er; // Unhandled 'error' event
                  ^
    Error: write EPIPE
        at errnoException (net.js:883:11)
        at Object.afterWrite (net.js:700:19)
    

    What was going on? To find out, I had to go on a journey through TCP low level internals.

    TCP basics

    TCP is a transport-layer protocol of the internet. It presents an abstraction which appears to be a bidirectional continuous stream of bytes sent between two nodes on a network. This link is known as a connection, and is uniquely identified by four things: the source address and port, and the destination address and port.

    A TCP connection is bidirectional; many of the control aspects of TCP can be understood as applying to one or the other side of the connection. For example, consider the well-known "three-way handshake":

    1. Client sends SYN
    2. Server sends SYN/ACK
    3. Client sends ACK

    This can be seen as two separate channels being created. First, the client sends a SYN to set up the client→server channel, which the server ACKs. The server then sends a SYN to set up the server→client channel, which is ACKed by the client. The server's ACK of the client→server channel can be sent at the same time as the SYN to create the server→client channel, shortening this from four steps to three. (This mental model, of two independently created channels, also works for the lesser-used "simultaneous open" mode: both peers send SYN packets simultaneously, and both respond with SYN/ACKs simultaneously. Since both sides of the connection have now been ACKed, the connection is established.)

    Similarly, a TCP connection is torn down one side at a time. Alice sends a FIN packet to Bob to state that she will not send any more data along her side of the connection. However, Bob is free to continue to send data back to Alice indefinitely, and the connection does not need to close until Bob sends his FIN packet to terminate his side. It looks like this:

    • Alice and Bob have an established connection
    • Alice sends FIN to Bob, which Bob ACKs
    • Bob continues to send data to Alice
    • Bob finally sends FIN to Alice
    • Alice receives Bob's FIN, sends ACK
    • Bob receives Alice's ACK
    • Connection is now closed (I'm ignoring TIME-WAIT for simplicity)

    The important thing here is the possibility of a "half closed" connection: one where Alice has closed her side but Bob has not closed his side. Alice can no longer talk to Bob, but Bob can talk to Alice.

    Incidentally, a TCP implementation MAY perform a "half-duplex" close, where Alice tears down her connection without waiting for Bob's FIN packet. If Bob sends more data after this, Alice will not receive it; instead, she sends a RST packet to Bob to indicate that the data was not received correctly. This is documented in RFC 1122, section 4.2.2.13.

    Statsd's management port

    This brings us to statsd's management port. Statsd normally receives UDP packets containing event data on port 8125, but it can expose a TCP management interface on port 8126 to issue queries and commands operating on statsd's internal state. The port is a simple TCP connection, which accepts multiple commands across the lifetime of a connection. Here is an example session:

    $ nc 10.0.0.1 8126
    counters
    { 'statsd.packets_received': 0,
      'statsd.bad_lines_seen': 0,
      foo: 0 }
    END
    
    delcounters foo
    deleted: foo
    END
    
    counters
    { 'statsd.packets_received': 0, 'statsd.bad_lines_seen': 0 }
    END
    

    Here, I issued three commands: counters, delcounters foo, and counters once more to show the effect. Statsd responded with output to each of the commands in turn.

    The fact that statsd accepts multiple commands during the life of one connection means that statsd does not automatically close the connection; it only closes the connection when the client closes it.

    The problem

    Returning to the original problem: I issued a command to statsd using echo and netcat. I got no output from statsd, and statsd also crashed:

    $ echo counters | nc 10.0.0.1 8126
    $
    

    It turns out the problem is that the netcat I was using was aggressively closing the connection in "half-duplex" fashion: it would send the "counters" packet, send a FIN to indicate it was done, then quit. By the time statsd had responded with its data, netcat wasn't listening anymore, and the OS responded to statsd with a RST. Statsd didn't handle this error, and bailed. (This was fixed in 324267c, in 0.6.0).

    One way I thought I could make it work was using netcat's -q switch. This tells netcat to wait for a number of seconds before quitting. However, this also delays netcat sending its FIN packet, which means that the connection won't close until the end of the timeout. This means that if I set a high timeout, such as 5 seconds, the command will always take at least 5 seconds; on the other hand, if I set a low timeout, such as 1 second, I run the risk of netcat quitting before it receives the expected data. What I want, however, is for netcat to send the FIN as soon as it reaches EOF on stdin, but to quit after 5 seconds even if statsd hasn't closed its side of the connection. This way, it will close quickly if statsd responds quickly, but it will time out if statsd is too slow.

    At this point, I started experimenting with different netcat implementations on different operating systems. It turns out different netcats actually behave differently in these circumstances. Here is an evaluation of the systems I tried, running echo counters | nc localhost 8126 both with and without -q 5. There is no consistency.

    Summary of behaviour of different netcats:

    • Ubuntu 10.04 (OpenBSD netcat, Debian patchlevel 1.89-3ubuntu2)
      - Without -q: sends contents of stdin + FIN; quits immediately, doesn't wait for a response.
      - With -q 5: sends contents of stdin, waits 5 seconds, then sends FIN and quits.
    • Debian 6.0 (nc v1.10-38)
      - Without -q: sends contents of stdin, then waits for a response; never sends FIN (ie leaves the connection open).
      - With -q 5: sends contents of stdin + FIN; waits for either a FIN from statsd or 5 seconds, whichever comes first, then quits.
    • Mac OS X 10.6 (nc -h gives no version info)
      - Without -q: sends contents of stdin + FIN; waits for statsd to respond and close the connection before quitting.
      - With -q 5: does not support the -q option.

    It seems that the three netcat implementations I tested were all different in some way. The OS X nc seems most convenient -- it always closes the outgoing connection fast, but waits for the incoming data rather than quitting immediately. However, the lack of a timeout is dangerous -- without one, if statsd hangs, you will hang too. The Ubuntu nc is most painful -- you need to guess a timeout, but since you will always wait for the full timeout, you are punished for allowing a safety margin. And the Debian nc is inconsistent: sometimes it closes the outgoing connection fast, and sometimes it doesn't, depending on whether you set the -q option.

    Overall, the most convenient for screwing around is the OS X nc; but I would suggest the most robust usage is the Debian nc with a timeout set.
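
    In other words, something like this, which on the Debian nc sends the command and the FIN immediately, then gives statsd up to 5 seconds to reply:

    $ echo counters | nc -q 5 localhost 8126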

    The Ubuntu nc is singularly unfit for usage with echo in this way. What's amusing is that its man page even recommends this usage:

    $ echo -n "GET / HTTP/1.0\r\n\r\n" | nc host.example.com 80
    

    The example also omits echo's -e option, which is needed to interpret the control characters correctly. If you try this with the Ubuntu nc against any but the fastest of servers, you won't get a response:

    $ # ubuntu 10.04
    $ echo -en "GET / HTTP/1.0\r\n\r\n" | nc www.google.com 80
    $
    

    As compared to the expected behaviour:

    $ # debian 6.0
    $ echo -en "GET / HTTP/1.0\r\n\r\n" | nc www.google.com 80
    HTTP/1.0 302 Found
    Location: http://www.google.co.uk/
    Cache-Control: private
    Content-Type: text/html; charset=UTF-8
    #...etc...
    

    Summary

    There are many flavours of nc out there, each with slightly different treatments of how to close a TCP connection. If you're getting unexpected behaviour from piping echo into netcat, it may be due to odd connection teardown in your netcat.

    Acknowledgements

    Thanks to bob for helping dig some of this stuff out, and for suggesting I try different netcat implementations. Thanks to the stanford networking course for teaching me solid foundations to draw on while investigating this.