November 19, 2010 - by konobikun
Several years back, when I was working for a company that sold high-performance appliances, we received a report from a customer that performance was atrocious: several times below our advertised throughput. Obviously we needed to work out what was going on. To the toolsmobile!
Thankfully the appliances had the ability (with the customer's permission) to dial home to our engineering centre, giving us access to the machine through a variety of tunnelling methods. Using some custom pre-installed tools, we verified that the appliance was configured correctly and that the level of performance we expected was achievable. However, the statistics the appliance collected made it obvious that the customer wasn't seeing this performance at their install site.
We had reached the limits of our pre-installed tools designed to debug these sorts of issues, so now we had to investigate deeper.
From the statistics available to us, we could see that each transaction was taking a long time to process, and due to queue limits built into the appliance we were hitting an exponential back-off routine designed to kick in when the appliance was under high load. The appliance, however, was showing minimal load, so we were really confused. After a day or so of rummaging through the system we couldn't see anything obvious at the software or hardware level, so even deeper debugging was required.
Another day later and we had hand-installed about one hundred logging and timing probes to trace the path of a transaction through the system. After an hour or two we finally had an indication: during the network phase of a transaction, huge periods of time (in the region of 5-10 minutes) were passing before processing moved along. Hmmm. Curiouser and curiouser.
Another four or five hours later we had narrowed the affected network traffic down to DNS queries, so off we went to run 'dig google.com.' And bam! The result came back instantly. Wait, huh? After trying a bunch of different query types (A/NS/TXT/PTR), the result was the same. The system made many different types of requests, but none of them showed any sign of timing out. The hell?
Having had some experience with DNS at the protocol level from helping a friend debug some custom DNS software, I decided to ignore the DNS tools, get to the heart of the situation and just look at the packets. To achieve this I ran tcpdump on the appliance, capturing DNS traffic on several ports and interfaces, and dumped about 30 minutes' worth of traffic into a pretty huge file. I then retrieved the file and used Wireshark to give me a more palatable way to debug.
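The capture workflow was roughly as follows; this is a sketch, and the capture filename and the specific addresses queried are illustrative rather than taken from the original setup:

```shell
# Capture DNS traffic (port 53) on all interfaces into a pcap file,
# full packets (-s 0), to be pulled off the box and opened in Wireshark.
tcpdump -i any -s 0 -w dns.pcap port 53

# Meanwhile, issue manual queries of the various types the system uses;
# each of these came back instantly.
dig google.com A
dig google.com NS
dig google.com TXT
dig -x 8.8.8.8        # PTR lookup (address is illustrative)
```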
Looking through the stream of requests I could see the requests being sent to the on-site DNS server, but with a very long wait until the response arrived. I couldn't understand why this was happening, since running the requests myself was instantaneous. So back onto the server I went and combined tcpdump with manual DNS requests. Again I found that the requests were instantaneous, so off to download the dump file I went.
This time the stream was the same, but without the long delays. Funnily enough, during this period a "real" request was sent, and I decided to compare the contents of the packets. Ah ha! There was a difference. The "real" requests' DNS packets were much larger than my test requests, and my brain suddenly twigged: the "real" requests were being sent with EDNS headers while my test requests were plain DNS.
Having worked this out, I adjusted my test request to 'dig +edns google.com' and finally saw the same long delay. Knowing that EDNS was the issue, I remembered that ISC BIND (which we were running on the appliance) will by default always use EDNS for recursive requests. I therefore figured the issue was with the DNS server at the customer's install site, and after some emails back and forth with the customer we learned that they were running a Windows 2000 Server instance for their DNS.
After rolling through 30-50 forum posts on a variety of Windows support sites, I finally had my answer. It appears that when the DNS service on Windows 2000 Server (at some service pack level) receives an EDNS request, it just derps. Instead of refusing the request or notifying downstream that it doesn't understand, it held the connection open until the socket's timeout was reached. Awesome! An unhandled exception that didn't clean up after itself. Thankfully and somewhat logically, when BIND receives a timeout for an EDNS request it falls back to a basic DNS request. While useful and totally understandable, that behaviour makes things confusing at a higher level.
Responses to the forum posts indicated that a service pack update would add EDNS support to the system and make it work. However, knowing that the customer was none too pleased about their situation, we suggested that they apply the service pack, but promised that we'd have a workaround for them within a day.
A couple of hours later, we had updates to our configuration and templating system that disabled EDNS queries in BIND, and the customer's appliance rocketed through the backlog until the expected performance level was met. Within a week, we had a fix in place that allowed customers to disable EDNS requests from our interface, and the update was rolled out.
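As a sketch, the BIND side of such a workaround looks something like this in named.conf; the 'edns no;' server clause is real BIND syntax, while the catch-all address ranges here are illustrative rather than the actual templated configuration:

```
// named.conf fragment: tell BIND never to use EDNS when talking to
// any upstream server (catch-all IPv4 and IPv6 match lists)
server 0.0.0.0/0 {
    edns no;
};
server ::/0 {
    edns no;
};
```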
We had found ourselves on quite the path to work out what the issue was, and it ended up not even being a problem with our system! However, this experience taught me a valuable lesson: even though you may think you know how something should work, unless you're willing to get right down to the core of the problem, you're going to get kicked in the ass. The other valuable lesson was that even though the problem wasn't with our system, going out of your way to make the customer's experience as good as possible reaps rewards. After this very hand-held support experience, the customer went on to purchase several more appliances and other products from the company.
This was an issue that affected a customer and took 5-6 days of debugging to find a solution. While this appliance was based on FreeBSD, DTrace probes weren't available to us. I'm confident that having that extra tooling in place would have shaved a couple of days off that time, and those couple of days would have made the support process much easier from a customer management point of view.
So what should you take from this entry? Tooling, tooling, tooling! Know your tools, know your protocols and make sure you have the ability to roll out your fixes quickly and effectively.
Photo by Piotr Esden-Tempski.