Surge Wrap-up

September 19, 2013 - by David Pacheco

Last week, several Joyent engineers attended Surge in Washington, D.C. The highlight for several of us was the keynote by Gene Kranz (of Apollo 13 fame) about his days at Mission Control. Far from being out of touch, Kranz's experience from forty years ago has real lessons for people building complex systems today. Throughout his presentation recounting the myriad challenges leading up to and including Apollo 13, Kranz emphasized the culture that enabled the team to succeed, focusing on trust and integrity.

As Bryan mentioned later in his talk (about scaling organizations by scaling engineers), Kranz described an important tension that applies as well to modern software engineering as it did to the space program: the incredible confidence (bordering on audacity) to try sending humans to the moon and back, balanced with the extraordinary humility that comes from daily experience with the complex systems we build (and the complex failures they experience). That need for humility really hit home for the Apollo team after the Apollo 1 fire, when the whole team took a step back to reassess their understanding of the system.

Perhaps the most insightful remark came in response to the question of how NASA hired mission controllers: they looked for people who wanted to do something rather than be something.


On a lighter note, at the Joyent booth, we collected more data (ahem) for Kartlytics, our project to bring "Big Data" analytics to Mario Kart 64.

In preparation for Surge, I added a few new features to the Kartlytics web site. If you click on an individual race, you can see a diagram of players' ranks over the course of the race.

We use that same data to generate the front-page list of photo finishes and "wildest finishes" -- that is, those having the most rank changes in the last 15 seconds of the race.

We also finally answered that question of great interest to those of us who play a lot of Kart: how does the game decide which weapons to give to each player (based on their rank)?

The first-place player is very likely to get banana peels and green shells, while the fourth-place player is pretty likely to get a star, lightning, or other great weapons (as everyone knows, of course).


Mark and I spoke about the internals of Manta, as well as the Unix history that led us to build it. We demo'd an arbitrarily scalable variant of Doug McIlroy's solution to Jon Bentley's "word frequency count" challenge:

mfind /manta/public/examples/shakespeare | \
    mjob create -o -m "tr -cs A-Za-z '\n' | \
    tr A-Z a-z | sort | uniq -c" -r \
    "awk '{ x[\$2] += \$1 }
    END { for (w in x) {
        print x[w] \" \" w } }' |
    sort -rn | sed ${COUNT}q"

You can run that (substituting some integer for $COUNT) to get a list of the top $COUNT words in our example corpus. More importantly, that code scales without modification to very large numbers of objects.
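To make the two phases easier to see, here is a rough local sketch of the same pipeline that runs on one machine with plain Unix tools and no Manta at all. The sample files, the `map` and `reduce` function names, and the temporary directory are all made up for illustration. The sketch assumes the "-m" command runs once per object and the "-r" command reads the combined output of those runs; it is not a description of how Manta actually schedules the work.

```shell
#!/bin/sh
# Local, single-machine sketch of the two-phase word count.
# Assumption: the sample files and function names are hypothetical.

COUNT=3
tmpdir=$(mktemp -d)
printf 'the cat and the hat\n' > "$tmpdir/a.txt"
printf 'The dog and the cat\n' > "$tmpdir/b.txt"

# "Map" step: split one input into lowercase words and count each word.
map() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

# "Reduce" step: sum the per-input counts, then keep the top $COUNT words.
reduce() {
    awk '{ x[$2] += $1 } END { for (w in x) print x[w] " " w }' |
        sort -rn | sed "${COUNT}q"
}

# Run the map step once per file and feed all of the output to one reduce.
result=$(for f in "$tmpdir"/*.txt; do map < "$f"; done | reduce)
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

With these two sample files the first line printed is "4 the". The point of the split is that the map step only ever sees one input at a time, so adding more inputs adds more independent map runs rather than changing the code.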

For those wanting a better overview of how to actually use Manta, Konstantin from Wanelo gave a great talk about how they use Manta for large-scale data analysis at Wanelo.

As usual, Artur gave a great talk, this time on the importance of latency in distributed systems, and how important it is to fully understand a system in order to make it perform. He also described the "commodity machines" that run

For the curious: by comparison, we traded off DRAM capacity for DRAM speed on our Mantis Shrimp boxes. Artur also asked why anyone would still use spinning disks, to which the answer is, of course: cost, and the fact that latency can be greatly reduced with a large filesystem cache!

I was particularly intrigued by Richard Crowley's talk about a custom datastore they built at Betable. As I mentioned in our talk, we've been considering alternative data stores for managing distributed job state in Manta. Like Richard, we found that there wasn't anything out there that matched what we needed, but we've spent some time thinking about what such a system would look like. (Ultimately we decided to use Postgres until we knew better why that wouldn't work, but it was interesting to hear about Betable's implementation and operational experience.)

The hallway track has always been one of the best parts of Surge. Now in its fourth year, the conference has collected its share of regulars, but also attracts plenty of new people. After I raised the doubts about our use of Postgres that I alluded to above, I had some good conversations with folks having success pushing Postgres as hard as we are. In classic Surge fashion, Mark and I also presented a terrifying problem we'd seen exactly once with Postgres replication -- and Rob Treat gave us a theory that we intend to pursue to root-cause the failure.

As usual, the lightning talks were a blast. Besides learning how to tie our shoes and take better selfies, we heard Brendan share his experience with benchmarks gone wrong and Bryan explain how a pleasure cruise became an odyssey: the epic story of "tail -f". We'll link to the videos when they're available.

Thanks to OmniTI for putting the conference together, to the other sponsors for making it happen, and to everyone who attended!