Strongspace and BingoDisk: Update

High-level Explanation

Joyent is working to bring our Strongspace and BingoDisk products back on-line after they were taken offline this past Saturday (January 12). The cause was instability (e.g. read errors, checksum failures) that Joyent has experienced with Sun Fire X4500 hardware (aka “Thumper”), which was exposed only after a series of upgrades to the operating system itself. A ZFS bug prevented a speedy recovery. No other Joyent products have been affected. Strongspace and BingoDisk are the only services running on X4500s. The rest of our X4500 inventory is used for cold backups of Joyent server nodes.

ZFS continues to be a competitive advantage for Joyent and our customers. Typical nodes at Joyent running ZFS can recover from crashes in a matter of seconds. As noted above, the interruption of Strongspace and BingoDisk has not affected other Joyent services.

Technical Explanation

The nature of the Strongspace and BingoDisk products has complicated the recovery of these services. Both services run together on a single Sun Fire X4500. The X4500 is a dual-socket server/storage device with 48 drives of 500GiB each. It has been pointed out elsewhere that we were running an older version of the OpenSolaris operating system on this X4500. That is true. However, since this particular X4500 housed two live services (rather than just backups), we had been waiting to upgrade it in anticipation of some software updates that are in the pipeline for OpenSolaris itself. The improvements include an improved “scrub” (which can currently stall or hang), faster ZFS listing (listing datasets can sometimes take more than an hour on machines such as an X4500 with sizeable data), the ability to recursively replicate datasets, and graceful recovery when a device (i.e. drive) fails.

Unfortunately, OpenSolaris does not currently provide a straightforward upgrade process from build to build. If all the stars aligned, an upgrade would take about six hours. Realistically, we estimated we would have needed to schedule a multiday downtime, given the historical uncertainties around importing zpools from older versions of ZFS into newer versions. Further, the sheer amount of data managed by these services means that moving data around for recovery purposes takes a lot of time. We’re up against the laws of physics.

As predicted, the upgrade of the operating system, in response to the interruption, went relatively smoothly. We have been working for 3 days to get the data usable. We have swapped the X4500 chassis completely. Work continues.
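
To put the “laws of physics” remark in rough numbers, here is a back-of-the-envelope sketch, not a measurement. It uses the raw capacity of the chassis (48 drives of 500GiB each, per the description above) as an upper bound on the data to move. The transfer rates are illustrative assumptions, and the function names are made up for this example. The actual amount of customer data and the usable pool size are not stated here.

```python
GIB = 1024 ** 3          # bytes in one GiB
MB = 1000 ** 2           # bytes in one (decimal) MB
SECONDS_PER_DAY = 86400


def raw_capacity_bytes(drives: int = 48, drive_gib: int = 500) -> int:
    """Raw capacity of the chassis: an upper bound, not the real data size."""
    return drives * drive_gib * GIB


def transfer_days(total_bytes: float, rate_mb_per_s: float) -> float:
    """Days needed to move total_bytes at a sustained rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = total_bytes / (rate_mb_per_s * MB)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    total = raw_capacity_bytes()
    # Hypothetical sustained rates, chosen only for illustration.
    for rate in (25, 50, 100):
        print(f"{rate:>3} MB/s -> {transfer_days(total, rate):.1f} days")
```

Even under generous assumptions the answer is measured in days, which is why any stall or restart during a copy is so costly.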

Conclusion

We continue to work to restore Strongspace and BingoDisk. We are still in the midst of the data recovery process. I will continue to update this post as I learn more. If you have questions, please post a comment and we’ll try to answer.

Update (Thursday, 17 January, 6:30am Pacific)

In layman’s terms, we have been struggling to get the data off affected Thumpers (X4500s) without the system locking up. However, there’s positive news to report this morning. From the trenches (from Ben Rockwood):

Around 6:30pm (Wednesday, 16 January, Pacific) Mark [Mayo] brought over his awesome Perl script which manages parallel rsync (which he implemented in San Diego and has used with great success elsewhere). He kicked off that process but shortly thereafter data ceased moving and everything locked up. This was almost identical to what I experienced the night before (although I was using zfs send/recv in a loop rather than rsync). In both cases all IO ceased and everything blocked behind a single disk. The frustrating thing was that although both events were identical, they were blocked on different disks.

I recalled attacking various read error problems on Alpha in San Diego, similar to what we’re seeing on Thumper2, and re-invested myself in that research. As a result I’ve disabled NCQ on Thumper2. SO FAR the result is extremely positive! I imported the pool and have been using Mark’s uber-rsync to move data off and so far not a single bus reset or read error!! Fingers crossed!

I’m heading to bed now, but I’m really encouraged with the progress right now. Data is moving off cleanly without errors at a rate of 50MB/s!!! This is a significant improvement over the 25MB/s I was getting previously.
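
For readers wondering what “parallel rsync” means in practice, here is a minimal sketch of the general idea. It is not Mark’s Perl script, which is not reproduced in this post. The directory layout, the destination string, the worker count, and all function names are hypothetical. It assumes that each top-level directory under a source path can be copied independently, and it runs one rsync process per directory with a bounded number of workers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def build_rsync_cmd(src_dir: Path, dest: str) -> List[str]:
    """Build one rsync invocation for a single directory.

    -a is rsync's archive mode; the trailing slash on the source copies
    the directory's contents into dest/<name>/.
    """
    return ["rsync", "-a", f"{src_dir}/", f"{dest}/{src_dir.name}/"]


def copy_one(src_dir: Path, dest: str, run: Callable) -> Tuple[str, int]:
    """Run rsync for one directory and return (name, exit code)."""
    result = run(build_rsync_cmd(src_dir, dest))
    return src_dir.name, result.returncode


def parallel_rsync(src_root, dest: str, workers: int = 4,
                   run: Callable = subprocess.run) -> Dict[str, int]:
    """Copy each subdirectory of src_root with its own rsync process.

    `run` is injectable so the logic can be exercised without rsync.
    Returns a mapping of directory name to rsync exit code.
    """
    dirs = sorted(p for p in Path(src_root).iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda d: copy_one(d, dest, run), dirs))
```

A caller would check the returned exit codes and retry any directory whose code is non-zero. This sketch has no timeout, so it would not by itself detect the kind of stall described above.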

I will provide another update later this morning.

Many of you have asked about the compensation Joyent will be providing to those affected by this outage. It will be generous. I’d like to get service restored before I announce our compensation plan.

Update (11am, Thursday, 17 January)

A further update from Ben Rockwood:

We’ve been fighting and fighting with Thumper2. Previous attempts to copy data off went for a while and then wedged. However, this morning, I’m excited to say the transfers started last night at 3AM are still running!

And! No read errors like we were getting piles of earlier!!! With over 6 hours of transfer not a single error!

When the data has been recovered, we can restore service. I’ll continue to update.

Update (7:45am, Friday, 18 January)

We’re expecting to be able to come back on-line by mid-afternoon today (Pacific time).

Update (9:00am, Sunday, 20 January)

As noted yesterday, BingoDisk is up. This from Ben, regarding Strongspace:

The final resilver on Thumper2 is approx 8 hours from completion. We’ll bring the services back online early afternoon. No further complications are expected from this point on.

Strongspace will be back on-line today.
