Strongspace Back

Strongspace service has been restored. I will be writing a rather long post tomorrow about what happened. No data was lost. That is thanks to ZFS. Thanks, ZFS.

Update: Strongspace has been panicking on writes to the filesystem. We’ve been talking with Sun, and here’s the latest from Ben:

The answer is that the “spacemap bug” causes errors on disk, apparently by double-writing the blocks. The pain comes when ZFS later realizes this and tries to free one of them, which is already supposed to be free. ZFS thinks that something went wrong and that it may corrupt data, so it panics. In fact, ZFS is apparently doing the right thing by freeing the duplicate that shouldn’t exist. The only way to get ZFS to clean this mess up itself is to let the “bug” be hit while the system has both “aok” and “zfs_recover” enabled, so that it fixes the problem rather than panicking.

The “fix” is to enable aok and zfs_recover, wait for the warning (rather than a panic) to be hit, and then do a clean shutdown, and we’re “ok”… however, this has happened once already, suggesting that there are multiple spacemaps affected. We will leave it in a “recovery” state for an extended period of time, such as 24 hours, and see how many instances it comes across during that time.

I am thus proceeding with this plan. The system is rebooting now, with aok and zfs_recover set. I’ll put it back into production for a period of 24 hours and we’ll watch it closely. The hope is that we see several “free blocks freed” warnings and that Tuesday night we can do a clean shutdown and be beyond this silliness. Fingers crossed.
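For anyone wondering what “setting aok and zfs_recover” looks like in practice: on Solaris these two tunables are usually set in /etc/system and take effect at the next boot. The fragment below is only a sketch of that common approach. The message above does not say how the flags were actually set on the Strongspace machines, so treat the file location and the comment text as assumptions.

```
* /etc/system fragment (assumed location; lines starting with * are comments)
* Turn failed kernel assertions into warnings instead of panics
set aok=1
* Let ZFS attempt to recover from otherwise fatal on-disk inconsistencies
set zfs:zfs_recover=1
```

Both settings are meant to be temporary: once the warnings stop appearing and the system has been shut down cleanly, the lines would normally be removed again.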

Strongspace should be back on-line shortly.
