Strongspace Back
Strongspace service has been restored. I will be posting a rather long post tomorrow about what happened, etc. No data was lost. That is thanks to ZFS. Thanks, ZFS.
Update: Strongspace has been panicking on writes to the filesystem. We’ve been talking with Sun, and here’s the latest from Ben:
The answer is that the “spacemap bug” causes errors on disk, apparently by double-writing the blocks. The pain comes when ZFS later realizes this and tries to free one of them, which is already supposed to be free. ZFS thinks that something went wrong and may corrupt data, so it panics. In fact, ZFS is apparently doing the right thing by freeing the duplicate that shouldn’t exist. The only way to get ZFS to clean this mess up itself is to let the “bug” be hit while the system has both “aok” and “zfs_recover” enabled, so that it fixes the problem rather than panicking.
The “fix” is to enable aok and zfs_recover, wait for the warning (rather than a panic) to be hit, and then do a clean shutdown, and we’re “ok”… however, this has happened once already, suggesting that multiple spacemaps are affected. We will leave it in a “recovery” state for an extended period of time, such as 24hr, and see how many instances it comes across during that time.
I am thus proceeding with this plan. The system is rebooting now, aok and zfs_recover are set. I’ll put it back into production for a period of 24 hours and we’ll watch it closely. The hope will be that we see several “free blocks freed” and that Tuesday night we can do a clean shutdown and be beyond this silliness. Fingers crossed.
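For readers wondering what "setting aok and zfs_recover" looks like in practice: on Solaris these are kernel tunables, usually set in /etc/system and picked up at the next boot. The fragment below is only a sketch of that common form, not a copy of the actual Strongspace configuration, and it assumes the standard tunable names.

```
* /etc/system fragment (illustrative sketch, not the actual Strongspace config)
* Assumption: standard Solaris tunable names; takes effect after a reboot.
* aok: turn failed kernel assertions into warnings instead of panics
set aok=1
* zfs_recover: let ZFS attempt to recover from certain otherwise fatal errors
set zfs:zfs_recover=1
```

These settings are meant to be temporary; they would normally be removed again once the recovery period is over and the system has been shut down cleanly.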
Strongspace should be back on-line shortly.

27 Responses
Is that thanks to ZFS for the bug, or for the recovery, or both?
Me, I’ll point my thanks toward BenR.
I’m not sure that it is up. I can’t connect with Interarchy or the web interface.
i also can’t seem to log in….
j
It’s not up currently:
http://help.joyent.com/index.php?pg=forums.posts&id=708&pc=1
Kristie sent email, too.
It’s running again.
It still doesn’t work for me (9h30 GMT+1). Same error as before when trying to rsync data from Excelsior to my Strongspace account:
ssh: connect to host *.strongspace.com port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(453) [sender=2.6.9]
http://help.joyent.com/index.php?pg=forums.posts&id=709&pc=1
Fix it, please.
@thomas the systems team are busy with it since this morning. Hopefully it will be back soon!
Just to echo the previous messages – the thing still isn’t working. Sub domains aren’t working, SFTP isn’t working, none of it is working…….
Uh, none of it is working and none of it has been working for DAYS!!! Why do we continue to see posted messages saying — confidently — “it’s back” or “it’s on line” when in fact, the thing is whacked. I am “hopefully” hoping the busy people will get it un-whacked soon. Please fix, please advise.
While I appreciate the hard work you guys are doing to get everything back online, I’m starting to find it unacceptable that we’ve been unable to use the service for over a week.
We rely on the service for client data transfers that are critical to our business. When our non-technical clients ask why there seems to be no redundancy built into a service that is likely used by many for business critical purposes, I find myself with no explanation.
Finally, I find the biggest issue to be the lack of communications from Joyent (outside of this blog). We had to search for all information ourselves, when you could have easily mass-mailed a notice to all customers.
Just to reiterate the other comments – I’m anxious to have this service back online, no more false starts please !!!
I actually think Joyent have behaved fairly professionally over this so far – I appreciate the tone of their communications, and I did get a couple of emails (although they perhaps took a little long to arrive).
However, the false starts are unfortunate… in a situation like this, given everything that’s happened, I would expect them to fully test things (and double check them) before claiming that things are back to normal.
Of course I can afford to be understanding, because luckily I haven’t (yet!) had a disk crash during the outage…
At some point, it’s probably time to question your choice of suppliers (Sun). The only question is whether that’s before or after your customers leave. Me? I now have an account at http://www.box.net. It’s not a mean-spirited thing, I just have clients who rely on having large files from me in a timely manner. Good luck with this!
I’m absolutely blown away that there were no backups or redundant hardware for Strongspace. From your other comments it seems you were relying on ZFS snapshots on the same physical machine to cover your ass. And to make it worse, it was running an old release which had known corruption issues.
I realize your systems team has been working hard at this, if there were backups on redundant hardware it would have been perhaps a few hours work.
I’d threaten to take my business elsewhere, but it’s unlikely you’ll care as you’ve just end of lifed both products!! I will, however, start moving my stuff off your accelerators and onto someone who can be trusted to back things up.
Anon so you don’t shut my site down out of spite.
So, Strongspace was a long, failed experiment?
I hope, hope, HOPE that the EOL messages are actually in prep for a better service that you’re going to roll out to all of us for free. However, it’s looking more and more like the headache over the past week has turned into a towel-throwing exercise.
That’s fine, but what happens to the mixed-grill guys?
I’m sure you’ve got this all worked out, but once again, it’s all mixed messages and unclear direction. Why announce EOL on the signup page BEFORE telling your paying customers what’s going on?
I’m so confused.
@Alex: please wait for my post. That will be the definitive word of product futures.
Okay, so, apart from the Web interface and some user management, StrongSpace ain’t doing anything I can’t provide to my peeps myself. Bit of DNS and a client-side app, and they can transit anything to and from one of the boxen here at the house. All StrongSpace gets me is reliability (ahem), a sense of heft, and not having to admin the thing.
So if y’all Joyos want outta SS, just say so, maybe release the scripts. I’ll bolt them onto some hosted space or something, or not. VC here. All gravy to me.
And where’s le TextDawg, anyhow? The eBay hookers I sent over to son maison bailed, texting loud complaints about les chiens méchants. Big PayPal dispute. Need testimony.
LQ
I agree re communication. Surely we should all have been emailed. Which is the same complaint made every time. Not everyone is constantly checking these sites, especially the forums.
Please email!
Secondly, there is a lot of end of life panic floating around. I’d like to know what is replacing strongspace if it is in fact going.
I hope it’s not the ‘strongspace’ inside Connector. Strongspace’s interface is infinitely better than Connector’s, for file management.
I wouldn’t go that far. Until this point they’ve been the best as far as I’m concerned. My beef is with communication at the moment.
Until we have the report on what went wrong and why it took so long to fix, there is no point judging anything else.
@digg For goodness sake, get a grip. And have the balls, or ovaries, to identify yourself.
As for communication, am I the only one who’s received a steady stream of email from Kristie?
Ryan’s right and I bet that those of us who’ve been long enough with TxD and Joyent have had something to say about their crisis communication. But fingering them like the knee-jerk Digger above is plain stupid. Shit happens in datacenter, it’s a fact of life, and I trust them to go the extra mile until it’s sorted out. They’ve consistently done it before and I’m happy I stayed here since almost the beginning. Because for me too, they’ve been the best so far.
Now those who demand “perfect crisis communication” may want to test elsewhere how shitty service with canned responses from robots can help their business better.
@andrew I too received e-mails from Kristie throughout the whole thing.
@andrew barnett & @ace: This morning I received my very first email from Joyent regarding the Strongspace issue.
I’m not sure if I’m more disturbed by not being emailed at all or by knowing that some people were emailed and I was not, which suggests to me that Joyent doesn’t even have its act together enough to have a consolidated list of customers.
(I’ll grant that it’s conceivable that any other emails sent to me from Joyent in the last week+ were caught in my spam traps and I didn’t notice them when I looked through my logs. But I think that’s highly improbable.)
@Alex: you can leave your last name. We are professionals here and would not do anything out of spite. I think we deserve a little more credit than that.
@Bob: I have been using the same mailing list for the past week when sending emails. So I would check your spam filters. If you got one, you would have gotten them all.
@Digg: I deleted your comment. There is a big difference in being frustrated and venting and just being an anonymous ass.