Some of you may have noticed that BitSyncHub (and my company homepage) went down for almost three days this week.
Now, I’m proud of my skills at building highly available, fault-tolerant, state-of-the-art, bleeding-edge systems, deployed on OS-, vendor- and geographically redundant infrastructure, that never go down. Sometimes a failover may take a short while, or parts of a system may get bogged down and queues build up, but the system, while on its knees, must never fail.
So how, you ask, do your own homepage and the single service hosted there go down for several days?
Well, let’s just say all systems I build have a single point of failure in the actual execution stage – me. When I set up the homepage for my new company, I just tossed something up right there on a virtual host; installed lighttpd and emacs, created the index page, done. When I added BitSyncHub, it was for my own use, so again, I just installed uWSGI and Celery, created the service, and then, as an afterthought, I made it public so that others could use it.
Since this was all installed on an OpenVZ server, backed up and secure, I thought that I was fine – nothing mission critical, almost no users, and if anything happened, there was always the backup.
The backup kept me safe from everything except meatware errors.
When I noticed a spike in usage of BitSyncHub, I figured I should perhaps secure the service against extended downtime – just because a service is free doesn’t mean it should be unreliable. I was thinking that I might set a new server up and try out Docker at the same time. Let’s enter my mind at the time:
So, I’ll fire up a new OpenVZ instance and… oh, right, wrong kernel version, I need a VMware instance to be able to choose the kernel. Dum-de-dum – let’s remove the new server. Click-click-click. There, all gon… hm. It’s still there, huh. [SUDDEN SENSE OF DOOM] Eh – where’s my production server?
Yeah, I removed the wrong server, and with it I also removed the backups. Suddenly, the homepage, BitSyncHub, and the machinery they had been running on were gone – forcing me to recreate it all from scratch.
So if you find any bugs that suddenly appeared in the BitSyncHub service, drop me a mail and I’ll fix them. I’ll be over here contemplating the fact that if you consider yourself Hot Stuff when it comes to high availability and resilience on customer projects, maybe you should apply that to your own products as well.