Geek tales
Dec. 4th, 2007 08:40 am
It was 1996 when CompuServe bought the company at which I was working, Spry, and turned us into The CompuServe Internet Division. The idea was that this Internet Thing was becoming important and might someday be as important as CompuServe's own network, and so CompuServe decided that they needed some expertise and that we had it. They wanted our advice.
CompuServe at the time had phone banks scattered in closets throughout the country that their customers used to access the central repository in Columbus, Ohio. While CompuServe did not like the Internet, they did like TCP/IP and so over that summer they moved their network to a hybrid system that would support their old protocols at the network level while also supporting IP. One of the services they used was something called RADIUS (Remote Authentication Dial In User Service), which allowed all their phone banks to do lightweight authentication of customer accounts when they dialed in.
CompuServe had reluctantly allowed Spry's 30,000 or so customers to keep their Spry accounts rather than force them to move over to the CompuServe Information Service. CompuServe at the time had ten times as many customers. Spry was ordered to use a specific RADIUS server by a specific vendor, connected to an Oracle database.
My development partner, Brad, and I worked long hours trying to get this beast to work. First of all, it was an NT product that had been "ported" to Solaris (Spry's OS of choice). It leaked like a sieve, eventually crashing. It had to run as root, so when it crashed it took down the whole server, not just its own process. Its communication with Oracle was flaky at best.
We had a conference with the vendor. He said, "Yeah, we know about the leak. We're not going to fix it, the Solaris product isn't a great seller. Look, just reboot the server every night. That's what the NT people do." Brad and I were aghast. "Oh, and the communications with Oracle isn't good either. Our people aren't as familiar with Solaris as they are with NT. Use the Solaris ODBC drivers from Oracle."
Under orders from bosses three layers up, we agreed to go with a restart regimen for the servers. Then we actually saw the post-evaluation bill for the ODBC drivers: $30,000 each. And we needed two: one for the primary, one for the failover.
After five weeks of this, Brad and I had had enough. We both went in one Saturday to fix the problem dead. We downloaded the reference implementation of RADIUS, a GPL'd solution. I wrote a bog-stupid forking server to watch a delivery queue and feed the requests to the Oracle database, and then take Oracle's responses and drop them into another delivery queue. Brad wrote a new back-end for the RADIUS reference server that would use my delivery queues, unpacking authentication requests from the outside world and packing up the responses. Back then, Brad was better at building robust servers, and I was a better Oracle guru.
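For the curious, the shape of that two-queue arrangement can be sketched in a few lines of Python. This is only an illustration of the idea, not the code we ran: it uses threads and in-memory queues where the original was a forking server, and the `ACCOUNTS` dictionary, `check_credentials`, and `run_batch` are hypothetical stand-ins (the dictionary takes the place of the Oracle lookup).

```python
import queue
import threading

# Hypothetical stand-in for the Oracle account table; the real system queried Oracle.
ACCOUNTS = {"alice": "s3cret", "bob": "hunter2"}

STOP = object()  # sentinel telling a worker to exit


def check_credentials(username, password):
    """Stand-in for the database lookup: True if the password matches."""
    return ACCOUNTS.get(username) == password


def worker(requests, responses):
    """Pull requests off one queue, answer them, push results onto the other."""
    while True:
        item = requests.get()
        if item is STOP:
            break
        request_id, username, password = item
        accepted = check_credentials(username, password)
        verdict = "Access-Accept" if accepted else "Access-Reject"
        responses.put((request_id, verdict))


def run_batch(batch, n_workers=2):
    """Feed a batch of (id, user, password) requests through the two queues."""
    requests, responses = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(requests, responses))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in batch:
        requests.put(item)
    for _ in threads:
        requests.put(STOP)  # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have exited, so the response queue is complete.
    results = {}
    while not responses.empty():
        request_id, verdict = responses.get()
        results[request_id] = verdict
    return results
```

The point of the split is that the front end (the RADIUS server) only ever touches the queues, so a slow or wedged database connection stalls a worker rather than the process answering the phone banks.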
I got my boss, Tim, to agree to a switchover, and we ran it a week without a hitch. Then another, then another. Finally, we took it to his boss, Stuart, and told him what had been running the authentication server all this time. He said (in his fine Scottish accent), "That sounds just like something you two would come up with. We'll leave it. Don't tell the boys in Columbus."
It ran fine for six months without any downtime at all. And then we got a call from Columbus: their entire RADIUS "server wall" was down. For 48 hours, SpryNet (as our customer base was called) was the only part of CompuServe still running.
Brad and I were ordered to fly out to Columbus to give a speech on how we had achieved so much stability that our RADIUS server never even appeared in the bugbase. Stuart said, "You're going to have to tell them. You always say working code is better than theory." Brad and I went.
In the CompuServe buried bunker, with its halon systems and its fluorescent lighting, Brad and I showed our system: two Solaris boxes running the RADIUS servers, and two Solaris boxes running Oracle databases, with one cold spare. We explained to them the premise of the system, the implementation decisions we had made, and the monitoring kit we'd put around it. We made a pitch for the open source nature of the server, explaining that we'd had to publish the queue manager because it was written directly into the RADIUS server, but showing that the tradeoff had been worth it: several contributors had sent back fixes for range checking and buffer overflows that we'd missed.
To say they were horrified is to downplay their reaction. They were an all-NT house. They were completely beholden to Microsoft for their software, and couldn't possibly consider installing Solaris. They had thousands of NT boxes; shifting to Microsoft products had been upper management's brilliant idea for modernizing CompuServe and bringing it out of its dark ages of 36-bit CPUs and DEC PDP-11s. Going to Unix, an OS 20 years older than NT, would be "a step backwards."
Worse yet, we had given away CompuServe intellectual property. We had developed something in-house and just given it away without licensing agreements. I said there was a licensing agreement with the software: it was called the GPL, and agreeing to it had saved SpryNet $90,000 (the commercial RADIUS servers were $15,000, plus $30,000 for the Oracle ODBC shims, each) and gotten us some very valuable tech support besides. One of CompuServe's techs actually said, "The GPL's not a real license."
They were very proud of their "RADIUS Wall": 50 RADIUS servers fronting 50 Oracle servers, with a 10-and-10 collection of spares. They had had a catastrophic failure in one Oracle NT server and the resulting failure had corrupted the copies in all fifty; that had been their meltdown that had thrown them off the air until they could re-install enough NT instances and Oracle and restore from backup to ensure a decent quality of service. It had taken them two 24-hour days of nonstop manual labor to recover. One of the engineers said, "We have 10 times as many customers as you guys. We couldn't do it your way."
I did not point out they had 25 times as much hardware, and they still couldn't keep it going. I did not point out that they had significantly higher labor costs because of the manpower needed to maintain their server farm. I did not point out that they had spent $15,000 per server for that commercial junk RADIUS server, whereas we had gotten ours more or less for free. I did not point out that our customer base hadn't just spent 48 hours wondering when they'd get their next Internet fix.
Ultimately, CompuServe learned nothing from our experience. This became a pattern with them: they would ask our advice on something since we were their Internet experts, they would listen, and then do nothing with the advice we gave them. They had bought us for our Internet cachet, not our expertise, after all.
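The arithmetic behind those numbers is easy to check. A quick back-of-the-envelope version in Python, using only the figures quoted in this post and assuming one machine per server with spares counted (the variable names are just for illustration):

```python
# Figures quoted in the post.
RADIUS_LICENSE = 15_000   # commercial RADIUS server, per server
ODBC_DRIVER = 30_000      # Oracle ODBC driver, per server
SPRY_SERVERS = 2          # primary plus failover

# What the commercial route would have cost SpryNet.
savings = SPRY_SERVERS * (RADIUS_LICENSE + ODBC_DRIVER)

# Machine counts, spares included (assumes one box per server).
columbus_boxes = 50 + 50 + 10 + 10   # RADIUS + Oracle + spares for each
spry_boxes = 2 + 2 + 1               # RADIUS + Oracle + one cold spare

hardware_ratio = columbus_boxes / spry_boxes

print(savings)         # 90000
print(hardware_ratio)  # 24.0
```

So the $90,000 is two servers' worth of licenses and drivers, and the "25 times" is a round-up of 120 boxes against our five.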
I learned a lot: about writing servers, about the undeniable value free software has to the infrastructure of the Internet, about how using the GPL can actually save a company money if used correctly and honestly, about dealing with databases, and about dealing with managers. I'm still not good at the last because I've come to understand that managers are irrational people caught between two crushing stones: the one from the top that controls the money, and the one from underneath that may bring creative destruction raining down.
no subject
Date: 2007-12-05 09:37 am (UTC)
I've accidentally bumped into the wonders of Kerberos at work, and have been mucking around trying to get it to work. NVIDIA is a PC-centric organization, which makes life for the Mac developers tougher. All the PCs have single sign-on via NTLM, but the Macs just flap in the breeze, with everyone typing in their credentials every five minutes.
I happened to discover that one of the LDAP tools that ships with Mac OS X will transparently authenticate via Kerberos. After futzing around for about an hour with 'kinit' and friends, I got an LDAP query through without having to enter a password.
Then I found out that Firefox can also authenticate via Kerberos, and excitedly plowed straight into a brick wall of failure. I couldn't get it to work. Oh well...
no subject
Date: 2007-12-21 08:04 pm (UTC)