Geek tales

Dec. 4th, 2007 08:40 am
[personal profile] elfs
It was 1996 when CompuServe bought the company at which I was working, Spry, and turned us into The CompuServe Internet Division. The idea was that this Internet Thing was becoming important and might someday be as important as CompuServe's own network, and so CompuServe decided that they needed some expertise and that we had it. They wanted our advice.

CompuServe at the time had phone banks scattered in closets throughout the country that their customers used to access the central repository in Columbus, Ohio. While CompuServe did not like the Internet, they did like TCP/IP and so over that summer they moved their network to a hybrid system that would support their old protocols at the network level while also supporting IP. One of the services they used was something called RADIUS (Remote Authentication Dial In User Service), which allowed all their phone banks to do lightweight authentication of customer accounts when they dialed in.

CompuServe had reluctantly allowed Spry's 30,000 or so customers to keep their Spry accounts rather than force them to move over to the CompuServe Information Service. CompuServe at the time had ten times as many customers. Spry was ordered to use a specific RADIUS server by a specific vendor, connected to an Oracle database.

My development partner, Brad, and I worked for hours trying to get this beast to work. First of all, it was an NT product that had been "ported" to Solaris (Spry's OS of choice). It leaked memory like a sieve, eventually crashing. It had to run as root, so when it crashed it took down the whole server, not just its own process. Its communication with Oracle was flaky at best.

We had a conference with the vendor. He said, "Yeah, we know about the leak. We're not going to fix it, the Solaris product isn't a great seller. Look, just reboot the server every night. That's what the NT people do." Brad and I were aghast. "Oh, and the communications with Oracle isn't good either. Our people aren't as familiar with Solaris as they are with NT. Use the Solaris ODBC drivers from Oracle."

Under orders from bosses three layers up, we agreed to go with a restart regimen for the servers. Then we actually saw the post-evaluation bill for the ODBC drivers: $30,000 each. And we needed two: one for the primary, one for the failover.

After five weeks of this, Brad and I had had enough. We both went in one Saturday to fix the problem dead. We downloaded the reference implementation of RADIUS, a GPL'd solution. I wrote a bog-stupid forking server to watch a delivery queue and feed the requests to the Oracle database, and then take Oracle's responses and drop them into another delivery queue. Brad wrote a new back-end for the RADIUS reference server that would use my delivery queues, unpacking authentication requests from the outside world and packing up the responses. Back then, Brad was better at building robust servers, and I was a better Oracle guru.
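The shape of that design can be sketched roughly as follows. This is not the original code: it uses Python's in-process `queue.Queue` and a single thread as stand-ins for the delivery queues and the forked worker processes, and a dictionary in place of the Oracle database. Every name in it (`ACCOUNTS`, `auth_worker`, `run_demo`) is hypothetical and chosen only for illustration.

```python
import queue
import threading

# Hypothetical stand-in for the Oracle account table: username -> password.
ACCOUNTS = {"alice": "s3cret", "bob": "hunter2"}

STOP = None  # sentinel that tells the worker to shut down


def check_credentials(username, password):
    """Stand-in for the database lookup; the real system queried Oracle."""
    return username in ACCOUNTS and ACCOUNTS[username] == password


def auth_worker(requests, responses):
    """Drain the request queue, check each request, queue the verdict."""
    while True:
        item = requests.get()
        if item is STOP:
            break
        request_id, username, password = item
        ok = check_credentials(username, password)
        responses.put((request_id, "accept" if ok else "reject"))


def run_demo():
    requests, responses = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=auth_worker, args=(requests, responses))
    worker.start()

    # The RADIUS back-end side: drop requests in, then collect the answers.
    requests.put((1, "alice", "s3cret"))
    requests.put((2, "bob", "wrong"))
    requests.put((3, "mallory", "anything"))
    requests.put(STOP)
    worker.join()

    results = {}
    while not responses.empty():
        request_id, verdict = responses.get()
        results[request_id] = verdict
    return results


if __name__ == "__main__":
    print(run_demo())
```

The point of the split is that the RADIUS side only ever touches the two queues, so the database worker can fail or be restarted without the front end knowing anything about Oracle.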

I got my boss, Tim, to agree to a switchover, and we ran it a week without a hitch. Then another, then another. Finally, we took it to his boss, Stuart, and told him what had been running the authentication server all this time. He said (in his fine Scottish accent), "That sounds just like something you two would come up with. We'll leave it. Don't tell the boys in Columbus."

It ran fine for six months without any downtime at all. And then we got a call from Columbus: their entire RADIUS "server wall" was down. For 48 hours, SpryNet (as our customer base was called) was the only part of CompuServe still running.

Brad and I were ordered to fly out to Columbus to give a speech on how we had achieved so much stability that our RADIUS server never even appeared in the bugbase. Stuart said, "You're going to have to tell them. You always say working code is better than theory." Brad and I went.

In CompuServe's buried bunker, with its halon systems and its fluorescent lighting, Brad and I showed our system: two Solaris boxes running the RADIUS servers, and two Solaris boxes running Oracle databases, with one cold spare. We explained to them the premise of the system, the implementation decisions we had made, and the monitoring kit we'd put around it. We made a pitch for the open source nature of the server, explaining that we'd had to publish the queue manager because it was written directly into the RADIUS server, but showing that the tradeoff had been worth it: several contributors had sent back fixes for range checking and buffer overflows that we'd missed.

To say they were horrified is to downplay their reaction. They were an all-NT house. They were completely beholden to Microsoft for their software, and couldn't possibly consider installing Solaris. They had thousands of NT boxes; shifting to Microsoft products had been upper management's brilliant idea for modernizing CompuServe and bringing it out of its dark ages of six-bit CPUs and DEC PDP-11s. Going to Unix, an OS 20 years older than NT, would be "a step backwards."

Worse yet, we had given away CompuServe intellectual property. We had developed something in-house and just given it away without licensing agreements. I said there was a licensing agreement with the software: it was called the GPL, and agreeing to it had saved SpryNet $90,000 (the commercial RADIUS servers were $15,000, plus $30,000 for the Oracle ODBC shims, each) and gotten us some very valuable tech support besides. One of CompuServe's techs actually said, "The GPL's not a real license."

They were very proud of their "RADIUS Wall": 50 RADIUS servers fronting 50 Oracle servers, with a 10-and-10 collection of spares. They had had a catastrophic failure in one Oracle NT server, and the resulting failure had corrupted the copies on all fifty. That was the meltdown that had thrown them off the air until they could reinstall enough NT and Oracle instances and restore from backup to ensure a decent quality of service. It had taken them two 24-hour days of nonstop manual labor to recover. One of the engineers said, "We have 10 times as many customers as you guys. We couldn't do it your way."

I did not point out they had 25 times as much hardware, and they still couldn't keep it going. I did not point out that they had marginally higher labor costs because of the manpower needed to maintain their server farm. I did not point out that they had spent $15,000 per server for that commercial junk RADIUS server, whereas we had gotten ours more or less for free. I did not point out that our customer base hadn't just spent 48 hours wondering when they'd get their next Internet fix.
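For anyone checking the numbers, here is a quick tally that uses only the figures quoted in this post. It assumes the cold spare counts as one of our boxes and that all twenty of CompuServe's spares count toward their total; counted that way the ratio comes out at 24, close to the "25 times" I had in my head.

```python
# Hardware counts as described in the post.
spry_boxes = 2 + 2 + 1                # RADIUS + Oracle + one cold spare
compuserve_boxes = 50 + 50 + 10 + 10  # RADIUS wall + Oracle + 10-and-10 spares

hardware_ratio = compuserve_boxes / spry_boxes

# Quoted prices, per machine.
radius_license = 15_000  # commercial RADIUS server
odbc_driver = 30_000     # Oracle ODBC driver

# We needed two of each: one for the primary, one for the failover.
spry_savings = 2 * (radius_license + odbc_driver)

print(hardware_ratio)  # 24.0
print(spry_savings)    # 90000
```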

Ultimately, CompuServe learned nothing from our experience. This became a pattern with them: they would ask our advice on something since we were their Internet experts, they would listen, and then do nothing with the advice we gave them. They had bought us for our Internet cachet, not our expertise, after all.

I learned a lot: about writing servers, about the undeniable value free software has to the infrastructure of the Internet, about how using the GPL can actually save a company money if used correctly and honestly, about dealing with databases, and about dealing with managers. I'm still not good at the last because I've come to understand that managers are irrational people caught between two crushing stones: the one from the top that controls the money, and the one from underneath that may bring creative destruction raining down.

Date: 2007-12-04 06:25 pm (UTC)
From: [identity profile] antonia-tiger.livejournal.com
I remember, in the nineties, occasional references on Demon Internet, to their use of RADIUS servers.

It wouldn't surprise me if they were looking at your work, maybe even using it. A friend who worked there described their system as a work of insane genius. As customers we had fixed IP addresses with dial-up. The way the UK phone system works, they were able to set up a single phone number that connected from anywhere in the country to a single bank of modems. So the RADIUS system did authentication, and then a miracle occurred which set up IP routing to whichever modem you were connected to.

(I over-simplify: the physical modems were divided between at least two sites.)

I'd be astonished if your work didn't feed into that.

And you write such good fiction too. Is there no end to your talents?

Might have been less difficult than you think...

Date: 2007-12-04 06:51 pm (UTC)
From: [identity profile] danlyke.livejournal.com
I'm trying to remember how the Livingston Portmasters worked, but I think that if you had them set up to query a RADIUS server they could assign a PPP or SLIP address that was any address on the network the Portmaster was plugged into.

I know we used 'em that way when Chattanooga On-line was small, but that may have been when it was small enough that each phone bank talked to its own Portmaster. I left for the west coast before it had grown much bigger than that.

Not that it makes tying all of those technologies together any less impressive. These days we spend so much time working around limitations on technology that has been sold to the higher-ups that we don't get to do much really cool ad-hoc stuff.
From: [identity profile] zonereyrie.livejournal.com
Yes, RADIUS can assign an IP address and the PM will use it. But it does have to be within the routing domain for that PM or the traffic isn't going anywhere, of course. (I worked for Livingston 95-98.)

Date: 2007-12-04 07:07 pm (UTC)
From: [identity profile] elfs.livejournal.com
The only things we contributed to the RADIUS reference server were quite a few bugfixes and a handler that sent the request into a shared memory queue, and that then listened on the queue for a poke from some backend authentication server.

If your company used the reference server, it's entirely possible you got our bugfixes, but the queues were pretty implementation- (and Solaris-) specific, and one thing we did not have to publish was our authentication server (the thing I wrote that dialogued with Oracle), since it wasn't a part of the RADIUS server (in the same way that a browser is not part of Apache).

Still, I wouldn't be surprised if someone figured out other ways of using them.

Elf Sternberg