I need a new business card: Internet Services Consultant for MacBroadcast, Inc. Omaha's associates at her new venture were a little clueless about what to do with a 500 megabyte log file containing the web traffic on their servers for the past couple of months. The trouble was simple: it covered several months, it covered several different servers, and it was raw data.

I have to admit that ever since Mac people moved to the BSD platform, helping them find their way about the Unix world has been fun. Sometimes, though, it's a bit surprising to learn just how great the gap really is. The gentleman manning Omaha's NOC is competent and quite eager, but he doesn't have fifteen years of experience at the Unix keyboard. That's why I'm their consultant.

To me, this problem has two answers. The first is simple: break the data out into separate files. Fortunately, log files have a reasonably regular format, so writing a parser is a done deal. Equally fortunately, the individual fields are easy to pick out. Here's a line from my log:

12-253-167-94.client.squibble.com - - [27/Oct/2003:22:11:02 -0800] "GET /~elf/ HTTP/1.1" 200 23788 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

The fields are: client address, remote logname (the ident response), authenticated username, date, request, response code, bytes sent, referrer, and client software. Usually, spaces separate the fields, but the date has embedded spaces so it's surrounded by brackets, and the request and client strings have embedded spaces so they're surrounded by quotes.
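As a quick sanity check on that layout, awk can pull the fields apart with its default whitespace splitting; in this format the response code lands in the ninth field. A minimal sketch:

awk '{print $9}' httpd.log | sort | uniq -c | sort -rn

That prints a count of each response code, which is a cheap way to confirm the file really is in the format you think it is before doing any heavier slicing.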

If I want all of the lines from October, the command would be:

grep '^[^"]*/Oct/' httpd.log

There are two important pieces of knowledge embedded in this line. The first is that "/Oct/" is perfectly valid inside a request, and a request containing "/Oct/" could be made at any time of year, so I don't want those lines mistaken for October traffic. The second is that requests are contained in quotes while the date field isn't, and the date comes before the request on the line. So the opening of the expression says, "Find me an instance of Oct that is surrounded by slashes but has no quote anywhere before it." That throws out lines where the only /Oct/ is preceded by a quote-- that is, lines where it appears only in the request-- and keeps the lines whose date field really is in October.

The whole line would then be:

for i in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do grep '^[^"]*/'$i'/' httpd.log > httpd-$i.log; done

And then you'd have files with names like "httpd-Oct.log" with only October's log lines. But for all the months. Automatically.

When there are multiple servers logging to the same file, you do similar parsing on the request field, but the routine is the same. Every server on a multihomed box has a unique root; searching for that root is the tool of choice.

grep '"[A-Z][A-Z][A-Z][A-Z]* /~elf/' httpd.log gives me all of the lines that refer to my home, explicitly, of all the users I share my server with. Requests always start with a quote, and then a string in all caps of at least three characters, a space, and then the path to the request. "/~elf/" is the unique root for all paths requesting my stuff.

The tool of choice for processing all of this data is Analog, which internally does an incredible amount of monkey business to extract every last detail out of a log file without growing insanely-- even when handling 500MB log files. There's also Report Magic, which goes one step further and analyzes Analog's output to produce stunning and very printable reports. Clients used, most popular facilities, peak traffic times, all of that is presented. One thing it can't tell you (though it makes a reasonable estimate) is how many visitors you've had; the nature of dynamic IP allocation and so on makes that impossible. If everyone in my house visited your site, you'd still only see one visitor, because the four of us sit behind NAT and share a single IP address visible to the outside Internet.
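The same kind of estimate is easy enough to make by hand, and it's subject to exactly the caveat above-- it counts distinct client addresses, not distinct people:

awk '{print $1}' httpd.log | sort -u | wc -l

Every NAT'ted household, proxy, and dynamically reassigned address skews that number, which is why it can only ever be an estimate.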

The second answer is to use a utility that sends every line from the server to a "real" database, like SQL Server or PostgreSQL. Once you've got the data in the database, you can do anything you want with it: extract, rotate, analyze, etc. There's even an all-Unix tool for it: mod_pgsqllog.
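Once the data is in PostgreSQL, the monthly breakdown above becomes a query instead of a grep. A sketch only-- the database, table, and column names here are placeholders, not whatever schema mod_pgsqllog actually uses:

psql weblogs -c "SELECT date_trunc('month', request_time) AS month, count(*) AS hits FROM access_log GROUP BY month ORDER BY month;"

The appeal is obvious: any slice you can phrase in SQL is one query away, with no regular expressions in sight.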

I know it's the popular answer to a lot of today's website problems, but to me it has problems of its own. First, it's heavy. It requires a whole 'nother machine, and a hefty one, to handle the logging and analysis. Second, it's unreliable-- it depends not on the filesystem, which has to work for the webserver to work, but on a database, which is a big and complex piece of software all by itself. And third, it's slow. Much of the data you will be analyzing will need analysis once and only once, and then it'll be history (literally), so there's no real reason to keep the data hot for any longer than necessary.

I suppose it's a prejudice on my part that I want the data to remain human-readable, and to use text-based tools for all of my analysis. And in this case, the tools for analyzing the common log format are mature and powerful, and in my opinion better.

Joel Spolsky has an article on software biculturalism, where he points out that Unix people write programs useful to other programmers, while Windows people write programs useful to common users. The advantage that Unix people have is that they can build infinitely and very rapidly on this robust toolkit, but they have trouble getting end-users to understand what they're trying to do. The advantage Windows people have is that end-users "get" what they're trying to do almost immediately, but the underlying foundation is very hard to develop on further. This is why Linux is ten years old and still developing rapidly with very little cash, while Windows has "thrown out and started over" four iterations of their operating system (and Office) and has cost Microsoft billions of dollars to do it.

Date: 2003-12-24 12:18 am (UTC)
From: [identity profile] technoshaman.livejournal.com
Windows has "thrown out and started over" four iterations of their operating system (and Office) and has cost Microsoft billions of dollars to do it.

s/Microsoft/Microsoft's customers/
