So, yesterday I managed to get down to the University of Washington library and access one of the two hard-core geek papers I was looking for. I found the entire exercise a bit surreal; the paper is kept on a repository in Israel and the University of Washington has a library subscription to the repository. That subscription costs thousands of dollars a year. But because I was willing to go through the rigmarole of walking across the campus, I was allowed to download the paper and email a copy to my home.
In any event, the paper I have is the root of a meme: Automatically Categorizing Written Texts By Author Gender. Y'know what? The algorithm published in the New York Times, the one that everyone adapted for their cutesy little web pages and stuff, is for fiction. If you apply it to non-fiction the results are, according to the authors, no better than random. Even done right, it's still only 78% accurate. The most accurate systems are "trained" using a "winnowed Bayesian filter" and, even after having hundreds of documents passed through them, still have only an 82.6% chance of accurately deciding if the author of a non-fiction document was of a certain sex, and a 79.5% chance for a fiction document.
Computer-based stylometry has a long way to go. For the other paper I wanted, Towards Biometric Security Systems, I would have to drive to Corvallis, and I'm not quite ready to go that far.
As I was walking back across campus to an afternoon get-together with shemayazi, I passed through a building with several sound-proofed "study rooms." Inside one, despite the sound-proofing, I could hear thumpy music and a strong, feminine voice counting out numbers. Coming around the corner, I saw four women of, I would guess, Vietnamese descent all moving to a beat, four bodies in an intense synchronized sequence of movements, all wearing practice spandex and loose comfortable tops, one chanting, one having trouble keeping up, and two so tightly in formation they may as well have been wired together. It was very strange. Interesting to watch, though.
I reached Shemayazi's home only to discover that her day had been, well, rough. It's her story; I'll let her tell it. But I did manage to pry her down to the corner Chinese restaurant, where I had deliciously heavy, glassy noodles in chicken broth. We talked of shoes and ships and sealing wax, went home, talked some more. Her diabetes was beating her up hard; she's quite the fighter, but the stress of the day hadn't helped and we just spent the evening sipping tea and talking.
I didn't sleep well at all. Between Kouryou-chan and just a general insomnia I was up and down most of the night. Annoying.
By the way, there's an interesting paper called The Blogging Iceberg, which finds that of the 4.12 million blogs on the Internet, 66% have been idle for at least two months and 26% were one-day wonders in which the poster never blogged again (those figures overlap). Of the 1.63 million blogs that weren't mere tryouts, the average blog time was 126 days. LiveJournal has the lowest abandonment rate of any journaling service. Entries in abandoned blogs tended to be less than two-thirds as long as those of people who still maintain blogs (i.e., maintainers are people who like to write). Women were more likely to stick with their blogs.
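For anyone who wants to check how those percentages turn into head counts, here's a rough sketch in Python. It rests on one assumption of mine that the figures quoted here don't spell out: that every one-day wonder is also counted among the idle blogs. The function and variable names are made up for illustration.

```python
# Figures as quoted above from The Blogging Iceberg.
TOTAL_BLOGS = 4_120_000
IDLE_SHARE = 0.66      # idle for at least two months
ONE_DAY_SHARE = 0.26   # one-day wonders (poster never blogged again)


def blog_counts(total, idle_share, one_day_share):
    """Turn the quoted percentages into approximate blog counts."""
    idle = round(total * idle_share)
    one_day = round(total * one_day_share)
    return {
        "idle": idle,
        "one_day": one_day,
        # Assumption (mine): one-day wonders are a subset of the idle blogs.
        "idle_not_one_day": idle - one_day,
        "not_idle": total - idle,
    }


counts = blog_counts(TOTAL_BLOGS, IDLE_SHARE, ONE_DAY_SHARE)
for name, value in counts.items():
    print(f"{name}: {value / 1e6:.2f} million")
```

Under that assumption, the idle blogs that were more than one-day tryouts come to roughly 1.65 million. That is close to the 1.63 million quoted, with the gap plausibly down to the percentages being rounded, so that may well be the group the 126-day average describes.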
I like the survey because it's not about InstaPundit, Dynamic, Sullivan, and Drudge. It's about the folks with nanoaudiences, who use journals as a short circuit for keeping their friends up to date about what's happening in their lives.
What this says to me is that too many whites are getting away with drug use. Too many whites are getting away with drug sales. Too many whites are getting away with trafficking in this stuff. The answer to this disparity is not to start letting people out of jail because we're not putting others in jail who are breaking the law. The answer is to go out and find the ones who are getting away with it, convict them and send them up the river, too.
-- Rush Limbaugh, October 5, 1999.
no subject
Date: 2003-10-14 04:45 pm (UTC)
That tended to be a bit of a footnote, eh?
I did find that plugging the first chapter of a book I'd written gave correct results...
no subject
Date: 2003-10-14 04:59 pm (UTC)
Um, I work in Corvallis, so if you really want the paper, let me know where to find it and how to get it to you.
FWIW, I ran passages of my fiction through a couple of variants on the gender-guesser, and it all came out male, male, male.