elfs: (Default)
[personal profile] elfs
Dear Code Fairy:

I understand that your sudden fascination with Latent Dirichlet Allocation corresponds orthogonally to your idea of normalized HTML document databases with XPath, but no, you may not resurrect Project ToXIC (Terabytes of XML, Indexed, Compressed). We don't have time.

But damn, LDA is cool. So much cooler than the TF/IDF (Term Frequency / Inverse Document Frequency) stuff we were doing back in '91-'92. And I still have a copy of the MG software package somewhere. For a document store, LDA would be golden. For a text-backed forum like Usenet, it would be platinum.

Whoa, there's a paper on LDA for Tag Normalization. OMG!

Dammit, I have work to do. Must. Not. Geek. Out!

Date: 2010-09-07 11:11 pm (UTC)
From: [identity profile] funos.livejournal.com
I have to say that I am happy and amused to see someone else enthused about some bit of wizardry. :)

Date: 2010-09-08 12:08 am (UTC)
From: [identity profile] elfs.livejournal.com
I downloaded the reference implementation of LDA from one of the initial teams, and it's way slower than TF/IDF. TF/IDF of my website takes about 20 minutes; LDA would take about eight hours. Gack!

Date: 2010-09-08 12:14 am (UTC)
From: [identity profile] funos.livejournal.com
Implementation problem, or intrinsic algorithmic deficiency?

Date: 2010-09-08 12:15 am (UTC)
From: [identity profile] funos.livejournal.com
Addendum: I find this interesting, but I do computer architecture; I know almost jack about databases.

Date: 2010-09-08 12:56 am (UTC)
From: [identity profile] elfs.livejournal.com
It's definitely a matter of algorithmic complexity. TF/IDF is incredibly easy to understand: You make an index of every word (minus stopwords) in a document, and next to each word you note how often it appears in the document. You create a "weight" for each term as a ratio of that versus the document size, and then just do simple math. The stuff I worked on fifteen years ago was about how to visually present those ratios meaningfully to users, something I've never seen implemented in real life except in specialized medical databases.
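For the curious, the term-weighting idea above can be sketched in a few lines of Python. This is a minimal illustration and not the code from back then: the tiny `STOPWORDS` set, the `tokenize` helper, and the sample documents are all made up for the example, and the log-scaled inverse document frequency is the usual textbook form of TF/IDF rather than anything spelled out in this comment.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "a", "of", "and", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: how many documents contain each term?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty docs
        weights.append({
            # Term frequency (count / document size) times log-scaled IDF.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
w = tf_idf(docs)
```

Here "cat" appears in only one document and "sat" in two, so "cat" ends up with the larger weight in the first document.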

LDA is an entire 'nother beast. It attempts to discern "topics" out of "word pools" in every document: given the density of words W in document D, what is the probability that the writer was thinking about topic T when he/she wrote the document? Topic T is just an index in a database, not a real concept as human beings understand it, but by creating these probabilistic topic pools, you get much more powerful and accurate indices, leading the search engine to conclude that, "Given search terms A and B, you're probably looking for documents of topic Tx." But computing "the cosine of the vector between two words' TF/IDF appearing in document Dy" for "all words appearing in the corpus being indexed" is very computationally expensive.
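To make the "probability of topic T given document D" idea concrete, here is a toy LDA fit using collapsed Gibbs sampling. It is a sketch under stated assumptions, not the reference implementation I downloaded: Gibbs sampling is just one common way to fit LDA, and the function name `lda_gibbs`, the `alpha` and `beta` smoothing values, and the sample documents are all invented for the example.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of token lists; returns (theta, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per topic, per doc
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total tokens per topic
    assign = []                                            # topic of every token

    def bump(d, k, wi, delta):
        doc_topic[d][k] += delta
        topic_word[k][wi] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            bump(d, k, word_id[w], +1)

    # Repeatedly resample each token's topic given all the other tokens.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                bump(d, assign[d][i], wi, -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                bump(d, k, wi, +1)

    # theta[d][k]: smoothed estimate of P(topic k | document d).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab

docs = [["cat", "purr", "cat", "nap"], ["dog", "bark", "dog", "walk"],
        ["cat", "nap", "purr"], ["dog", "walk", "bark"]]
theta, topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

The inner loop runs once per iteration, per token, per topic, which gives a feel for why this costs so much more than a single counting pass over the corpus.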

I think the Journal Entries would be easier to index if I took out all the stop words, though...

Date: 2010-09-08 01:37 am (UTC)
From: [identity profile] funos.livejournal.com
Ah, thanks for the explanation!
(but erm...what's a stop word?)

However, it does sound like the cosine-calculating part would parallelize very, very well, assuming that each cosine can be calculated without dependency on any other cosine.
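A rough sketch of what I mean, assuming each document is already a sparse {term: weight} dictionary. The names `cosine`, `pair_similarity`, and `all_pairs` are made up for illustration, and the `map_fn` parameter is my own assumption about how one might plug in a parallel map (for instance the `map` method of a `concurrent.futures` executor) in place of the builtin.

```python
import math
from functools import partial
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pair_similarity(vectors, pair):
    """Top-level worker so it can be handed to a pool's map."""
    i, j = pair
    return cosine(vectors[i], vectors[j])

def all_pairs(vectors, map_fn=map):
    """Compute every pairwise cosine; no pair depends on any other."""
    pairs = list(combinations(range(len(vectors)), 2))
    sims = map_fn(partial(pair_similarity, vectors), pairs)
    return dict(zip(pairs, sims))

vectors = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 2.0}]
sims = all_pairs(vectors)
```

Since each call only reads the shared vectors and writes nothing, the work can be split across as many workers as there are pairs.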

Date: 2010-09-08 02:23 am (UTC)
From: [identity profile] elfs.livejournal.com
A "stop word" is a word you throw away in the preprocessing stage because it's not likely to be meaningful in any direct sense, the "filler" of the English language. A classic short list: i, a, an, are, as, at, be, by, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where. Most lists are much longer.
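In code, that preprocessing step can be as small as the following sketch. The word list is the short one quoted above; the `remove_stop_words` name and the punctuation stripping are just choices made for the example.

```python
import string

# The classic short list quoted above.
STOP_WORDS = {
    "i", "a", "an", "are", "as", "at", "be", "by", "for", "from", "how",
    "in", "is", "it", "of", "on", "or", "that", "the", "this", "to",
    "was", "what", "when", "where",
}

def remove_stop_words(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

kept = remove_stop_words("What is the probability of a topic?")
```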

Date: 2010-09-08 02:46 pm (UTC)
From: [identity profile] en-ki.livejournal.com
Hm. You ought to be able to take a sample, grind LDA over the sample, and then do LDA over the larger corpus indexing only words that met a threshold of interestingness in the sample.
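Roughly, as a sketch: every name here is hypothetical, and the scoring function is only a stand-in, since a real version would take its "interestingness" scores from the LDA run over the sample. The stand-in counts how many sampled documents contain a word and zeroes out words that appear in all of them.

```python
import random
from collections import Counter

def doc_frequency_score(sample):
    """Stand-in scorer (an assumption): sample document frequency,
    with words present in every sampled document scored zero."""
    df = Counter()
    for doc in sample:
        df.update(set(doc))
    return {w: (0 if n == len(sample) else n) for w, n in df.items()}

def interesting_vocabulary(docs, sample_size, threshold,
                           score_words=doc_frequency_score, seed=0):
    """Score words on a random sample and keep those meeting the threshold."""
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    scores = score_words(sample)
    return {w for w, s in scores.items() if s >= threshold}

def prune(docs, vocabulary):
    """Drop every token outside the chosen vocabulary before the full run."""
    return [[w for w in doc if w in vocabulary] for doc in docs]

docs = [["the", "cat", "purrs"], ["the", "cat", "naps"],
        ["the", "dog", "barks"], ["the", "dog", "naps"]]
vocab = interesting_vocabulary(docs, sample_size=4, threshold=2)
pruned = prune(docs, vocab)
```

The full LDA pass would then run over `pruned`, which has a much smaller vocabulary than the original corpus.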
