elfs | Dear Code Fairy:

I understand that your sudden fascination with Latent Dirichlet Allocation corresponds orthogonally to your idea of normalized HTML document databases with XPath, but no, you may not resurrect Project ToXIC (Terabytes of XML, Indexed, Compressed). We don't have time.

But damn, LDA is cool. So much cooler than the TF/IDF (Term Frequency times Inverse Document Frequency) stuff we were doing back in '91-'92. And I still have a copy of the MG software package somewhere. For a document store, LDA would be golden. For a text-backed forum like Usenet, it would be platinum.

Whoa, there's a paper on LDA for Tag Normalization. OMG!

Dammit, I have work to do. Must. Not. Geek. Out!

Current Mood: geeky
Current Music: Phil Manzanera, East of Asteroid

From:

funos.livejournal.com

I have to say that I am happy and amused to see someone else enthused about some bit of wizardry. :)

From:

elfs.livejournal.com

I downloaded the reference implementation of LDA from one of the initial teams, and it's way slower than TF/IDF. TF/IDF of my website takes about 20 minutes; LDA would take about eight hours. Gack!

From:

funos.livejournal.com

Implementation problem, or intrinsic algorithmic deficiency?

From:

funos.livejournal.com

Addendum: I find this interesting, but I do computer architecture; I know almost jack about databases.

From:

elfs.livejournal.com

It's definitely a matter of algorithmic complexity. TF/IDF is incredibly easy to understand: You make an index of every word (minus stopwords) in a document, and next to each word you note how often it appears in the document. You create a "weight" for each term as a ratio of that versus the document size, and then just do simple math. The stuff I worked on fifteen years ago was about how to visually present those ratios meaningfully to users, something I've never seen implemented in real life except in specialized medical databases.
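For the curious, the weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from back then: the function name `tf_idf` is made up, the logarithmic IDF is just the common textbook choice, and the input is assumed to be token lists with stopwords already removed.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (assumed: stopwords already removed).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        weights.append({
            # Term frequency (count / document size) scaled by log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that shows up in every document gets a weight of zero, which is the whole point: it tells you nothing about which document you want.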

LDA is an entire 'nother beast. It attempts to discern "topics" out of "word pools" in every document: given the density of words W in document D, what is the probability that the writer was thinking about topic T when he/she wrote the document? Topic T is just an index in a database, not a real concept as human beings understand it, but by creating these probabilistic topic pools, you get much more powerful and accurate indices, leading the search engine to conclude that, "Given search terms A and B, you're probably looking for documents of topic T_x." But creating these "the cosine of the vector between two words' TF/IDF appearing in document D_y" for "all words appearing in the corpus being indexed" is very computationally expensive.
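To give a feel for the "which topic was the writer thinking about" idea, here is a minimal sketch of one common way to fit LDA, collapsed Gibbs sampling. It is an assumption for illustration only: it is not necessarily the method the reference implementation uses, and `lda_gibbs`, `alpha`, `beta` and the defaults are all hypothetical names and values.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of token lists.

    Returns (doc_topic counts, topic_word counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens in doc d on topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is on topic k
    topic_total = [0] * n_topics                           # tokens on topic k overall
    assign = []                                            # current topic of every token

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assign[d][i]
                # Take this token out of the counts...
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # ...score each topic given every other token's assignment...
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                # ...and draw a new topic in proportion to those scores.
                k = rng.choices(range(n_topics), weights=probs)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab
```

Even in this toy version the cost is visible: every pass touches every token in the corpus and scores every topic for it, and you need many passes. TF/IDF, by comparison, is one counting pass.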

I think the Journal Entries would be easier to index if I took out all the stop words, though...

From:

funos.livejournal.com

Ah, thanks for the explanation!
(but erm...what's a stop word?)

However, it does sound like the cosine-calculating part would parallelize very, very well, assuming that each cosine can be calculated without dependency on any other cosine.
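A rough sketch of that independence assumption, in Python: each pair's cosine reads only its own two vectors, so the pairs can be handed to a pool of workers in any order. The names `cosine` and `all_pairs_cosine` and the sparse `{term: weight}` representation are assumptions for illustration, not anything from an actual indexer.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def all_pairs_cosine(vectors, max_workers=4):
    """Compute the cosine for every unordered pair of vectors."""
    pairs = list(combinations(range(len(vectors)), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # No pair depends on another pair's result, so map() can farm them out freely.
        sims = pool.map(lambda p: cosine(vectors[p[0]], vectors[p[1]]), pairs)
        return dict(zip(pairs, sims))
```

Threads are used here only to keep the sketch short. In CPython they will not speed up pure-Python arithmetic, so a process pool or native code would be the usual swap for real gains. Note also that the number of pairs still grows quadratically, so parallelism divides the work but does not shrink it.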

From:

elfs.livejournal.com

A "stop word" is a word you throw away in the preprocessing stage because it's not likely to be meaningful in any direct sense, the "filler" of the English language. A classic short list: i, a, an, are, as, at, be, by, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where. Most lists are much longer.
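As a minimal sketch of that preprocessing step, using exactly the short list above: the tokenizing regex and the name `remove_stop_words` are assumptions for illustration, and a real indexer would likely use a longer list and smarter tokenization.

```python
import re

# The classic short list quoted above.
STOP_WORDS = frozenset(
    "i a an are as at be by for from how in is it of on or "
    "that the this to was what when where".split()
)

def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

So "The cosine of the vector is expensive" comes out as just `cosine`, `vector`, `expensive`.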

From:

en-ki.livejournal.com

Hm. You ought to be able to take a sample, grind LDA over the sample, and then do LDA over the larger corpus indexing only words that met a threshold of interestingness in the sample.
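A hedged sketch of the shape of that idea, in Python. "Interestingness" is not defined above, so this stands in a much cruder measure than grinding LDA over the sample: a word counts as interesting if it appears in at least `min_df` sampled documents but in no more than `max_df_ratio` of them. The function names and thresholds are hypothetical.

```python
import random
from collections import Counter

def interesting_vocab(docs, sample_size, min_df=2, max_df_ratio=0.5, seed=0):
    """Pick a vocabulary from a random sample of docs (lists of tokens)."""
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    # Count how many sampled documents contain each term.
    df = Counter(term for tokens in sample for term in set(tokens))
    limit = max_df_ratio * len(sample)
    # Stand-in for "interesting": not too rare, not too common.
    return {term for term, n in df.items() if min_df <= n <= limit}

def restrict(docs, vocab):
    """Keep only the chosen vocabulary in every document of the full corpus."""
    return [[t for t in tokens if t in vocab] for tokens in docs]
```

The expensive pass would then run over `restrict(all_docs, vocab)`, which has far fewer tokens to grind through than the raw corpus.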


Profile

Elf Sternberg
