Dear Code Fairy:
Sep. 7th, 2010 09:28 am
I understand that your sudden fascination with Latent Dirichlet Allocation corresponds orthogonally to your idea of normalized HTML document databases with XPath, but no, you may not resurrect Project ToXIC (Terabytes of XML, Indexed, Compressed). We don't have time.
But damn, LDA is cool. So much cooler than the TF/IDF (Term Frequency / Inverse Document Frequency) stuff we were doing back in '91-'92. And I still have a copy of the MG software package somewhere. For a document store, LDA would be golden. For a text-backed forum like Usenet, it would be platinum.
Whoa, there's a paper on LDA for Tag Normalization. OMG!
Dammit, I have work to do. Must. Not. Geek. Out!
no subject
Date: 2010-09-08 12:56 am (UTC)

LDA is an entire 'nother beast. It attempts to discern "topics" out of "word pools" in every document: given the density of words W in document D, what is the probability that the writer was thinking about topic T when he or she wrote the document? Topic T is just an index in a database, not a real concept as human beings understand it, but by creating these probabilistic topic pools you get much more powerful and accurate indices, leading the search engine to conclude, "Given search terms A and B, you're probably looking for documents of topic Tx." But computing "the cosine of the angle between two words' TF/IDF vectors in document Dy" for all words appearing in the corpora being indexed is very computationally expensive.
I think the Journal Entries would be easier to index if I took out all the stop words, though...
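The TF/IDF-with-stop-words idea above can be sketched in a few lines of plain Python. This is a toy illustration, not anything from MG or a real indexer: the tiny stop-word list and the corpus in the usage note are made up, and real systems weight and smooth these scores more carefully.

```python
# Toy TF/IDF vectors with stop-word removal, then cosine similarity between
# documents. The STOP_WORDS set is an illustrative stand-in, not a real list.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def tokenize(text):
    # Lowercase, split on whitespace, drop stop words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_vectors(docs):
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # df[w] = number of documents containing word w
    df = Counter(w for counts in tokenized for w in counts)
    vectors = []
    for counts in tokenized:
        total = sum(counts.values())
        # tf * idf; a word in every document gets weight 0
        vectors.append({w: (c / total) * math.log(n / df[w])
                        for w, c in counts.items()})
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

With a corpus like `["the cat sat on the mat", "the cat ate the fish", "stock markets fell today"]`, the first two documents come out more similar to each other than either is to the third, which is the whole point of the weighting.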
no subject
Date: 2010-09-08 01:37 am (UTC)

(but erm... what's a stop word?)
However, it does sound like the cosine-calculating part would parallelize very well, assuming that each cosine can be calculated without dependency on any other cosine.
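That assumption holds: each pairwise cosine reads its two vectors and writes nothing shared, so the whole batch is embarrassingly parallel. A rough sketch with a worker pool (toy vectors, illustrative only; for CPU-bound work a process pool rather than threads would be needed for a real speedup in Python):

```python
# All pairwise cosines computed independently across a worker pool.
# Each task touches only its own pair of vectors, so no locking is needed.
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def all_pairs_cosine(vectors):
    pairs = list(combinations(range(len(vectors)), 2))
    with ThreadPoolExecutor() as pool:
        sims = pool.map(lambda ij: cosine(vectors[ij[0]], vectors[ij[1]]),
                        pairs)
    return dict(zip(pairs, sims))
```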