lucene-java-user mailing list archives

From Christoph Kiefer <kie...@ifi.unizh.ch>
Subject Re: TFIDF Implementation
Date Wed, 15 Dec 2004 09:08:16 GMT
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I wrote a
quick-and-dirty class to perform that task. At the moment it uses the
Jama.Matrix class to represent the similarity matrix. For show and
tell I have attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries (25000 x 75000)! I
don't think the current implementation is usable in this form. For that
reason I am also thinking of switching to MTJ
(http://www.math.uib.no/~bjornoh/mtj/).
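
To make the size problem concrete, here is a minimal sketch, not the
attached class. It assumes plain java.util collections instead of the
Lucene, Jama or MTJ APIs, and the names SparseTfIdf, denseBytes and build
are made up for illustration. It uses idf = ln(N / df), which is only one
common TF*IDF variant. The idea is to store one small map per document,
holding only the terms that actually occur in it, instead of a dense
docs x terms array.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Lucene, Jama or MTJ.
final class SparseTfIdf {

    /** Bytes needed by a dense docs x terms matrix of doubles. */
    static long denseBytes(long docs, long terms) {
        return docs * terms * Double.BYTES;
    }

    /**
     * Builds one sparse row per document (term -> tf * idf).
     * Each document is given as its list of (already tokenized) terms.
     * Assumption: idf = ln(N / df), with N the number of documents.
     */
    static List<Map<String, Double>> build(List<List<String>> docs) {
        int n = docs.size();
        List<Map<String, Integer>> termFreqs = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();

        // First pass: term frequency per document, document frequency per term.
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termFreqs.add(tf);
        }

        // Second pass: weight only the terms that occur in each document.
        List<Map<String, Double>> rows = new ArrayList<>();
        for (Map<String, Integer> tf : termFreqs) {
            Map<String, Double> row = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                row.put(e.getKey(), e.getValue() * idf);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

With the numbers above, denseBytes(25000, 75000) comes to 15'000'000'000
bytes, i.e. roughly 15 GB for a dense matrix of doubles, whereas the
sparse rows only grow with the number of distinct terms per document.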

What do you think?

Best,
Christoph

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  kiefer@ifi.unizh.ch
Web:    http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
