lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: similarity matrix - more clear
Date Tue, 30 Nov 2004 22:48:06 GMT
: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.

A simpler approach that I can think of would be to iterate over a complete
TermEnum of the index and, for each Term, get the corresponding TermDocs
enumerator to list every document that contains that term.  Assuming that
every pair of docs initially has a similarity of 0, this would allow you
to increment the similarity of each pair every time you find a term that
multiple docs have in common.  (The amount you increment the score for
each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

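   import java.util.*;
   import org.apache.lucene.index.*;

   IndexReader r = ...;
   int[][] scores = new int[r.maxDoc()][r.maxDoc()];
   TermEnum enumerator = r.terms();
   TermDocs termDocs = r.termDocs();
   do {
      Term term = enumerator.term();
      if (term != null) {
         // position the TermDocs enumerator on this term
         termDocs.seek(term);
         // doc id -> frequency of this term in that doc
         Map<Integer, Integer> docs = new HashMap<Integer, Integer>();
         while (termDocs.next()) {
            docs.put(termDocs.doc(), termDocs.freq());
         }
         for (Iterator<Integer> i = docs.keySet().iterator(); i.hasNext();) {
            int ii = i.next();
            for (Iterator<Integer> j = docs.keySet().iterator(); j.hasNext();) {
               int jj = j.next();
               if (ii < jj) {
                  continue; // do each pair only once
               }
               // bump the pair by the mean of the two term frequencies
               scores[jj][ii] += (docs.get(ii) + docs.get(jj)) / 2;
            }
         }
      }
   } while (enumerator.next());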
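
For anyone who wants to experiment with the idea without an index at hand,
here is a rough sketch of the same accumulation in plain Java.  It is only an
illustration under assumptions: the class name PairwiseTermOverlap, the
in-memory postings map (term -> doc id -> frequency, standing in for what
TermEnum and TermDocs would report) and the log-based weight are all
hypothetical and not part of Lucene or of the approach above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the TermEnum/TermDocs walk; no Lucene dependency. */
final class PairwiseTermOverlap {

    /**
     * postings maps term -> (doc id -> within-document frequency).
     * Returns an upper-triangular matrix: scores[lo][hi] with lo < hi.
     */
    static double[][] score(Map<String, Map<Integer, Integer>> postings, int maxDoc) {
        double[][] scores = new double[maxDoc][maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            int docFreq = docs.size();          // plays the role of TermEnum.docFreq()
            if (docFreq < 2) {
                continue;                       // a term in a single doc links no pair
            }
            // Assumed weighting (not from the original post): rarer terms count more.
            double weight = Math.log((double) maxDoc / docFreq) + 1.0;
            Integer[] ids = docs.keySet().toArray(new Integer[0]);
            for (int a = 0; a < ids.length; a++) {
                for (int b = a + 1; b < ids.length; b++) {   // each pair exactly once
                    int lo = Math.min(ids[a], ids[b]);
                    int hi = Math.max(ids[a], ids[b]);
                    // mean of the two frequencies, as in the loop above
                    double tf = (docs.get(lo) + docs.get(hi)) / 2.0;
                    scores[lo][hi] += weight * tf;
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        Map<Integer, Integer> lucene = new HashMap<>();
        lucene.put(0, 2);   // doc 0 contains "lucene" twice
        lucene.put(1, 4);   // doc 1 contains "lucene" four times
        postings.put("lucene", lucene);
        double[][] scores = score(postings, 3);
        System.out.println("similarity(0, 1) = " + scores[0][1]);
    }
}
```

Because each term's posting list is visited once and each pair within it is
touched once, no pair is rescored per query, which is the redundancy the
search-per-document approach suffers from.  The matrix is still maxDoc x
maxDoc, so this only suits small indexes.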
