Date: Thu, 15 Nov 2001 23:49:07 -0800 (Pacific Standard Time)
From: Joshua O'Madadhain
To: lucene-user@jakarta.apache.org
Subject: extracting information from an index

I am trying to construct a term-term correlation matrix from the data stored in the index, for an extension to the vector model that I am researching.

In case my terminology is unfamiliar, what I need in order to do this is, for each term t, a list of those documents which contain t (also having a record of the number of times that t occurs in each would be a nice bonus). From this I can calculate the rest of what I need (number of times that terms t1 and t2 occur in the same document, etc.).

If necessary I could squeeze by with just knowing the number of documents in which t1, t2, and the combination (t1 AND t2) appear, but having the above information from which to work would give me more flexibility.

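
For concreteness, the computation described above can be sketched as follows. This is a minimal illustration in Python, not Lucene code: the in-memory `docs` dictionary, the whitespace tokenizer, and the function names `build_postings` and `cooccurrence_counts` are all assumptions made for the example. It builds, for each term, the list of documents containing it (with per-document counts), then derives the number of documents in which each pair of terms appears together.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Map each term to {doc_id: occurrences of the term in that document}.

    Hypothetical stand-in for the per-term document lists an index would hold.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return postings


def cooccurrence_counts(postings):
    """For each unordered term pair, count the documents containing both terms."""
    counts = {}
    for t1, t2 in combinations(sorted(postings), 2):
        shared = postings[t1].keys() & postings[t2].keys()
        if shared:
            counts[(t1, t2)] = len(shared)
    return counts


# Toy collection, invented for illustration only.
docs = {
    1: "apple banana apple",
    2: "banana cherry",
    3: "apple cherry banana",
}

postings = build_postings(docs)
counts = cooccurrence_counts(postings)
```

The document frequency of a single term is simply `len(postings[t])`, so the fallback mentioned above (counts for t1, t2, and t1 AND t2) falls out of the same structure.
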
Anyway, if there is a straightforward way of doing this that I have not yet spotted, I'd like to know what it is; if not, pointing me at the appropriate chunks of the source to start hacking on would also be appreciated.

Thanks in advance for any help that may be offered.

Regards,

Joshua O'Madadhain (Madden)

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.