From: Jon Stewart
To: java-user@lucene.apache.org
Date: Thu, 17 Jan 2013 10:11:43 -0500
Subject: Re: Document term vectors in Lucene 4

D'oh!!!! Thanks!

Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
looks like it empirically, but the documentation refers to corpus
usage, not document.field usage.

Jon

On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote:
> typo time. You need doc2.add(...) not 2 doc.add(...) statements.
>
>
> --
> Ian.
>
>
> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart
> wrote:
>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
>>> Which statistics in particular (which methods)?
>>
>> I'd like to know the frequency of each term in each document. Those
>> term counts for the most frequent terms in the corpus will make it
>> into the document vectors for clustering.
>>
>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>> how to do this.
>> Iterating over the TermsEnums in a Terms retrieved by
>> IndexReader.getTermVector() will tell me about the presence of a term
>> within a document, but I don't see a simple "count" or "freq" method
>> in TermsEnum--the methods there look like corpus statistics.
>>
>> Based on Ian's reply, I created the following one-file test program.
>> The results I get are weird: I get a term vector back for the first
>> document, but not for the second.
>>
>> Output:
>> doc 0 had term 'baz'
>> doc 0 had term 'foobar'
>> doc 0 had term 'gibberish'
>> doc 0 had 3 terms
>> doc 1 had no term vector for body
>>
>> Thanks again for the responses and assistance.
>>
>>
>> Jon
>>
>>
>> import java.io.File;
>> import java.io.IOException;
>>
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>> import org.apache.lucene.index.IndexWriterConfig;
>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>> import org.apache.lucene.index.CorruptIndexException;
>> import org.apache.lucene.index.AtomicReader;
>> import org.apache.lucene.index.IndexableField;
>> import org.apache.lucene.index.Terms;
>> import org.apache.lucene.index.TermsEnum;
>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>> import org.apache.lucene.index.DirectoryReader;
>>
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>>
>> import org.apache.lucene.util.BytesRef;
>> import org.apache.lucene.util.Version;
>>
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.document.StringField;
>> import org.apache.lucene.document.FieldType;
>>
>> public class LuceneTest {
>>
>>   static void createIndex(final String path)
>>       throws IOException, CorruptIndexException {
>>     final Directory dir = FSDirectory.open(new File(path));
>>     final StandardAnalyzer analyzer =
>>         new StandardAnalyzer(Version.LUCENE_40);
>>     final IndexWriterConfig iwc =
>>         new IndexWriterConfig(Version.LUCENE_40, analyzer);
>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>     iwc.setRAMBufferSizeMB(256.0);
>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>
>>     final FieldType bodyOptions = new FieldType();
>>     bodyOptions.setIndexed(true);
>>     bodyOptions.setIndexOptions(
>>         IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>     bodyOptions.setStored(true);
>>     bodyOptions.setStoreTermVectors(true);
>>     bodyOptions.setTokenized(true);
>>
>>     final Document doc = new Document();
>>     doc.add(new Field("body", "this foobar is gibberish, baz",
>>         bodyOptions));
>>     writer.addDocument(doc);
>>
>>     final Document doc2 = new Document();
>>     doc.add(new Field("body", "I don't know what to tell you, qux. "
>>         + "Some foobar is just fubar.", bodyOptions));
>>     writer.addDocument(doc2);
>>
>>     writer.close();
>>   }
>>
>>   static void readIndex(final String path)
>>       throws IOException, CorruptIndexException {
>>     final DirectoryReader dirReader =
>>         DirectoryReader.open(FSDirectory.open(new File(path)));
>>     final SlowCompositeReaderWrapper rdr =
>>         new SlowCompositeReaderWrapper(dirReader);
>>
>>     int max = rdr.maxDoc();
>>
>>     TermsEnum term = null;
>>     // iterate docs
>>     for (int i = 0; i < max; ++i) {
>>       // get term vector for body field
>>       final Terms terms = rdr.getTermVector(i, "body");
>>       if (terms != null) {
>>         // count terms in doc
>>         int numTerms = 0;
>>         term = terms.iterator(term);
>>         while (term.next() != null) {
>>           System.out.println("doc " + i + " had term '"
>>               + term.term().utf8ToString() + "'");
>>           ++numTerms;
>>
>>           // would like to record doc term frequencies here, i.e.,
>>           // counts[i][term.term()] = term.freq()
>>         }
>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>       }
>>       else {
>>         System.err.println("doc " + i + " had no term vector for body");
>>       }
>>     }
>>   }
>>
>>   public static void main(String[] args)
>>       throws IOException, InterruptedException, CorruptIndexException {
>>     final String path = args[0];
>>     createIndex(path);
>>     readIndex(path);
>>   }
>> }
>>
>> --
>> Jon Stewart, Principal
>> (646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>

--
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
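
The thread turns on the difference between a term's frequency within one
document and its frequency across the whole corpus. The sketch below
illustrates that distinction in plain Java on the two sample strings from the
test program. It does not use Lucene at all: the class name TermFreqSketch,
the methods docTermFreqs and corpusTermFreqs, and the crude tokenizer
(lowercase, split on non-letters, no stop-word removal) are assumptions made
for illustration, so the terms it produces will differ from what
StandardAnalyzer emits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermFreqSketch {

  // Per-document counts: how often each term occurs in ONE document.
  // Tokenization here is a simplifying assumption, not Lucene's analysis.
  static Map<String, Integer> docTermFreqs(String text) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String tok : text.toLowerCase().split("[^a-z]+")) {
      if (!tok.isEmpty()) {
        counts.merge(tok, 1, Integer::sum);
      }
    }
    return counts;
  }

  // Corpus-wide counts: per-document counts summed over all documents.
  static Map<String, Integer> corpusTermFreqs(List<String> docs) {
    Map<String, Integer> totals = new TreeMap<>();
    for (String doc : docs) {
      for (Map.Entry<String, Integer> e : docTermFreqs(doc).entrySet()) {
        totals.merge(e.getKey(), e.getValue(), Integer::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList(
        "this foobar is gibberish, baz",
        "I don't know what to tell you, qux. Some foobar is just fubar.");

    for (int i = 0; i < docs.size(); ++i) {
      System.out.println("doc " + i + ": " + docTermFreqs(docs.get(i)));
    }
    System.out.println("corpus: " + corpusTermFreqs(docs));
  }
}
```

In this sketch "foobar" has a per-document count of 1 in each document and a
corpus count of 2, which is the kind of gap that distinguishes a document
vector entry from a corpus statistic.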