Subject: Re: Document term vectors in Lucene 4
From: Jon Stewart <jon@lightboxtechnologies.com>
To: java-user@lucene.apache.org
Date: Fri, 18 Jan 2013 10:45:35 -0500

Thanks! I still can't see what was wrong with my original code--must have
been a dumb typo somewhere--but starting over from that example now works
on indices generated from my real indexing code. I will try to blog about
it next week so there is some sample code up on the web for anyone else
searching for how to do something similar.

I did not know about MultiFields, but yes, that seems to get rid of the
need for the SlowCompositeReaderWrapper. I really doubt
SlowCompositeReaderWrapper would be all that slow for my purposes,
though; I care more about indexing speed than ultra-fast query responses.
With multithreaded indexing, Lucene 4 seems to be able to index files
about as fast as I can read them in from disk, even including Tika text
extraction.
Kudos.

Jon

On Fri, Jan 18, 2013 at 6:12 AM, Ian Lea wrote:
> To get stats from the whole index I think you need to come at this
> from a different direction. See the 4.0 migration guide for some
> details.
>
> With a variation on your code and 2 docs
>
> doc1: foobar qux quote
> doc2: foobar qux qux quorum
>
> this code snippet
>
> Fields fields = MultiFields.getFields(rdr);
> Terms terms = fields.terms("body");
> TermsEnum te = terms.iterator(null);
> while (te.next() != null) {
>   String tt = te.term().utf8ToString();
>   System.out.printf("%s totalTermFreq()=%s, docFreq=%s\n",
>       tt,
>       te.totalTermFreq(),
>       te.docFreq());
> }
>
> displays
>
> foobar totalTermFreq()=2, docFreq=2
> quorum totalTermFreq()=1, docFreq=1
> quote totalTermFreq()=1, docFreq=1
> qux totalTermFreq()=3, docFreq=2
>
> This is with a standard IndexReader as returned by
> DirectoryReader.open(dir), on a RAMDirectory with 2 docs so there
> won't be many segments. But from my reading of the migration guide
> you shouldn't need to use the Composite reader.
>
> Hope this helps - we are getting outside my area of expertise so don't
> trust anything I say.
>
> --
> Ian.
>
> On Thu, Jan 17, 2013 at 3:11 PM, Jon Stewart wrote:
>> D'oh!!!! Thanks!
>>
>> Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
>> looks like it empirically, but the documentation refers to corpus
>> usage, not document.field usage.
>>
>> Jon
>>
>> On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote:
>>> Typo time. You need doc2.add(...), not 2 doc.add(...) statements.
>>>
>>> --
>>> Ian.
>>>
>>> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart wrote:
>>>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
>>>>> Which statistics in particular (which methods)?
>>>>
>>>> I'd like to know the frequency of each term in each document. Those
>>>> term counts for the most frequent terms in the corpus will make it
>>>> into the document vectors for clustering.
>>>>
>>>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>>>> how to do this. Iterating over the TermsEnums in a Terms retrieved by
>>>> IndexReader.getTermVector() will tell me about the presence of a term
>>>> within a document, but I don't see a simple "count" or "freq" method
>>>> in TermsEnum--the methods there look like corpus statistics.
>>>>
>>>> Based on Ian's reply, I created the following one-file test program.
>>>> The results I get are weird: I get a term vector back for the first
>>>> document, but not for the second.
>>>>
>>>> Output:
>>>> doc 0 had term 'baz'
>>>> doc 0 had term 'foobar'
>>>> doc 0 had term 'gibberish'
>>>> doc 0 had 3 terms
>>>> doc 1 had no term vector for body
>>>>
>>>> Thanks again for the responses and assistance.
>>>>
>>>> Jon
>>>>
>>>> import java.io.File;
>>>> import java.io.IOException;
>>>>
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>
>>>> import org.apache.lucene.index.IndexWriter;
>>>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>>>> import org.apache.lucene.index.IndexWriterConfig;
>>>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>>>> import org.apache.lucene.index.CorruptIndexException;
>>>> import org.apache.lucene.index.AtomicReader;
>>>> import org.apache.lucene.index.IndexableField;
>>>> import org.apache.lucene.index.Terms;
>>>> import org.apache.lucene.index.TermsEnum;
>>>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>>>> import org.apache.lucene.index.DirectoryReader;
>>>>
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.store.FSDirectory;
>>>>
>>>> import org.apache.lucene.util.BytesRef;
>>>> import org.apache.lucene.util.Version;
>>>>
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.document.StringField;
>>>> import org.apache.lucene.document.FieldType;
>>>>
>>>> public class LuceneTest {
>>>>
>>>>   static void createIndex(final String path) throws IOException,
>>>>       CorruptIndexException {
>>>>     final Directory dir = FSDirectory.open(new File(path));
>>>>     final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>>>>     final IndexWriterConfig iwc = new
>>>>         IndexWriterConfig(Version.LUCENE_40, analyzer);
>>>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>>>     iwc.setRAMBufferSizeMB(256.0);
>>>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>>>
>>>>     final FieldType bodyOptions = new FieldType();
>>>>     bodyOptions.setIndexed(true);
>>>>     bodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>>>     bodyOptions.setStored(true);
>>>>     bodyOptions.setStoreTermVectors(true);
>>>>     bodyOptions.setTokenized(true);
>>>>
>>>>     final Document doc = new Document();
>>>>     doc.add(new Field("body", "this foobar is gibberish, baz", bodyOptions));
>>>>     writer.addDocument(doc);
>>>>
>>>>     final Document doc2 = new Document();
>>>>     doc.add(new Field("body", "I don't know what to tell you, qux.
>>>> Some foobar is just fubar.", bodyOptions));
>>>>     writer.addDocument(doc2);
>>>>
>>>>     writer.close();
>>>>   }
>>>>
>>>>   static void readIndex(final String path) throws IOException,
>>>>       CorruptIndexException {
>>>>     final DirectoryReader dirReader =
>>>>         DirectoryReader.open(FSDirectory.open(new File(path)));
>>>>     final SlowCompositeReaderWrapper rdr = new
>>>>         SlowCompositeReaderWrapper(dirReader);
>>>>
>>>>     int max = rdr.maxDoc();
>>>>
>>>>     TermsEnum term = null;
>>>>     // iterate docs
>>>>     for (int i = 0; i < max; ++i) {
>>>>       // get term vector for body field
>>>>       final Terms terms = rdr.getTermVector(i, "body");
>>>>       if (terms != null) {
>>>>         // count terms in doc
>>>>         int numTerms = 0;
>>>>         term = terms.iterator(term);
>>>>         while (term.next() != null) {
>>>>           System.out.println("doc " + i + " had term '" +
>>>>               term.term().utf8ToString() + "'");
>>>>           ++numTerms;
>>>>
>>>>           // would like to record doc term frequencies here, i.e.,
>>>>           // counts[i][term.term()] = term.freq()
>>>>         }
>>>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>>>       }
>>>>       else {
>>>>         System.err.println("doc " + i + " had no term vector for body");
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>>   public static void main(String[] args) throws IOException,
>>>>       InterruptedException, CorruptIndexException {
>>>>     final String path = args[0];
>>>>     createIndex(path);
>>>>     readIndex(path);
>>>>   }
>>>> }
>>>>
>>>> --
>>>> Jon Stewart, Principal
>>>> (646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org

--
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
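
[Editor's note] The per-document count Jon is after is not on TermsEnum: totalTermFreq() and docFreq() are corpus-level statistics. A term vector, however, acts as an inverted index over a single document, so you can pull a DocsEnum from the TermsEnum and read freq() from its one posting. Below is a minimal sketch against the Lucene 4.x API in the style of the thread's code; the field name "body" is carried over from the example above, the class name is invented, and the code is an untested illustration rather than a drop-in patch:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;

public class TermVectorFreqs {
  public static void main(String[] args) throws IOException {
    final IndexReader rdr =
        DirectoryReader.open(FSDirectory.open(new File(args[0])));
    DocsEnum docs = null;  // reused across terms to avoid reallocation
    for (int i = 0; i < rdr.maxDoc(); ++i) {
      final Terms terms = rdr.getTermVector(i, "body");
      if (terms == null) {
        continue;  // this doc stored no term vector for the field
      }
      final TermsEnum te = terms.iterator(null);
      while (te.next() != null) {
        // The term vector is an inverted index over a single document,
        // so its DocsEnum yields exactly one entry; freq() on that
        // entry is the within-document count of the current term.
        docs = te.docs(null, docs);
        if (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          System.out.println("doc " + i + " term '"
              + te.term().utf8ToString() + "' freq " + docs.freq());
        }
      }
    }
    rdr.close();
  }
}
```

Since getTermVector() takes a top-level docID, this also avoids SlowCompositeReaderWrapper entirely, which matches Ian's reading of the migration guide.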