Subject: Re: Document term vectors in Lucene 4
From: Jon Stewart <jon@lightboxtechnologies.com>
To: java-user@lucene.apache.org
Date: Fri, 18 Jan 2013 10:45:35 -0500

Thanks! I still can't see what was wrong with my original code--must have
been a dumb typo somewhere--but starting over from that example now works
on indices generated from my real indexing code. I will try to blog about
it next week so there is some sample code up on the web for anyone else
searching for how to do something similar.

I did not know about MultiFields, but yes, that seems to get rid of the
need for the SlowCompositeReaderWrapper. I really doubt
SlowCompositeReaderWrapper would be all that slow for my purposes,
though; I care more about indexing speed than ultra-fast query responses.
With multithreaded indexing, Lucene 4 seems to be able to index files
about as fast as I can read them in from disk, even including Tika text
extraction.
Kudos.

Jon

On Fri, Jan 18, 2013 at 6:12 AM, Ian Lea wrote:
> To get stats from the whole index I think you need to come at this
> from a different direction. See the 4.0 migration guide for some
> details.
>
> With a variation on your code and 2 docs
>
> doc1: foobar qux quote
> doc2: foobar qux qux quorum
>
> this code snippet
>
> Fields fields = MultiFields.getFields(rdr);
> Terms terms = fields.terms("body");
> TermsEnum te = terms.iterator(null);
> while (te.next() != null) {
>   String tt = te.term().utf8ToString();
>   System.out.printf("%s totalTermFreq()=%s, docFreq=%s\n",
>       tt,
>       te.totalTermFreq(),
>       te.docFreq());
> }
>
> displays
>
> foobar totalTermFreq()=2, docFreq=2
> quorum totalTermFreq()=1, docFreq=1
> quote totalTermFreq()=1, docFreq=1
> qux totalTermFreq()=3, docFreq=2
>
> This is with a standard IndexReader as returned by
> DirectoryReader.open(dir), on a RAMDirectory with 2 docs so there
> won't be many segments. But from my reading of the migration guide
> you shouldn't need to use the Composite reader.
>
> Hope this helps - we are getting outside my area of expertise so don't
> trust anything I say.
>
> --
> Ian.
>
> On Thu, Jan 17, 2013 at 3:11 PM, Jon Stewart wrote:
>> D'oh!!!! Thanks!
>>
>> Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
>> looks like it empirically, but the documentation refers to corpus
>> usage, not document.field usage.
>>
>> Jon
>>
>> On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea wrote:
>>> Typo time. You need doc2.add(...), not 2 doc.add(...) statements.
>>>
>>> --
>>> Ian.
>>>
>>> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart wrote:
>>>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir wrote:
>>>>> Which statistics in particular (which methods)?
>>>>
>>>> I'd like to know the frequency of each term in each document. Those
>>>> term counts for the most frequent terms in the corpus will make it
>>>> into the document vectors for clustering.
>>>>
>>>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>>>> how to do this. Iterating over the TermsEnums in a Terms retrieved by
>>>> IndexReader.getTermVector() will tell me about the presence of a term
>>>> within a document, but I don't see a simple "count" or "freq" method
>>>> in TermsEnum--the methods there look like corpus statistics.
>>>>
>>>> Based on Ian's reply, I created the following one-file test program.
>>>> The results I get are weird: I get a term vector back for the first
>>>> document, but not for the second.
>>>>
>>>> Output:
>>>> doc 0 had term 'baz'
>>>> doc 0 had term 'foobar'
>>>> doc 0 had term 'gibberish'
>>>> doc 0 had 3 terms
>>>> doc 1 had no term vector for body
>>>>
>>>> Thanks again for the responses and assistance.
>>>>
>>>> Jon
>>>>
>>>> import java.io.File;
>>>> import java.io.IOException;
>>>>
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>
>>>> import org.apache.lucene.index.IndexWriter;
>>>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>>>> import org.apache.lucene.index.IndexWriterConfig;
>>>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>>>> import org.apache.lucene.index.CorruptIndexException;
>>>> import org.apache.lucene.index.AtomicReader;
>>>> import org.apache.lucene.index.IndexableField;
>>>> import org.apache.lucene.index.Terms;
>>>> import org.apache.lucene.index.TermsEnum;
>>>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>>>> import org.apache.lucene.index.DirectoryReader;
>>>>
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.store.FSDirectory;
>>>>
>>>> import org.apache.lucene.util.BytesRef;
>>>> import org.apache.lucene.util.Version;
>>>>
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.document.StringField;
>>>> import org.apache.lucene.document.FieldType;
>>>>
>>>> public class LuceneTest {
>>>>
>>>>   static void createIndex(final String path) throws IOException,
>>>>       CorruptIndexException {
>>>>     final Directory dir = FSDirectory.open(new File(path));
>>>>     final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>>>>     final IndexWriterConfig iwc = new
>>>>         IndexWriterConfig(Version.LUCENE_40, analyzer);
>>>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>>>     iwc.setRAMBufferSizeMB(256.0);
>>>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>>>
>>>>     final FieldType bodyOptions = new FieldType();
>>>>     bodyOptions.setIndexed(true);
>>>>     bodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>>>     bodyOptions.setStored(true);
>>>>     bodyOptions.setStoreTermVectors(true);
>>>>     bodyOptions.setTokenized(true);
>>>>
>>>>     final Document doc = new Document();
>>>>     doc.add(new Field("body", "this foobar is gibberish, baz", bodyOptions));
>>>>     writer.addDocument(doc);
>>>>
>>>>     final Document doc2 = new Document();
>>>>     doc.add(new Field("body", "I don't know what to tell you, qux.
>>>> Some foobar is just fubar.", bodyOptions));
>>>>     writer.addDocument(doc2);
>>>>
>>>>     writer.close();
>>>>   }
>>>>
>>>>   static void readIndex(final String path) throws IOException,
>>>>       CorruptIndexException {
>>>>     final DirectoryReader dirReader =
>>>>         DirectoryReader.open(FSDirectory.open(new File(path)));
>>>>     final SlowCompositeReaderWrapper rdr = new
>>>>         SlowCompositeReaderWrapper(dirReader);
>>>>
>>>>     int max = rdr.maxDoc();
>>>>
>>>>     TermsEnum term = null;
>>>>     // iterate docs
>>>>     for (int i = 0; i < max; ++i) {
>>>>       // get term vector for body field
>>>>       final Terms terms = rdr.getTermVector(i, "body");
>>>>       if (terms != null) {
>>>>         // count terms in doc
>>>>         int numTerms = 0;
>>>>         term = terms.iterator(term);
>>>>         while (term.next() != null) {
>>>>           System.out.println("doc " + i + " had term '" +
>>>>               term.term().utf8ToString() + "'");
>>>>           ++numTerms;
>>>>
>>>>           // would like to record doc term frequencies here, i.e.,
>>>>           // counts[i][term.term()] = term.freq()
>>>>         }
>>>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>>>       }
>>>>       else {
>>>>         System.err.println("doc " + i + " had no term vector for body");
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>>   public static void main(String[] args) throws IOException,
>>>>       InterruptedException, CorruptIndexException {
>>>>     final String path = args[0];
>>>>     createIndex(path);
>>>>     readIndex(path);
>>>>   }
>>>> }
>>>>
>>>> --
>>>> Jon Stewart, Principal
>>>> (646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org

--
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
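
[Editor's note] The per-document count Jon is after is not on TermsEnum: totalTermFreq() and docFreq() are corpus-level statistics. A term vector, however, acts as an inverted index over a single document, so you can pull a DocsEnum from the TermsEnum and read freq() from its one posting. Below is a minimal sketch against the Lucene 4.x API in the style of the thread's code; the field name "body" is carried over from the example above, the class name is invented, and the code is an untested illustration rather than a drop-in patch:

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;

public class TermVectorFreqs {
  public static void main(String[] args) throws IOException {
    final IndexReader rdr =
        DirectoryReader.open(FSDirectory.open(new File(args[0])));
    DocsEnum docs = null;  // reused across terms to avoid reallocation
    for (int i = 0; i < rdr.maxDoc(); ++i) {
      final Terms terms = rdr.getTermVector(i, "body");
      if (terms == null) {
        continue;  // this doc stored no term vector for the field
      }
      final TermsEnum te = terms.iterator(null);
      while (te.next() != null) {
        // The term vector is an inverted index over a single document,
        // so its DocsEnum yields exactly one entry; freq() on that
        // entry is the within-document count of the current term.
        docs = te.docs(null, docs);
        if (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          System.out.println("doc " + i + " term '"
              + te.term().utf8ToString() + "' freq " + docs.freq());
        }
      }
    }
    rdr.close();
  }
}
```

Since getTermVector() takes a top-level docID, this also avoids SlowCompositeReaderWrapper entirely, which matches Ian's reading of the migration guide.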