lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Document term vectors in Lucene 4
Date Thu, 17 Jan 2013 14:08:40 GMT
Which statistics in particular (which methods)?

On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart
<jon@lightboxtechnologies.com> wrote:
> Thanks very much for your reply, Ian.
>
> I am using SlowCompositeReaderWrapper because I am also retrieving the
> term frequency statistics for the corpus (at the end of the day, I am
> doing some machine learning/document clustering). Despite its name and
> warning documentation not to use it, SlowCompositeReaderWrapper seems
> to be the only baked-in way of getting total corpus term statistics
> from a DirectoryReader, right? Incidentally, I am using the
> StandardAnalyzer as well.
>
>
> Jon
>
> On Thu, Jan 17, 2013 at 5:06 AM, Ian Lea <ian.lea@gmail.com> wrote:
>> When I run your code, as is except for using RAMDirectory and setting
>> up an IndexWriter using StandardAnalyzer
>>
>>         RAMDirectory dir = new RAMDirectory();
>>         Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
>>         IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
>>         IndexWriter iw = new IndexWriter(dir, iwcfg);
>>         ...
>>         iw.addDocument(doc);
>>         iw.close();
>>
>> it prints
>>
>> doc 0 had 1 terms.
>>
>> If I change the text to, e.g., "this is foobar gibberish", it says there
>> are 2 terms. So it looks OK to me. "this" and "is" are presumably in the
>> default list of stop words.
>>
>> Not relevant, but why are you using SlowCompositeReaderWrapper rather than just
>> IndexReader rdr = DirectoryReader.open(dir)? I get the same results either way.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jan 17, 2013 at 5:52 AM, Jon Stewart
>> <jon@lightboxtechnologies.com> wrote:
>>> Hello,
>>>
>>> I cannot extract document term vectors from an index, and have not
>>> turned up much in some determined googling. In short, when I call
>>> IndexReader.getTermVector(docID, field) or
>>> IndexReader.getTermVectors(docID) and then navigate down to the Terms
>>> for the specified field, I get a null result.
>>>
>>> // Indexing:
>>>   String bodyText = "this is foobar";
>>>   final FieldType BodyOptions = new FieldType();
>>>   BodyOptions.setIndexed(true);
>>>   BodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>>   BodyOptions.setStored(true);
>>>   BodyOptions.setStoreTermVectors(true);
>>>   BodyOptions.setTokenized(true);
>>>   Document doc = new Document();
>>>   doc.add(new Field("body", bodyText, BodyOptions));
>>>
>>> When I examine docs in Luke, I can see the term vectors.
>>>
>>> // Retrieving (at a later time)
>>>   DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(path)));
>>>   SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);
>>>   for (int i = 0; i < rdr.maxDoc(); ++i) {
>>>     int numTerms = 0;
>>>     Terms terms = rdr.getTermVector(i, "body");
>>>     if (terms != null) {
>>>       TermsEnum term = terms.iterator(null);
>>>       while (term.next() != null) {
>>>         ++numTerms;
>>>       }
>>>       System.out.println("doc " + i + " had " + numTerms + " terms");
>>>     }
>>>     else {
>>>       System.err.println("null term vector on doc " + i);
>>>     }
>>>   }
>>>
>>> On every doc, the Terms object I get back from getTermVector(i, "body") is null.
>>>
>>>
>>> Jon
>>> --
>>> Jon Stewart, Principal
>>> (646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>
>
>
> --
> Jon Stewart, Principal
> (646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA
>
>
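
The term counts Ian reports in the quoted reply ("this is foobar" giving 1 term, "this is foobar gibberish" giving 2) can be reproduced with a toy model of stop-word filtering. This is only a sketch in plain Java, not Lucene code: the class name StopWordSketch, the method distinctTerms, and the short stop list are all assumptions made for illustration, and the real analyzer's tokenization and default stop set are more involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of stop-word filtering (hypothetical helper, not part of Lucene). */
class StopWordSketch {
    // Assumption: a small illustrative stop list, not the analyzer's full default set.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "the", "this", "to"));

    /** Lowercases the text, splits on non-alphanumeric runs, and keeps distinct non-stop terms. */
    static Set<String> distinctTerms(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Mirrors the loop in the original post, which counts the entries of a term vector.
        System.out.println(distinctTerms("this is foobar").size());           // prints 1
        System.out.println(distinctTerms("this is foobar gibberish").size()); // prints 2
    }
}
```

Under these assumptions, only "foobar" survives from the first string, which matches the "doc 0 had 1 terms" output quoted above.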


