lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amit Kumar <ami...@uiuc.edu>
Subject Re: IndexReader.getTermFreqVector penality
Date Wed, 09 Aug 2006 19:57:01 GMT
Yes thanks Grant I realize that if I need the term freq in all the  
documents I could use TermEnum, but I have a use case where
I may need term frequencies of only selected documents, and the worst  
case scenario might be  term freq for n-1
documents, where n is the total number of documents in the index.

-Amit


On Aug 9, 2006, at 2:34 PM, Grant Ingersoll wrote:

> Hi Amit,
>
> If you want all the freqs of all the terms (or even just some of  
> the terms) in all documents, you don't need to use Term Vectors,  
> take a look at TermEnum and TermDocs.
>
> If you want for specific documents, then you do need Term Vectors.   
> You may get some CPU ticks by only keeping Positions (and not  
> offsets, assuming you don't need them), but I haven't confirmed  
> this.  I just figure it is slightly less data to read and write so  
> it should be faster.
>
> No, you do not need to Store your docs to have a term vector.  They  
> are written to different parts of the Index.
>
> -Grant
>
> On Aug 9, 2006, at 3:13 PM, Amit Kumar wrote:
>
>> Hi Lucene Users,
>>
>> I am using the lucene indices to get term frequencies. I just  
>> wanted to check with you about the
>> time it is taking to retrieve these term freq. Please suggest if I  
>> can improve the code/index or if
>> this is expected. It takes 8 to 9 seconds to retrieve the term  
>> freq values of all 1030 documents,
>> with an index size of ~530MB.
>>
>> Another question I have is Do I need to have Field.Store.Yes to  
>> get the term freq vector?
>>
>> Index Details:
>> -------------------
>> Size: 532 MB,
>> 1032 Documents with varying number of terms from 600 to 100,000
>> The field is indexed as Field.Store.YES,  
>> Field.Index.TOKENIZED,Field.TermVector.WITH_POSITIONS_OFFSETS
>>
>>
>> Term Freq Retrieval Time Values:
>> -------------------------------------
>>
>> The time ranges in 8 to 9 seconds
>>
>>  long s = System.currentTimeMillis();
>>  TermFreqVector termFreqVector;
>>     for (int i = 0; i < 1030; i++) {
>>       if (!reader.isDeleted(i)) {
>>        termFreqVector   = reader.getTermFreqVector(i, field);
>>        }
>>     }
>>     long l = System.currentTimeMillis();
>>
>>
>> Hardware and Memory Settings:
>> -------------------------------------------
>> -Xmx 2048m -XX:PermSize=16m -XX:MaxPermSize=128m
>>
>> Dual 1800 MHz Optron on 32 bit Linux 2.6.15.2; Lucene 2.0.0.
>>
>>
>>
>>
>> How can I get better results? Can I?
>>
>>
>>
>> Many thanks for your help.
>> -Amit
>>
>>
>>
>>
>>
>> ---------------------------------------------------------
>> Amit Kumar
>> Research Programmer
>> The Graduate School of Library and Information Science
>> University of Illinois, Urbana Champaign IL, 61820
>> phone: 217-333-4118 fax: 217-244-3302
>> ---------------------------------------------------------
>>
>>
>>
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------





Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message