lucene-java-user mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: Quickest way to build a Document - (Keyword, Freq)* map
Date Fri, 14 Feb 2003 22:06:11 GMT
On Friday 14 February 2003 15:10, you wrote:
> Hi,
>
> I am using Lucene right now to index several semi-structured documents. I
> recently had to implement a method 'getFrequencyVector()' to simply return
> a mapping of keyword -> frequency from the information already in the
> lucene index.
>
> I currently maintain the lucene index on basis of the keyword -> (document,
> freq)* mapping. The best solution I could come up with is to iterate over
> all the keywords ( :( ) match my own document identifier and build the
> vector. Any ideas/suggestions?
>
> Is there a way to speed up the vector computation? It currently takes a
> |k|*|d| where |k| is the total number of keywords indexed and |d| is the
> average number of documents a keyword can occur in.

The documentation of TermDocs.skipTo(docNr) mentions: "Some implementations
are considerably more efficient than that", i.e. more efficient than a linear
scan. With the current index format of Lucene, this is not possible at less
than O(|d|) cost, because the term frequency (.frq) file is only indexed from
the term file, i.e. by term, and not by (term, document nr) pairs. A more
efficient skipTo might be useful for terms occurring in a larger portion of
the documents.

So, if I understand this correctly, skipTo(docNr) might help performance,
but as far as I know you cannot currently do better than O(|d|) per term.
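
To make the cost concrete, here is a minimal sketch of the approach in plain
Java. It does not use the Lucene API: the Postings class, the linear skipTo
helper and the map-based index are hypothetical stand-ins for the term file
and the .frq data. In a real index you would presumably walk
IndexReader.terms() and call IndexReader.termDocs(term) for each term, but I
have not written or tested that version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an inverted index; this is not Lucene code.
final class FrequencyVectorSketch {

    /** Postings of one term: ascending document numbers with parallel frequencies. */
    static final class Postings {
        final int[] docs;
        final int[] freqs;

        Postings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }
    }

    /**
     * Linear skipTo: returns the position of the first entry whose document
     * number is >= target, or -1 if there is none. Cost is O(|d|) per term.
     */
    static int skipTo(Postings p, int target) {
        for (int i = 0; i < p.docs.length; i++) {
            if (p.docs[i] >= target) {
                return i;
            }
        }
        return -1;
    }

    /** Builds the keyword -> frequency map for one document by visiting every term. */
    static Map<String, Integer> getFrequencyVector(Map<String, Postings> index, int docNr) {
        Map<String, Integer> vector = new TreeMap<>();
        for (Map.Entry<String, Postings> e : index.entrySet()) {
            Postings p = e.getValue();
            int pos = skipTo(p, docNr);
            // Only record the term if the scan stopped exactly on docNr.
            if (pos >= 0 && p.docs[pos] == docNr) {
                vector.put(e.getKey(), p.freqs[pos]);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Postings> index = new LinkedHashMap<>();
        index.put("lucene", new Postings(new int[] {0, 2, 5}, new int[] {3, 1, 4}));
        index.put("index", new Postings(new int[] {1, 2}, new int[] {2, 7}));
        index.put("vector", new Postings(new int[] {5}, new int[] {1}));
        System.out.println(getFrequencyVector(index, 2)); // prints {index=7, lucene=1}
    }
}
```

The outer loop visits all |k| terms and the scan inside skipTo is linear in
the postings of each term, which gives the |k|*|d| total from the question.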

Kind regards,
Ype
