lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Lucene performance bottlenecks
Date Mon, 12 Dec 2005 22:40:28 GMT
Paul Elschot wrote:

>There is one indexing parameter that might help performance
>for BooleanScorer2, it is the skip interval in Lucene's TermInfosWriter.
>The current value is 16, and there was a question about it
>on 16 Oct 2005 on java-dev with title "skipInterval".
>I don't know how the value of skipInterval was initially determined.
>It's possible that a larger value gives somewhat better query
>performance in this case.
>Changing the skip interval might require reindexing, though.

In Nutch the default is 128. And yes, changing it requires re-creating
the index (actually, optimizing it is enough, because that rewrites the
.tii file).
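
To make the trade-off behind the skip interval concrete, here is a minimal
sketch in plain Java. It is a toy model of a sorted posting list with
single-level skip entries, not Lucene's actual on-disk format or its
TermInfosWriter code; the class and method names (SkipIntervalSketch,
entriesExamined) are hypothetical. It only counts how many skip entries and
postings are read to reach the first doc id >= a target.

```java
/** Toy model of a posting list with skip entries (assumption: not Lucene's real format). */
final class SkipIntervalSketch {
    private final int[] docIds;      // doc ids, sorted ascending
    private final int skipInterval;  // one skip entry per skipInterval postings

    SkipIntervalSketch(int[] sortedDocIds, int skipInterval) {
        if (skipInterval < 1) throw new IllegalArgumentException("skipInterval must be >= 1");
        this.docIds = sortedDocIds.clone();
        this.skipInterval = skipInterval;
    }

    /** Counts skip entries plus postings read to find the first docId >= target. */
    int entriesExamined(int target) {
        int examined = 0;
        int pos = 0;
        // Follow skip entries while the next one does not overshoot the target.
        while (pos + skipInterval < docIds.length) {
            examined++; // one skip entry read
            if (docIds[pos + skipInterval] > target) break;
            pos += skipInterval;
        }
        // Scan linearly inside the block the skip entries led to.
        while (pos < docIds.length) {
            examined++; // one posting read
            if (docIds[pos] >= target) break;
            pos++;
        }
        return examined;
    }

    public static void main(String[] args) {
        int[] ids = new int[1000];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        for (int interval : new int[] {4, 16, 128, 1000}) {
            int n = new SkipIntervalSketch(ids, interval).entriesExamined(500);
            System.out.println("skipInterval=" + interval + " entries examined=" + n);
        }
    }
}
```

In this model a small interval means many skip entries to read and a large
one means a long linear scan inside the final block, so the best value
depends on list length and on how far each skip typically jumps.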

>I considered a specialised scorer for the earlier query:
>+(url:term1^4.0 anchor:term1^2.0 content:term1
>   title:term1^1.5  host:term1^2.0)
>+(url:term2^4.0 anchor:term2^2.0 content:term2
>   title:term2^1.5 host:term2^2.0)
>url:"term1 term2"~2147483647^4.0 
>anchor:"term1 term2"~4^2.0
>content:"term1 term2"~2147483647
>title:"term1 term2"~2147483647^1.5
>host:"term1 term2"~2147483647^2.0

Thank you for the detailed analysis. We are currently pursuing a totally
different approach: limiting the size of the index by carefully selecting
the most promising postings, and re-sorting the posting lists so that
they are ordered by a "pagerank"-like value, which would let us
skip postings coming from less significant docs. Please see the
nutch-dev discussion for more details.
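
As a rough illustration of the idea (not the actual Nutch implementation,
which is described in the nutch-dev discussion), the sketch below assumes
each posting carries a static, query-independent score. The names Posting,
sortByRankDesc and firstHits are hypothetical. Once a list is ordered by
descending score, a reader can stop after the first few hits and never touch
the postings of less significant docs.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of rank-ordered postings with early termination (all names hypothetical). */
final class RankOrderedPostingsSketch {

    /** A posting: doc id plus an assumed "pagerank"-like static score. */
    static final class Posting {
        final int docId;
        final float staticScore;

        Posting(int docId, float staticScore) {
            this.docId = docId;
            this.staticScore = staticScore;
        }
    }

    /** Returns a copy of the postings ordered by descending static score. */
    static List<Posting> sortByRankDesc(List<Posting> postings) {
        List<Posting> copy = new ArrayList<>(postings);
        copy.sort((a, b) -> Float.compare(b.staticScore, a.staticScore));
        return copy;
    }

    /**
     * Reads a list that is already sorted by descending static score and
     * stops as soon as maxHits doc ids have been collected.
     */
    static List<Integer> firstHits(List<Posting> postingsByRankDesc, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (Posting p : postingsByRankDesc) {
            if (hits.size() >= maxHits) break; // remaining postings are never read
            hits.add(p.docId);
        }
        return hits;
    }
}
```

The sort would happen once at index time; only firstHits runs per query.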

Oh, BTW: I just found the DisjunctionMaxQuery class, which seems to have
been added recently. Do you think this query structure could benefit from
using it instead of the BooleanQuery?
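
For context on the question, the sketch below contrasts the two ways of
combining per-field scores for one term. It is plain Java with made-up
scores and hypothetical names (DisMaxScoreSketch, sumScore, disMaxScore); it
does not call Lucene, and it ignores coord and normalization factors. The
only Lucene behaviour it relies on is that DisjunctionMaxQuery scores a doc
as the maximum sub-query score plus a tie-breaker multiplier times the other
sub-query scores, whereas a BooleanQuery of optional clauses adds them up.

```java
/** Toy comparison of sum-style and max-style combination of per-field scores. */
final class DisMaxScoreSketch {

    /** Sum of all per-field scores, as a disjunction of optional clauses would add them. */
    static float sumScore(float[] fieldScores) {
        float sum = 0f;
        for (float s : fieldScores) sum += s;
        return sum;
    }

    /** Maximum per-field score plus tieBreaker times the sum of the remaining scores. */
    static float disMaxScore(float[] fieldScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up scores for term1 matching in url, anchor and content.
        float[] perField = {4.0f, 2.0f, 1.0f};
        System.out.println("sum    = " + sumScore(perField));
        System.out.println("dismax = " + disMaxScore(perField, 0.0f));
    }
}
```

With a tie-breaker of 0 a doc is rewarded only for its best-matching field,
so a term repeated across url, anchor, title and host does not inflate the
score the way summing does.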

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
