From Andrzej Bialecki <...@getopt.org>
Subject Re: Lucene performance bottlenecks
Date Wed, 07 Dec 2005 09:51:49 GMT
Paul Elschot wrote:

>On Saturday 03 December 2005 14:09, Andrzej Bialecki wrote:
>
>>Paul Elschot wrote:
>>
>>>In somewhat more readable layout:
>>>
>>>+(url:term1^4.0 anchor:term1^2.0 content:term1
>>>  title:term1^1.5  host:term1^2.0)
>>>+(url:term2^4.0 anchor:term2^2.0 content:term2
>>>  title:term2^1.5 host:term2^2.0)
>>>url:"term1 term2"~2147483647^4.0 
>>>anchor:"term1 term2"~4^2.0
>>>content:"term1 term2"~2147483647
>>>title:"term1 term2"~2147483647^1.5
>>>host:"term1 term2"~2147483647^2.0
>>>
>>>The first two clauses with + prefixes are required, and I would guess
>>>that the 5-way disjunctions inside these clauses take most of the cpu
>>>time during search.
>>
>>That's an interesting observation. This suggests that it could pay off 
>>to glue these fields together and change this to a query on a single 
>>combined field, right? I.e. to trade off space for speed.
>
>Yes.

Unfortunately, that's not an option... Nutch uses these clauses to 
affect the final score value, i.e. it has to use different fields in 
order to apply different boost values per field, both in the query and 
in the encoded fieldNorms.
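
To make the structure concrete, here is a minimal sketch of how such a
query could be built against the 1.4-era BooleanQuery API (the class and
method names below are hypothetical scaffolding; the field names, boosts
and slops are copied from the expanded query above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NutchStyleQuery {

  // One required clause: a 5-way disjunction of the term over all fields,
  // each with its own boost.
  static Query termDisjunction(String term) {
    String[] fields = { "url", "anchor", "content", "title", "host" };
    float[]  boosts = { 4.0f,  2.0f,     1.0f,      1.5f,    2.0f };
    BooleanQuery dis = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
      TermQuery tq = new TermQuery(new Term(fields[i], term));
      tq.setBoost(boosts[i]);
      dis.add(tq, false, false);          // optional within the disjunction
    }
    return dis;
  }

  static Query build() {
    BooleanQuery top = new BooleanQuery();
    top.add(termDisjunction("term1"), true, false);   // required
    top.add(termDisjunction("term2"), true, false);   // required

    // One of the optional sloppy phrase clauses: anchor:"term1 term2"~4^2.0.
    // The other phrase clauses differ only in field, slop (Integer.MAX_VALUE
    // for the unbounded ones) and boost.
    PhraseQuery anchor = new PhraseQuery();
    anchor.add(new Term("anchor", "term1"));
    anchor.add(new Term("anchor", "term2"));
    anchor.setSlop(4);
    anchor.setBoost(2.0f);
    top.add(anchor, false, false);                    // optional
    return top;
  }
}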

>Querying the host field like this in a web page index can be dangerous
>business. For example when term1 is "wikipedia" and term2 is "org",
>the query will match at least all pages from wikipedia.org.

Well, that's the idea - all pages in wikipedia.org are somehow relevant 
to a query "wikipedia org". How relevant they are depends on the weights 
of the individual clauses (in addition to the usual tf / idf / fieldNorm).

>>>The remaining clauses will be skipped to only when the two required
>>>clauses are both present, so these are probably not the problem.
>>>In the call tree for scoring these can be identified by the skipTo()
>>>being called inside the score() method at the top level.
>>>
>>>This is one of the cases in which BooleanScorer2 can be faster
>>>than the 1.4 BooleanScorer because the 1.4 BooleanScorer does
>>>not use skipTo() for the optional clauses.
>>>Could you try this by calling the static method
>>>BooleanQuery.setUseScorer14(true) and repeating the test?

As far as I can tell it doesn't make any statistically significant 
difference - all search times remain nearly the same. If anything, the 
test runs with useScorer14 == true are fractionally faster.
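
For reference, toggling the scorer in a test loop looks like this (a
sketch; searcher and query stand in for the actual test setup):

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;

// Select the 1.4 BooleanScorer instead of the default BooleanScorer2,
// then run the same query as before.
BooleanQuery.setUseScorer14(true);
Hits hits = searcher.search(query);
System.out.println("hits: " + hits.length());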

>>>From the hot spots output of the profiler info I see that the following
>>>methods:
>>>- PhrasePositions.nextPosition
>>>- SegmentTermDocs.read
>>>have a larger portion of cpu time spent in the method.
>>>Looking at the expanded query, seeing PhrasePositions.nextPosition
>>>here is not a surprise.
>>>
>>>The fact that SegmentTermDocs.read shows up might explain
>>>the reason why only a little heap is used:  Lucene normally leaves
>>>the file buffering to the operating system, and
>>>when the file buffers are read the index data is decompressed
>>>mostly by the readVInt method.
>>
>>Yes, I understand it now. But perhaps it's time to contest this 
>>approach: if there is so much heap available, does it still make sense 
>>to rely so much on OS caching, if we have the space to do the caching 
>>inside the JVM (at least for a large portion of the index)?
>
>You can try to use Lucene's RAMDirectory when you have enough RAM.
>However, for larger indexes, it is easier to fit the index files in RAM
>(in OS buffers) than the decompressed index data (inside the JVM).
>
>Also, caching query results by query text is probably more effective
>than caching the JVM version of the searched index data.
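
(For completeness, the RAMDirectory route Paul mentions would look
roughly like this - a sketch using the 1.4-era API, with a placeholder
index path:)

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

// Copy the whole on-disk index into the JVM heap and search it from there.
// Only feasible when the index actually fits in the heap.
RAMDirectory ramDir = new RAMDirectory("/path/to/index");
IndexSearcher searcher = new IndexSearcher(ramDir);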

The problem here is not with the disk I/O, so I don't think RAMDirectory 
would help, even if I had the required 30GB of RAM - the problem is with 
the number of invocations of readVInt(), which for my test scenarios was 
called on average 2.5 times per doc in the index, per query (in this 
case, for a single query run against a 10 million doc index it was 
invoked ~25 million times). I tried changing the TermIndexInterval (I 
tested it with values ranging from 16 to 512), but there were no 
significant differences in speed, although of course the memory 
consumption was very different. I would happily trade a lot of heap 
space in order to increase performance of such complex queries. But at 
the moment there is no way to do it that I can see...
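
(For readers not familiar with the method in question: readVInt() decodes
Lucene's variable-length integer encoding one byte at a time, and is hit
once per delta-encoded entry read from the postings data. The 1.4-era
implementation in org.apache.lucene.store.InputStream is essentially
this:)

// Decode one variable-length int: low 7 bits per byte,
// high bit flags a continuation byte.
public int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}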

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


