lucene-java-user mailing list archives

From "Dalton, Jeffery" <jdal...@globalspec.com>
Subject RE: Lucene performance bottlenecks
Date Thu, 08 Dec 2005 15:12:26 GMT
 
Andrzej, I think you did a great job elucidating my thoughts as well.  I
heartily concur with everything you said.  

Andrzej Bialecki Wrote:
> Hmm... Please define what "adequate" means. :-) IMHO, 
> "adequate" is when for any query the response time is well 
> below 1 second. Otherwise the service seems sluggish. 
> Response times over 3 seconds are normally not acceptable. 
> This is just for a single concurrent query - the number of 
> concurrent queries will be a function of the number of 
> concurrent users, and the search response time, until it 
> reaches the limit of the number of threads on the search 
> servers. Then, the time it takes to return the results should 
> give us the maximum concurrent query-per-second estimate.

Agreed!
 
> What I found out is that "usable" depends a lot on how you 
> test it and what is your minimum expectation. There are some 
> high-frequency terms (and by this I mean terms with frequency 
> around 25%) that will consistently cause a dramatic slowdown. 
> Multi-term queries, because of the way Nutch expands them 
> into sloppy phrases, may take even more time, so even for 
> such a relatively small index (from the POV of the whole
> Internet!) the response time may drag into several seconds 
> (try "com").
> 
> >
> > Perhaps your traffic will be much higher than the Internet
> > Archive's, or you have contractual obligations that specify certain
> > average query performance, but, if not, ~10M pages is quite
> > searchable using Nutch on a single CPU.
> 
...
> If 10 mln docs is too much for a single server to meet such a 
> performance target, then this explodes the total number of 
> servers required to handle Internet-wide collections of 
> billions of pages...

Most definitely. Here is a target I would like to set: ~50 million
pages per server, handling query rates of 2-3 queries per second with
average response times still sub-second.
 
> So, I think it's time to re-think the query structure and 
> scoring mechanisms, in order to simplify the Lucene queries 
> generated by Nutch - or to do some other tricks...

Yes! Andrzej, do you have thoughts on how this could be re-structured?

Here are some ideas that I had ...
Could we expand the use of N-Grams so that all occurrences of
significant multi-word phrases (up to a certain length) are indexed?
This would trade index space for query time.
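
For illustration, a rough sketch of such an analyzer follows. It uses
the ShingleFilter from Lucene's analysis module and the modern
Analyzer API, so treat it as a sketch of the idea rather than anything
Nutch actually does; the class name is made up.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical analyzer that emits word n-grams ("shingles") up to
// three words, so common phrases become single index terms and can be
// matched without a query-time phrase scan.
public class PhraseShingleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        // Emit 2- and 3-word shingles, e.g. "information retrieval"
        // becomes a single term alongside the unigrams.
        ShingleFilter shingles = new ShingleFilter(source, 2, 3);
        shingles.setOutputUnigrams(true);
        return new TokenStreamComponents(source, shingles);
    }
}

A quoted phrase could then be rewritten into a single TermQuery on the
shingled field, which is exactly the space-for-time trade above.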

Next, let's examine a very simple query:
+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5
host:term1^2.0)
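
For reference, here is roughly how such a per-term disjunction is
assembled by hand. This is sketched with the current Lucene API
(BooleanQuery.Builder and BoostQuery) rather than the setBoost() calls
of Nutch's own query translator, and the helper names are mine:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class NutchStyleQuery {
    // One 5-way disjunction per query term, mirroring the boosts above.
    static Query disjunctionFor(String term) {
        BooleanQuery.Builder perTerm = new BooleanQuery.Builder();
        perTerm.add(new BoostQuery(new TermQuery(new Term("url", term)),
                4.0f), Occur.SHOULD);
        perTerm.add(new BoostQuery(new TermQuery(new Term("anchor", term)),
                2.0f), Occur.SHOULD);
        perTerm.add(new TermQuery(new Term("content", term)), Occur.SHOULD);
        perTerm.add(new BoostQuery(new TermQuery(new Term("title", term)),
                1.5f), Occur.SHOULD);
        perTerm.add(new BoostQuery(new TermQuery(new Term("host", term)),
                2.0f), Occur.SHOULD);
        return perTerm.build();
    }

    // The full query requires one such disjunction per term (the '+'
    // prefix), so every matching document pays the cost of all five
    // fields for every term.
    static Query queryFor(String... terms) {
        BooleanQuery.Builder top = new BooleanQuery.Builder();
        for (String t : terms) {
            top.add(disjunctionFor(t), Occur.MUST);
        }
        return top.build();
    }
}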
Let me reiterate some discussion on this topic:

Paul Elschot wrote:
> The first two clauses with + prefixes are required, and I would guess
> that the 5-way disjunctions inside these clauses take most of the CPU
> time during search.

Andrzej Bialecki wrote:
>> That's an interesting observation. This suggests that it could pay
>> off to glue these fields together and change this to a query on a
>> single combined field, right? I.e. to trade off space for speed.

Andrzej Bialecki wrote:
> Unfortunately, that's not an option... Nutch uses these clauses to
> affect the final score value, i.e. it has to use different fields in
> order to apply different boost values per field, both in the query
> and in the encoded fieldNorms.

What changes could be made to make this possible?  It might be, as you
said, that this would require some re-thinking and re-structuring of
Nutch's query behavior.  Beyond gluing the fields together, another,
more radical, idea might be to combine all of those field scores into
a single term-document score at index time.
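
As a sketch of what gluing the fields together with index-time
weighting might look like: build one catch-all field whose term
frequencies already encode the per-field weights, by repeating each
source field's text in rough proportion to its boost. The field names,
integer weights, and flatten() helper are all hypothetical, and tf
saturation means repetition only approximates a multiplicative boost:

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public final class FlattenedDoc {
    // Integer approximations of the query-time boosts shown earlier.
    private static final Map<String, Integer> WEIGHTS = Map.of(
            "url", 4, "anchor", 2, "content", 1, "title", 2, "host", 2);

    // Merge all source fields into a single "merged" field, repeating
    // each field's text according to its weight, so a plain TermQuery
    // on "merged" roughly reproduces the weighted disjunction.
    static Document flatten(Map<String, String> fields) {
        StringBuilder merged = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            int copies = WEIGHTS.getOrDefault(e.getKey(), 1);
            for (int i = 0; i < copies; i++) {
                merged.append(e.getValue()).append(' ');
            }
        }
        Document doc = new Document();
        doc.add(new TextField("merged", merged.toString(), Field.Store.NO));
        return doc;
    }
}

A single-field query then replaces the 5-way disjunction with one
TermQuery per term, at the cost of a larger index and frozen weights.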

Computing the term-document score across multiple fields at index time
would offload much of the CPU work necessary to rank and return
documents.  Additionally, the index could be sorted by the
term-document score, with the best-matching documents listed first,
providing an additional speed-up. The biggest drawback of this approach
is that you couldn't easily change weights or ranking parameters at
query time; if you wanted to change your ranking factors, you would
have to re-index the documents.  However, for very large-scale
deployments, the query-time CPU and performance benefit might well be
worth the inconvenience of re-indexing in order to re-rank.
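
Later Lucene versions expose exactly this kind of static ordering as
index sorting (IndexWriterConfig.setIndexSort). As a sketch, assuming
each document carries a hypothetical precomputed "staticScore" doc
value, postings are then laid out best-documents-first so collection
can terminate early:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public final class ScoreSortedIndex {
    // Open a writer whose segments are kept sorted by descending
    // precomputed score, so the best documents come first.
    public static IndexWriter open(String path) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setIndexSort(new Sort(
                new SortField("staticScore", SortField.Type.LONG, true)));
        return new IndexWriter(FSDirectory.open(Paths.get(path)), cfg);
    }

    // Every document must carry the sort key as a doc value.
    static Document withScore(Document doc, long staticScore) {
        doc.add(new NumericDocValuesField("staticScore", staticScore));
        return doc;
    }
}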

In summary, it seems that Nutch performs document scoring almost
entirely at query time, and this behavior has an adverse effect on
performance.  One alternative is to perform much of this calculation at
index time.  There is also some middle ground, such as gluing the
fields together as Andrzej suggested.  No matter which change is made,
I agree with Andrzej that the query structure and scoring mechanisms
require some simplification in order to achieve acceptable performance
at scale.
 
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web 
> ___|||__||  \|  ||  |  Embedded Unix, System Integration 
> http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

