lucene-solr-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: What's the bottleneck?
Date Fri, 12 Sep 2008 19:58:00 GMT
>Thanks for all the replies!
>
>Mike: we're not using pf.  Our qf is always "status:0".  The "status" field
>is "0" for all good docs (90%+) and some other integer for any docs we don't
>want returned.
>
>Jeyrl: federated search is definitely something we'll consider.
>
>On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>>  The bottleneck may simply be there are a lot of docs to score since you are
>>  using fairly common terms.
>
>Yeah, I'm coming to the realization that it may be as simple as that.  Even
>a short, simple query like "shirt" can take seconds to return, presumably
>because it hits ("numFound") 2 million docs.
>
>
>>  Also, what file format (compound, non-compound) are you using?  Is it
>>  optimized?  Have you profiled your app for these queries?  When you say the
>>  "query is longer", define "longer"...  5 terms?  50 terms?  Do you have lots
>>  of deleted docs?  Can you share your DisMax params?  Are you doing wildcard
>>  queries?  Can you share the syntax of one of the offending queries?
>
>
>I think we're using the non-compound format.  We see eight different files
>(fdt, fdx, fnm, etc.) in an optimized index.  Yes, it's optimized.  It's
>also read-only---we don't update/delete.  DisMax: we specify qf, fl, mm, fq;
>mm=1; we use boosts for qf.  No wildcards.  Example query: "shirt"; takes 2
>secs to run according to the solr log, hits 2 million docs.
>
>
>>  Since you want to keep "stopwords", you might consider a slightly better
>>  use of them, whereby you use them in n-grams only during query parsing.
>
>
>Not sure what you mean here...

You might want to look at how Nutch handles this issue. Nutch also 
has stopwords that it wants to keep around. So what it does is 
generate combo terms like the-<next term> in the index. The query 
parser does the same thing, so that if your query phrase has common 
terms, you wind up searching across a much smaller slice of your 
index.

This comes, of course, at the expense of a larger index with a lot 
more unique terms (due to all of the combo terms).
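
To make the idea concrete, here is a rough sketch in Python. It is not Nutch's actual code, and the stopword list and the function names (combo_terms, query_terms) are made up for illustration. At index time every term is kept and a combo term is added after each stopword; at query time a stopword is swapped for its combo term, so the lookup hits a much rarer term.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "for", "and"}


def combo_terms(text, stopwords=STOPWORDS):
    """Index side: keep every token, and after each stopword also emit
    a combined 'stopword-nextterm' token."""
    tokens = text.lower().split()
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in stopwords and i + 1 < len(tokens):
            out.append(f"{tok}-{tokens[i + 1]}")
    return out


def query_terms(text, stopwords=STOPWORDS):
    """Query side: replace each stopword that has a following token with
    the combined token, so the common term itself is never looked up."""
    tokens = text.lower().split()
    out = []
    for i, tok in enumerate(tokens):
        if tok in stopwords and i + 1 < len(tokens):
            out.append(f"{tok}-{tokens[i + 1]}")
        else:
            # Non-stopwords, and a trailing stopword with nothing after it,
            # pass through unchanged.
            out.append(tok)
    return out


if __name__ == "__main__":
    print(combo_terms("The shirt of the day"))
    print(query_terms("The shirt of the day"))
```

The extra tokens from combo_terms are where the larger index and the additional unique terms come from.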

But this can be a big win - for example, at our site 
(http://www.krugle.org) we index source files. Without this 
optimization, searches could take several seconds. With it, we got 
down to < 100ms with lots of breathing room.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
