lucene-java-user mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: speed of BooleanQueries on 2.9
Date Thu, 16 Jul 2009 19:02:48 GMT
> caching them (as OpenBitSet)

How do you handle stop words in phrase queries?

On Thu, Jul 16, 2009 at 11:30 AM, eks dev<eksdev@yahoo.co.uk> wrote:
>
> Sure, if you have enough memory to do postings caching, with or without P4...
> I see P4 as a generally faster postings format, with stopwords or not.
>
> I wouldn't blow up the term dictionary, that just moves the problem to another place.
>
> What I am thinking of is quite simple, probably not the most elegant
> solution, but I am almost sure it would work:
> - get Top N terms from index, N depends on your available memory
> - create a Filter from each of them, stick it into a ConstantScoreQuery,
>   calculate idf() and set boost() to this value, cache it
> - implement a QueryOptimizer that loops over all Terms in your Query and
>   replaces Terms with the cached ConstantScoreQuery
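
A minimal sketch of the three steps above, written against the Java standard library only and not taken from this thread. `java.util.BitSet` stands in for OpenBitSet, the class and method names (`HotTermCache`, `lookup`, `cachedTermsOf`) are hypothetical, and the idf formula assumes the classic default similarity, 1 + ln(numDocs / (docFreq + 1)). In real Lucene code the cached entry would be wrapped in a Filter and a ConstantScoreQuery rather than returned directly.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "cache the hottest terms as bit sets"; not Lucene code. */
class HotTermCache {
    /** A cached posting list plus the constant score (boost) to use for it. */
    static final class CachedTerm {
        final BitSet docs;
        final double boost;

        CachedTerm(BitSet docs, double boost) {
            this.docs = docs;
            this.boost = boost;
        }
    }

    private final Map<String, CachedTerm> cache = new HashMap<>();

    /** postings maps term -> doc ids; only the n most frequent terms are cached. */
    HotTermCache(Map<String, int[]> postings, int numDocs, int n) {
        List<String> terms = new ArrayList<>(postings.keySet());
        // Highest document frequency first; ties broken by term text.
        terms.sort((a, b) -> {
            int byDf = Integer.compare(postings.get(b).length, postings.get(a).length);
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        for (String term : terms.subList(0, Math.min(n, terms.size()))) {
            int[] docs = postings.get(term);
            BitSet bits = new BitSet(numDocs);
            for (int doc : docs) {
                bits.set(doc);
            }
            cache.put(term, new CachedTerm(bits, idf(docs.length, numDocs)));
        }
    }

    /** Assumed idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Optimizer lookup: the cached entry, or null if the term is not hot. */
    CachedTerm lookup(String term) {
        return cache.get(term);
    }

    /** Query terms that would be replaced by a cached constant-score clause. */
    List<String> cachedTermsOf(List<String> queryTerms) {
        List<String> hits = new ArrayList<>();
        for (String t : queryTerms) {
            if (cache.containsKey(t)) {
                hits.add(t);
            }
        }
        return hits;
    }
}
```

Terms that `lookup` does not find stay on the normal postings path, so only the hot terms pay the memory cost and lose their tf() component.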
>
> and voila, you have made a perfectly fast search... but
>
> BAD:
> a) you reduce the quality of your score value, as there is no tf() component.
>    But for stop words, I am not sure if that makes any significant difference.
>    Also, if you are lucky like me, you omitTf()... so no loss there
> b) if you load RAMIndex/MMap, you duplicate RAM needs for these postings...
>
> COOL:
> - Math on our index: Zipfian distribution does magic, the top 30 terms make up
>   36% of our corpus! For caching them (as OpenBitSet) on 100Mio documents I
>   need ~0.35G
> My terms distribution follows the collection terms distribution ... so I get a
> cache hit rate of 36% for only 0.38Gb RAM... You save a lot of VInt decoding
> (brings a lot, even if we ignore the benefit of reducing disk access... these
> hot terms must be OS cached anyhow). If you use some other filter, you need
> even less memory... it is only important to use a filter that is measurably
> faster than VInt decoding with skip lists.
> - This speeds up the slowest queries, fast queries are anyhow fast :)
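
The memory figures quoted above can be checked with a little arithmetic. The sketch below assumes one bit per document in a `long[]`-backed bit set (which is how OpenBitSet stores its bits) and ignores object overhead; the class and method names are made up for illustration. For 100 million documents and 30 terms it gives 375,000,000 bytes, which is about 0.375 GB in decimal units or about 0.349 GiB in binary units, so the "~0.35G" and "0.38Gb" figures may simply be the same number in different units.

```java
import java.util.Locale;

/** Back-of-the-envelope memory math for caching hot terms as bit sets. */
class BitSetCacheMath {
    /** One bit per document, rounded up to whole 64-bit words. */
    static long bytesPerTerm(long numDocs) {
        long words = (numDocs + 63) / 64;
        return words * 8;
    }

    /** Total bytes for numTerms cached terms over numDocs documents. */
    static long totalBytes(long numDocs, int numTerms) {
        return bytesPerTerm(numDocs) * numTerms;
    }

    public static void main(String[] args) {
        long total = totalBytes(100_000_000L, 30);
        System.out.printf(Locale.ROOT, "%d bytes = %.3f GB = %.3f GiB%n",
                total, total / 1e9, total / (double) (1L << 30));
    }
}
```

The cost grows linearly with both the number of documents and the number of cached terms, regardless of how dense each term's posting list is, which is why this only pays off for the most frequent terms.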
>
>
> I think it will work just fine
>
> Would be great if Lucene could do all this for me, I just say "here, I give
> you 500Mb free for postings cache, do your magic for me"... but nothing
> prevents me from providing a patch :)
>
> I will try it, to see if the theory works. We have cases where free memory is
> not a problem, we are hitting CPU there (VInt decoding on our last profiled
> run). To be honest, I do not know if anyone today runs high volume search from
> disk (maybe SSD), even then, a significant portion has to be in RAM...
>
> One day we could throw many CPUs at Query... but this is not an easy one...
>
>
>
>
>
> ----- Original Message ----
>> From: Jason Rutherglen <jason.rutherglen@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, 16 July, 2009 19:22:28
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> Do we think that we'll be able to support indexing stop words
>> using PFOR (with relaxation on the compression to gain
>> performance?) Today it seems like the best approach to indexing
>> stop words is to use shingles? However this blows up the term
>> dict because shingles concatenates phrases together.
>>
>> On Thu, Jul 16, 2009 at 8:26 AM, eks dev wrote:
>> >
>> > We did it for us, gave something back to community... all happy... open
>> > source works just fine here in lucene land :)
>> >
>> > Re, 10%
>> > I did not expect that much, but our index is quite dense, a lot of
>> > documents and not too many unique terms, omitTf ... so it is really hard
>> > pressure on DocIDSetIterator and Scorers.
>> >
>> > I cannot wait to see P4, pulsing index... in action...
>> > We are also going to try to cache postings for Top N high freq. terms in
>> > plain old ConstantScoreQuery via OpenBitSet ... with zipfian distribution
>> > this should reduce VInt decoding to 50% with just a few hundred terms...
>> > having TF independent score, we just need to adjust constant score value
>> > based on idf()... so no loss in quality! expected huge performance benefit
>> > (said optimist without numbers to prove it).
>> >
>> > Cheers, Eks
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Michael McCandless
>> >> To: java-user@lucene.apache.org
>> >> Sent: Thursday, 16 July, 2009 16:23:57
>> >> Subject: Re: speed of BooleanQueries on 2.9
>> >>
>> >> Super, thanks for testing!
>> >>
>> >> And, the 10% speedup overall is good progress...
>> >>
>> >> Mike
>> >>
>> >> On Thu, Jul 16, 2009 at 9:16 AM, eks dev wrote:
>> >> >
>> >> > and one final touch, 4X slow down does not exist with new Lucene...
>> >> > I did not verify it again on the old one, but hey, who cares. Trunk is
>> >> > clean and, at least so far, our favourite QA team has nothing to
>> >> > complain about ...
>> >> >
>> >> > They will keep it under stress for a while... so if something comes up
>> >> > you will hear from me...
>> >> > Thanks again to all.
>> >> >
>> >> > Cheers, Eks
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message ----
>> >> >> From: eks dev
>> >> >> To: java-user@lucene.apache.org
>> >> >> Sent: Thursday, 16 July, 2009 14:40:26
>> >> >> Subject: Re: speed of BooleanQueries on 2.9
>> >> >>
>> >> >>
>> >> >> ok new facts, less chaos :)
>> >> >>
>> >> >> - LUCENE-1744 fixed it definitely; I have it confirmed
>> >> >> Also, we found another example of the Query that was stuck
>> >> >> (t1 t2 t3)~2 ... this is also fixed with LUCENE-1744
>> >> >>
>> >> >>
>> >> >> Re:  "some queries are 4X slower than before".  Was that a different
>> >> >> issue? (Because this issue is "the query runs forever").
>> >> >>
>> >> >> Maybe :) I do not know.
>> >> >> When I wrote this email about "the query runs forever" I did not know
>> >> >> if this slowdown is the same or a different issue... I had just
>> >> >> reported an unusual observation (4 times slower) and was later
>> >> >> convinced that this stuck Query confirms the same problem ....
>> >> >>
>> >> >> Now, I do not know if that was the same effect, or wrong measurement,
>> >> >> or something else lurking ... Good point, will try to repeat the test
>> >> >> on this slowdown...
>> >> >>
>> >> >> Just a reminder, this 4_times_slower Query is different:
>> >> >> +(a b c) +(x y z)
>> >> >>
>> >> >> +((NAME:hans NAME:hahns^0.23232001 NAME:hams^0.27648002 NAME:hamz^0.25392
>> >> >> NAME:hanas^0.18722998 NAME:hanbs^0.18722998 NAME:hanfs^0.18722998
>> >> >> NAME:hangs^0.18722998 NAME:hanhs^0.24030754 NAME:hanis^0.18722998
>> >> >> NAME:hanjs^0.18722998 NAME:hanks^0.18722998 NAME:hanms^0.18722998
>> >> >> NAME:hanos^0.18722998 NAME:hanrs^0.18722998 NAME:hansb^0.20172001
>> >> >> NAME:hansd^0.20172001 NAME:hansf^0.20172001 NAME:hansg^0.20172001
>> >> >> NAME:hansi^0.20172001 NAME:hansj^0.20172001 NAME:hansk^0.20172001
>> >> >> NAME:hansl^0.20172001 NAME:hansn^0.20172001 NAME:hanso^0.20172001
>> >> >> NAME:hansp^0.20172001 NAME:hanst^0.20172001 NAME:hansu^0.20172001
>> >> >> NAME:hansw^0.20172001 NAME:hansy^0.20172001 NAME:hansz^0.20172001
>> >> >> NAME:hants^0.18722998 NAME:hanus^0.18722998 NAME:hanws^0.18722998
>> >> >> NAME:hehns^0.20172001 NAME:hens^0.2736075 NAME:hins^0.24843
>> >> >> NAME:hons^0.24843 NAME:huhns^0.1801875 NAME:huns^0.24843)^2.0)
>> >> >> +(((ZIPS:berlin ZIPS:barlin^0.28227 ZIPS:berien^0.25947002
>> >> >> ZIPS:berling^0.23232001 ZIPS:perlin^0.26133335))^1.2)
>> >> >>
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ----- Original Message ----
>> >> >> > From: Michael McCandless
>> >> >> > To: java-user@lucene.apache.org
>> >> >> > Sent: Thursday, 16 July, 2009 13:52:06
>> >> >> > Subject: Re: speed of BooleanQueries on 2.9
>> >> >> >
>> >> >> > On Thu, Jul 16, 2009 at 6:38 AM, eks dev wrote:
>> >> >> >
>> >> >> > > and this String has exactly that form
>> >> >> > > (x OR y OR z) OR (a OR b OR c),
>> >> >> > > That is exactly how I construct the Query, have a look at the
>> >> >> > > brackets on this toString result.
>> >> >> >
>> >> >> > Duh!  OK, I had missed that your large query actually had 2 clauses
>> >> >> > at the top!  Sigh.
>> >> >> >
>> >> >> > OK, that part of the puzzle now at least makes sense.  The rewrite()
>> >> >> > of your query will not reduce to a single OR query (as I previously
>> >> >> > thought).
>> >> >> >
>> >> >> > So in fact you have a BS at the top (because you called
>> >> >> > setAllowDocsOutOfOrder(true)), with 2 clauses, and each of those
>> >> >> > clauses uses BS2 to score.
>> >> >> >
>> >> >> > I think advance() is not involved, but LUCENE-1744 could very well
>> >> >> > have fixed this, because BS calls sub.scorer.docID() when interacting
>> >> >> > with its sub-scorers, and due to LUCENE-1744, that would always
>> >> >> > return -1 from a BS2, so BS could enter an infinite loop.
>> >> >> >
>> >> >> > If you run w/o the fix for LUCENE-1744, with my instrumentation, I
>> >> >> > can confirm this.  But I think likely this is it.
>> >> >> >
>> >> >> > Also: you started this thread by saying "some queries are 4X slower
>> >> >> > than before".  Was that a different issue?  (Because this issue is
>> >> >> > "the query runs forever").
>> >> >> >
>> >> >> > Mike
>> >> >> >
>> >> >> > ---------------------------------------------------------------------
>> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org

