lucene-dev mailing list archives

From "fp235-5" <julien.nio...@lingway.com>
Subject Re: suggestion for a CustomDirectory
Date Fri, 05 Dec 2003 20:53:18 GMT
Hello Doug,

I can send you an example of the queries I'm building. They can be very large.
The indexes are always optimized.
All the Term and Phrase queries inside a BooleanQuery are sorted, and this does
speed things up a little. Sorting the terms inside a PhraseQuery is of limited
use, though (it is only possible when order does not matter). If I had a single
BooleanQuery (say, an OR), ordering the terms would help a lot, but unfortunately
the queries I send are made of nested BooleanQuery objects, up to 3 or 4 levels
deep.
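
To make the sorting concrete, here is a minimal sketch of ordering the terms of
one query level before they are looked up. It does not use the Lucene classes:
SimpleTerm, DICTIONARY_ORDER and sortForLookup are hypothetical names made up
for the illustration, and the field-then-text ordering is assumed to match the
order of the term dictionary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedTermsSketch {
    /** Hypothetical stand-in for a Lucene term: a field name plus the term text. */
    static final class SimpleTerm {
        final String field;
        final String text;

        SimpleTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    /** Field first, then text: assumed to mirror the order of the term dictionary. */
    static final Comparator<SimpleTerm> DICTIONARY_ORDER =
        Comparator.comparing((SimpleTerm t) -> t.field).thenComparing(t -> t.text);

    /** Returns a sorted copy and leaves the caller's list untouched. */
    static List<SimpleTerm> sortForLookup(List<SimpleTerm> terms) {
        List<SimpleTerm> sorted = new ArrayList<>(terms);
        sorted.sort(DICTIONARY_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<SimpleTerm> terms = new ArrayList<>();
        terms.add(new SimpleTerm("title", "lucene"));
        terms.add(new SimpleTerm("body", "index"));
        terms.add(new SimpleTerm("body", "directory"));
        // Prints [body:directory, body:index, title:lucene]
        System.out.println(sortForLookup(terms));
    }
}
```

With nested queries this would have to be applied level by level, which is why
the gain stays small in my case.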

I also found that disabling the idf by using a custom Similarity object
improves speed a little.

If I understand correctly, lowering TermInfosWriter.INDEX_INTERVAL would create
a bigger .tii file, and thus more Term objects would be available in memory.
I'll try this to see what impact it has on the performance of my app.

By "creation of temporary Term objects" I meant the whole process of finding a
given Term (i.e. parsing, creation, comparison). Dmitry's patch improved this
part a lot and, in my case, reduced the overall time by 10-15%. Sadly, it was
never included in the source, although it could have been useful for all kinds
of users.

The idea behind the CustomDirectory is to kill two birds with one stone:
1/ escape the all-or-nothing approach (everything on the FS or everything in
RAM) by keeping frequently used information in memory, and choose the approach
at reading time.
2/ avoid useless creation/destruction of objects and improve access to Term
objects (which do not have to be accessed sequentially)
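
The first point could be sketched roughly as follows. This is not the Lucene
Directory API, only a stand-alone illustration: HybridStoreSketch, isInMemory
and read are hypothetical names, and the choice of which file extensions to
keep in memory is an assumption made by the caller when the store is opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HybridStoreSketch {
    private final Path directory;
    private final Map<String, byte[]> pinned = new HashMap<>();

    /** Loads every regular file whose extension is in pinnedExtensions into memory. */
    HybridStoreSketch(Path directory, Set<String> pinnedExtensions) {
        this.directory = directory;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(directory)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String extension = dot < 0 ? "" : name.substring(dot + 1);
                if (Files.isRegularFile(file) && pinnedExtensions.contains(extension)) {
                    pinned.put(name, Files.readAllBytes(file));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the named file was loaded into memory when the store was opened. */
    boolean isInMemory(String name) {
        return pinned.containsKey(name);
    }

    /** Serves pinned files from memory and everything else from the file system. */
    byte[] read(String name) {
        byte[] cached = pinned.get(name);
        if (cached != null) {
            return cached.clone();
        }
        try {
            return Files.readAllBytes(directory.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // The file names below are illustrative only.
        Path dir = Files.createTempDirectory("hybrid-store");
        Files.write(dir.resolve("_1.tii"), new byte[] {1, 2, 3});
        Files.write(dir.resolve("_1.frq"), new byte[] {4, 5});

        Set<String> inMemory = new HashSet<>(Arrays.asList("tii"));
        HybridStoreSketch store = new HybridStoreSketch(dir, inMemory);
        System.out.println(store.isInMemory("_1.tii")); // true
        System.out.println(store.isInMemory("_1.frq")); // false
    }
}
```

The point of the design is that the RAM/FS decision is made per file when the
index is opened, not once for the whole index when it is written.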

Thank you very much, Doug, for suggesting the use of INDEX_INTERVAL! I'll try it
on Monday.

Have a good weekend, everybody.

Julien


---------- Original message -----------

From    : Doug Cutting <cutting@lucene.com>
To      : Lucene Developers List <lucene-dev@jakarta.apache.org>
Cc      : 
Date    : Fri, 05 Dec 2003 10:12:56 -0800
Subject : Re: suggestion for a CustomDirectory

Julien Nioche wrote:
> Profiling my application indicates that a lot of time is spent on the
> creation of temporary Term objects.

It does indeed look like term lookup is using a lot of your time.  I 
don't see the Term constructor showing up as significant in your 
profile, so it looks to me like it could just be the cost of parsing the 
data, not the allocation/GC stuff.  I've found that allocation of 
temporary objects doesn't really cost much with modern garbage 
collectors.  The biggest cost of allocating objects is sometimes just 
the constructor.

What sort of queries are you making against what sort of an index?  It 
looks like you're probably making large queries with lots of 
low-frequency terms, in order for term lookup to be such a large factor. 
  You might try sorting the terms in the query.  If subsequent lookups 
are nearby in the TermInfo file then it won't have to scan as much. 
Could that help?  Also, is your index optimized?  An optimized index 
will drastically reduce the term lookup costs.

If all these fail, try reducing TermInfosWriter.INDEX_INTERVAL.  You'll 
have to re-create your indexes each time you change this constant.  You 
might try a value like 16.  This would keep the number of terms in 
memory from being too huge (1 of 16 terms), but would reduce the average 
number scanned from 64 to 8, which would be substantial.  Tell me how 
this works.  If it makes a big difference, then perhaps we should make 
this parameter more easily changeable.
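
The trade-off above can be checked with a small stand-alone sketch. It does not
touch Lucene: averageScanned and indexedTerms are hypothetical helpers, lookups
are assumed to be spread uniformly over the term dictionary, and the interval
of 128 is simply the value implied by the "from 64 to 8" figures.

```java
public class IndexIntervalSketch {
    /**
     * Average number of entries read after the nearest preceding in-memory
     * index entry, assuming every term is looked up equally often.
     */
    static double averageScanned(int termCount, int interval) {
        long total = 0;
        for (int position = 0; position < termCount; position++) {
            total += position % interval; // entries scanned past the index point
        }
        return (double) total / termCount;
    }

    /** Number of terms held in memory: one index entry per interval terms. */
    static int indexedTerms(int termCount, int interval) {
        return (termCount + interval - 1) / interval;
    }

    public static void main(String[] args) {
        int termCount = 1 << 20; // 1,048,576 terms, an arbitrary example size
        for (int interval : new int[] {128, 16}) {
            System.out.println("interval " + interval
                + ": in memory = " + indexedTerms(termCount, interval)
                + ", average scanned = " + averageScanned(termCount, interval));
        }
    }
}
```

For this example size the averages come out as 63.5 and 7.5, in line with the
rough 64 and 8 above, while the number of in-memory entries grows from 8192 to
65536.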

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org