lucene-java-user mailing list archives

From Trejkaz <>
Subject Re: Usage of NoMergePolicy and its potential implications
Date Wed, 25 Jul 2012 21:42:24 GMT
On Thu, Jul 26, 2012 at 5:38 AM, Simon Willnauer
<> wrote:
> you really shouldn't do that! If you use Lucene as a primary key
> generator, why don't you build your own on top? Just add one layer that
> accepts the document and returns the PID and internally put it in an
> ID field. Using no merge policy is not a good idea either: you will
> very likely hit system limits (# of file descriptors) and suffer from
> bad search performance and low compression.
> I think you should really consider fixing your app instead of hacking Lucene.

I can understand how they would end up in this situation since we
ended up in it as well.

We tried using our own ID (which we still have in Lucene and still use
for other purposes), and it slows down some things.

For example, when building bit sets for filters based on the external
database, you now have to look up every ID you get back. Because any
row, even the last one returned from the query, might turn out to be
Lucene doc ID 0, you can't build the filter at all unless you process
every row returned from the query.

If the SQL query returned a million docs, you had to do a
million term lookups in Lucene. We didn't have enough memory to store
the mapping from our ID back to Lucene's (OOME as soon as you tried to
make a map to look things up faster), which made it impossible to
cache the information at the time. I'm not sure if it's getting easier
or harder - memory sizes are increasing but the number of docs people
are putting into the indexes is increasing as well.
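
As an illustration of the two cases described above, here is a minimal
sketch. It is not code from this thread and it does not use the Lucene
API: the class name, both method names, and the `lookup` function are
hypothetical, and `lookup` merely stands in for the per-row term lookup
that maps an application ID to a Lucene doc ID.

```java
import java.util.BitSet;
import java.util.function.LongToIntFunction;

// Hypothetical sketch: plain java.util.BitSet, no Lucene classes involved.
final class DocIdFilterSketch {

    // Rows from the external database carry the application's own ID.
    // Every row costs one lookup (assumed to stand in for a term lookup)
    // before its bit can be set, so N rows mean N lookups.
    static BitSet fromExternalIds(long[] externalIds, LongToIntFunction lookup, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (long externalId : externalIds) {
            int docId = lookup.applyAsInt(externalId);
            if (docId >= 0) {          // negative means "no such document" in this sketch
                bits.set(docId);
            }
        }
        return bits;
    }

    // Rows from the external database already carry the doc ID,
    // so the bit set can be filled directly with no lookups.
    static BitSet fromDocIds(int[] docIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId : docIds) {
            bits.set(docId);
        }
        return bits;
    }
}
```

The second method is only valid while doc IDs stay stable, which is
exactly the assumption that merging breaks.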

At the time, Lucene developers were adamant that we shouldn't be using
the doc ID, because deleted doc IDs eventually get reused (or rather,
all the IDs shift downwards). But since we never physically delete
docs (we want a history of item modification including deletion, so
doing that would be undesirable anyway), it was never a problem until
the new merging came along.

I guess while the doc ID is still available, people will continue to
use it. If it disappeared from the API completely, this would be good
encouragement to migrate to a different approach. :)

