lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Willnauer <>
Subject Re: a faster way to addDocument and get the ID just added?
Date Wed, 30 Mar 2011 09:21:21 GMT
On Wed, Mar 30, 2011 at 8:14 AM, Li Li <> wrote:
> merge will also change docid
> all segments' docId begin with 0

for all released version this is not true. Before trunk (and I think
its in 3.1 also) merge only merged continuous segments so the actual
per-segment ID might change but the global document ID doesn't if you
only add documents. But this should not be considered a feature. In
upcoming version this does not work anymore since merges can now be

Anyway, I strongly discourage to rely on lucene document IDs you
should not do this at all. Can't you use your own ID mechanism?

> 2011/3/30 Trejkaz <>:
>> On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson
>> <> wrote:
>>> I'm always skeptical of storing the doc IDs since they can
>>> change out from underneath you (just delete even a single
>>> document and optimize).
>> We never delete documents.  Even when a feature request came in to
>> update documents (i.e. delete the old one and add a new version), we
>> ended up keeping the old version around, partially because we didn't
>> want the IDs to shift (which is a bit of a recursive argument), but
>> also because it's forensically sound to have the previous versions
>> around so people can see what edits were made.
>>> What is it you're doing with the doc ID that you couldn't do with the guid? If
your "guid list"
>>> were ordered, I can imagine building filters quite quickly from
>>> it using TermDocs.skipTo for instance..
>> The main problem with filters is that DocIdBitSet's iterator has to
>> return the doc IDs in order.
>> Even if our GUIDs are in order (they would be, as it would be the
>> primary key on tables using them), they won't be in the same order as
>> the IDs of the docs they came from.  So for each row in the ResultSet,
>> you need to do a  This not only costs the
>> additional I/O (and it's a lot more than the original database query
>> was), but you have to read every row in the ResultSet just to get the
>> first doc ID.
>> Contrast this with using doc IDs for the database query.  You don't
>> need to hit the index at all since you already have the result.  And
>> the docs come back in order, so you don't even have to iterate the
>> entire result set - you can read the first 100 rows and then read more
>> rows if/when they are needed.  And if the caller is using skipTo then
>> this can be incorporated into the database query to avoid returning
>> rows which are only going to be discarded anyway.
>> Integer fields should have improved things a little in terms of the
>> amount of I/O required to do the query (at least I would hope that
>> this is the case - I haven't done any tests yet and we can't use them
>> yet for backwards compatibility reasons) but they don't remove the
>> problem of needing to iterate every document in the result set
>> up-front.
>> TX
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message