lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Trejkaz <>
Subject Re: a faster way to addDocument and get the ID just added?
Date Wed, 30 Mar 2011 01:46:43 GMT
On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson
<> wrote:
> I'm always skeptical of storing the doc IDs since they can
> change out from underneath you (just delete even a single
> document and optimize).

We never delete documents.  Even when a feature request came in to
update documents (i.e. delete the old one and add a new version), we
ended up keeping the old version around, partially because we didn't
want the IDs to shift (which is a bit of a recursive argument), but
also because it's forensically sound to have the previous versions
around so people can see what edits were made.

> What is it you're doing with the doc ID that you couldn't do with the guid? If your "guid
> were ordered, I can imagine building filters quite quickly from
> it using TermDocs.skipTo for instance..

The main problem with filters is that DocIdBitSet's iterator has to
return the doc IDs in order.

Even if our GUIDs are in order (they would be, as it would be the
primary key on tables using them), they won't be in the same order as
the IDs of the docs they came from.  So for each row in the ResultSet,
you need to do a  This not only costs the
additional I/O (and it's a lot more than the original database query
was), but you have to read every row in the ResultSet just to get the
first doc ID.

Contrast this with using doc IDs for the database query.  You don't
need to hit the index at all since you already have the result.  And
the docs come back in order, so you don't even have to iterate the
entire result set - you can read the first 100 rows and then read more
rows if/when they are needed.  And if the caller is using skipTo then
this can be incorporated into the database query to avoid returning
rows which are only going to be discarded anyway.

Integer fields should have improved things a little in terms of the
amount of I/O required to do the query (at least I would hope that
this is the case - I haven't done any tests yet and we can't use them
yet for backwards compatibility reasons) but they don't remove the
problem of needing to iterate every document in the result set


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message