lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Lucene's internal doc ID space
Date Sat, 12 May 2012 08:36:05 GMT
On Fri, May 11, 2012 at 7:56 AM, Jong Kim <jong.lucene@gmail.com> wrote:
> When I update a document in Lucene (i.e., re-indexing), I have to delete
> the existing document, and create a new one. My understanding is that this
> assigns a new doc ID for the newly created document. If that is the case,
> is it true that the system can rather quickly run out of doc ID space
> (which is about 2 billion since doc ID data type is integer) if the update
> frequency is extremely high in an application?

Document IDs in Lucene are per segment, i.e. they are always relative
to the segment that holds the document. There is certainly a
limitation here, and it shows up in two places: 1. in the API, i.e.
all methods accepting internal doc IDs expect int, not long; 2. at the
segment level. Basically, you are going to run into problems if you
have more than Integer.MAX_VALUE documents in one index. You can work
around that if everything you do is "per-segment"; in that case the
limitation only applies to a single segment.

Running out of "ids" won't be an issue, as they are all relative to
their segment, i.e. you can update a single document forever and
never run out of IDs.
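
To make the per-segment numbering concrete, here is a minimal sketch in
plain Java. It is a toy model and not Lucene code: the class
SegmentIdModel, its methods, and the "flush every N documents" rule are
all hypothetical, and merging every segment at once is a simplifying
assumption. It only illustrates that local IDs restart at 0 in each
segment, that an index-wide ID can be derived as the segment's base
offset plus the local ID, and that a merge renumbers the surviving
documents.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-segment doc IDs. NOT Lucene code; all names are hypothetical. */
public class SegmentIdModel {

    /** A segment numbers its documents 0..maxDoc-1, independently of other segments. */
    public static final class Segment {
        final List<String> docs = new ArrayList<>();     // list position = segment-local doc ID
        final List<Boolean> deleted = new ArrayList<>(); // deletion marker per local doc ID

        int add(String doc) {
            docs.add(doc);
            deleted.add(false);
            return docs.size() - 1; // local ID, starts again at 0 in every segment
        }

        public int maxDoc() {
            return docs.size();
        }
    }

    /** Index-wide ID = sum of maxDoc of all earlier segments (the base) + local ID. */
    public static int globalId(List<Segment> segments, int segmentIndex, int localId) {
        int docBase = 0;
        for (int i = 0; i < segmentIndex; i++) {
            docBase += segments.get(i).maxDoc();
        }
        return docBase + localId;
    }

    /** Sum of maxDoc over all segments, counting deleted documents too. */
    public static int totalMaxDoc(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.maxDoc();
        }
        return total;
    }

    /** Merging copies only live documents into a new segment, renumbering from 0. */
    public static Segment merge(List<Segment> segments) {
        Segment merged = new Segment();
        for (Segment s : segments) {
            for (int i = 0; i < s.maxDoc(); i++) {
                if (!s.deleted.get(i)) {
                    merged.add(s.docs.get(i));
                }
            }
        }
        return merged;
    }

    /** Adds one document, then updates it repeatedly (delete old copy, add new copy). */
    public static List<Segment> updateRepeatedly(int updates, int flushEvery) {
        List<Segment> segments = new ArrayList<>();
        Segment current = new Segment();
        segments.add(current);
        int liveSegment = 0;
        int liveLocal = current.add("v0");
        for (int v = 1; v <= updates; v++) {
            segments.get(liveSegment).deleted.set(liveLocal, true); // delete the old copy
            if (current.maxDoc() >= flushEvery) {                   // hypothetical flush rule
                current = new Segment();
                segments.add(current);
            }
            liveLocal = current.add("v" + v);                       // add the new copy
            liveSegment = segments.size() - 1;
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Segment> segments = updateRepeatedly(1000, 100);
        System.out.println("segments: " + segments.size()
                + ", total maxDoc: " + totalMaxDoc(segments));
        for (int s = 0; s < segments.size(); s++) {
            for (int i = 0; i < segments.get(s).maxDoc(); i++) {
                if (!segments.get(s).deleted.get(i)) {
                    System.out.println("live doc: segment " + s + ", local ID " + i
                            + ", index-wide ID " + globalId(segments, s, i));
                }
            }
        }
        System.out.println("maxDoc after merge: " + merge(segments).maxDoc());
    }
}
```

In this model the ID space consumed after a merge depends only on the
number of live documents, not on how many updates were ever applied.
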
>
> So, my question is -
>
> 1. Does Lucene always increment the doc ID for newly created document
> (hence, the risk of running out of ID space) just like auto increment
> column in the database does? Or does it re-use the numbers that are
> currently not in use (i.e. those IDs that were once assigned but since
> deleted)?
>
> 2. If Lucene can recycle old IDs, it would be even better if I could force
> it to re-use a particular doc ID when updating a document by deleting old
> one and creating new one. This scheme will allow me to reference this doc
> ID from another doc in the index as if it was a foreign key value that
> doesn't change upon reindexing. I didn't see anything like this in the API,
> but is it ever possible?
>
> 3. If Lucene does not recycle old IDs, how do people deal with this issue
> when designing a system with extremely high re-indexing frequency?

The Lucene internal IDs should not be used in the application
integrating Lucene, or at least not in the way you would use an
auto-incremented primary key in a DB. You can specify your own "id"
field and reuse the IDs (you actually have to if you want to update).
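
As a rough illustration of the "own id field" idea, the following plain
Java sketch models an index where documents are addressed by an
application-level key while the internal ID changes on every update.
It is a toy model, not Lucene code: AppKeyIndexModel and its methods
are hypothetical names. In Lucene itself the corresponding operation is
IndexWriter.updateDocument, which takes a Term identifying the document
to replace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of updating by an application-level key. NOT Lucene code; names are hypothetical. */
public class AppKeyIndexModel {

    // Internal ID = position in this list; a null entry marks a deleted document.
    private final List<String> bodies = new ArrayList<>();

    // Stands in for looking up a document by the term in its "id" field.
    private final Map<String, Integer> keyToInternalId = new HashMap<>();

    /** Delete the old copy (if any) and append the new one; returns the new internal ID. */
    public int update(String key, String body) {
        Integer old = keyToInternalId.get(key);
        if (old != null) {
            bodies.set(old, null); // the old internal ID is now dead
        }
        bodies.add(body);
        int internalId = bodies.size() - 1;
        keyToInternalId.put(key, internalId);
        return internalId;
    }

    /** Resolve a document through its stable key, never through a stored internal ID. */
    public String lookup(String key) {
        Integer internalId = keyToInternalId.get(key);
        return internalId == null ? null : bodies.get(internalId);
    }

    public static void main(String[] args) {
        AppKeyIndexModel index = new AppKeyIndexModel();
        int first = index.update("user-42", "first version");
        int second = index.update("user-42", "second version");
        System.out.println("internal IDs: " + first + " then " + second);
        System.out.println("lookup by key: " + index.lookup("user-42"));
    }
}
```

A document that needs to reference another one would store the key
("user-42" here) as the foreign-key value, since the key survives
re-indexing and the internal ID does not.
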

does that make sense?

simon
>
> Thanks in advance for help
> /Jong
