lucene-dev mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Are Non-consecutive Document IDs feasible?
Date Tue, 11 Oct 2005 16:15:22 GMT
Yes, Lucene depends on consecutive docids.

On the query side, the following things come to mind:
- for sorting, the FieldCache allocates arrays up to maxDoc()
- for deleted documents, it's a BitVector up to maxDoc()
- Some queries, like MatchAllDocsQuery, do a linear scan over every docid up
to maxDoc(), checking each one against the deleted documents
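
To make the points above concrete, here is a minimal Java sketch of why dense docids matter. It is not Lucene's actual code: the class and method names (DenseDocIdSketch, buildSortCache, countLive, cacheBytes) are hypothetical, and java.util.BitSet stands in for Lucene's BitVector. It only illustrates that per-document structures are sized by maxDoc, so their cost follows the largest docid rather than the number of documents.

```java
import java.util.BitSet;

public class DenseDocIdSketch {

    // Toy stand-in for a sort cache: one slot per docid in [0, maxDoc).
    // With gaps in the docid space, the unused slots are still allocated.
    static int[] buildSortCache(int maxDoc) {
        int[] values = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            values[doc] = doc * 10; // placeholder for a real field value
        }
        return values;
    }

    // Toy match-all scan: visit every docid below maxDoc and skip the
    // ones flagged as deleted. The work is linear in maxDoc.
    static int countLive(int maxDoc, BitSet deleted) {
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    // Bytes needed for an int-per-document array of the given size.
    static long cacheBytes(long maxDoc) {
        return maxDoc * 4L;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet(5);
        deleted.set(1);
        deleted.set(3);
        System.out.println("live docs: " + countLive(5, deleted));
        System.out.println("cache slots: " + buildSortCache(5).length);
        // Hypothetical sparse case: a few documents, but the largest
        // application-assigned id is around one billion.
        System.out.println("bytes for sparse ids: " + cacheBytes(1_000_000_000L));
    }
}
```

If the application supplied sparse ids directly, both the array and the scan would scale with the largest id, which is the concern raised in Q1 below.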

Just add a field to every document that will act as the id. If you need more
performance you could cache the mapping from external_id -> internal_id.
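
A rough sketch of the caching idea in plain Java, with no Lucene calls: it assumes the application has already read the id field for each internal docid into an array (externalIdsByDoc is a hypothetical input, as is the ExternalIdCache class itself). Internal docids can change when segments are merged, so a cache like this would presumably need to be rebuilt whenever the index is reopened.

```java
import java.util.HashMap;
import java.util.Map;

public class ExternalIdCache {

    // external (application) id -> internal docid
    private final Map<Long, Integer> externalToInternal = new HashMap<>();

    // externalIdsByDoc[doc] holds the application id stored in the id
    // field of the document whose internal docid is doc (assumed input).
    public ExternalIdCache(long[] externalIdsByDoc) {
        for (int doc = 0; doc < externalIdsByDoc.length; doc++) {
            externalToInternal.put(externalIdsByDoc[doc], doc);
        }
    }

    // Returns the internal docid, or -1 if the external id is unknown.
    public int internalId(long externalId) {
        Integer doc = externalToInternal.get(externalId);
        return doc == null ? -1 : doc;
    }

    public static void main(String[] args) {
        // Three documents with sparse, unordered application ids.
        ExternalIdCache cache =
            new ExternalIdCache(new long[] {9000L, 17L, 123456789L});
        System.out.println(cache.internalId(17L));
        System.out.println(cache.internalId(42L));
    }
}
```

The external ids can be as sparse as the application likes, because only the internal docids need to stay dense.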

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/11/05, Shane O'Sullivan <shaneosullivan1@gmail.com> wrote:
>
> Hi all,
>
> As far as I understand today, Lucene assigns docIDs to documents
> according to the order in which the documents are added to the index.
> Hence, docIDs are assigned by the engine in a sequential manner,
> without gaps. This order of document identifiers then determines the
> order of the postings in the postings lists, i.e. all postings lists
> are sorted by docID. It also means that the same document appearing in
> two different indices would probably not have the same docID (unless
> some extreme care was taken to insert documents in the same order).
>
> There are situations where the application wants to determine the
> docID for the index, i.e. to control the ordering of occurrences in
> the postings lists. This is useful to ensure, for example, that a
> document has a stable and consistent document identifier regardless
> of insertion order to an index.
>
> In either case, the application would want to pass into the index the
> numeric identifier of the document. However, such identifiers may not be
> sequential, i.e. it's possible that there would be a document with docID M
> without there being any document whose docID is M-1.
>
> Q1. How difficult would it be to change Lucene to accept the docIDs
> from the application, and not care about any possible gaps those ids
> may have? One possible problem is that since the Doc Ids could become
> very large, and are non-sequential, creating a single array for them
> all would not be feasible.
>
> Q2. Does Lucene's search code depend on the fact that document IDs are
> sequential?
>
> Thanks
>
> Shane
>
>
