lucene-java-user mailing list archives

From Simon Willnauer
Subject Re: a faster way to addDocument and get the ID just added?
Date Thu, 31 Mar 2011 19:10:01 GMT
Hey Ian,

On Thu, Mar 31, 2011 at 11:32 AM, Ian Lea wrote:
>>> Subject: a faster way to addDocument and get the ID just added?
> Might it be possible to come up with a version of
> IndexWriter.addDocument() that returns the docid rather than void?
> Answering that question is way out of my league, but it would
> presumably be quick.
With the current trunk I think we could do that, since doc IDs are
assigned in DocumentsWriter and we only have one instance of it,
even though we index into multiple in-memory segments and merge them
on flush. But (yeah, there must be a but :) we are working on
DocumentsWriterPerThread to exploit extra concurrency for 4.0, where
this doesn't work anymore, since we are indexing a segment per thread
that is not merged on flush but written directly to disk.
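
To make the difference concrete, here is a toy model in plain Java. It
is not Lucene code: SharedWriter and PerThreadWriter are hypothetical
names standing in for the single DocumentsWriter and for the planned
per-thread writers. It only shows why one shared counter can hand back
a meaningful ID while independent per-thread segments cannot.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DocIdAssignmentSketch {

    /** Hypothetical stand-in for one shared writer: a single counter, so IDs are global. */
    static final class SharedWriter {
        private final AtomicInteger nextDocId = new AtomicInteger();

        /** Returns the ID assigned to the document (the document itself is ignored here). */
        int addDocument(String doc) {
            return nextDocId.getAndIncrement();
        }
    }

    /** Hypothetical stand-in for a per-thread writer: numbers its own private segment from 0. */
    static final class PerThreadWriter {
        private int nextLocalDocId = 0;

        /** Returns an ID that is only meaningful inside this writer's segment. */
        int addDocument(String doc) {
            return nextLocalDocId++;
        }
    }

    public static void main(String[] args) {
        SharedWriter shared = new SharedWriter();
        System.out.println(shared.addDocument("a")); // prints 0
        System.out.println(shared.addDocument("b")); // prints 1

        // Two indexing threads, each with its own writer and segment.
        PerThreadWriter t1 = new PerThreadWriter();
        PerThreadWriter t2 = new PerThreadWriter();
        System.out.println(t1.addDocument("a")); // prints 0
        System.out.println(t2.addDocument("b")); // prints 0 as well: same local ID, different doc
    }
}
```

In the second case the returned number says nothing about the global
doc ID, because that depends on where the segment ends up once it is
written out.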

With 4.0 we will also have Column Stride Fields, which might help with
this issue: they let you store your own doc IDs in fast, integrated,
column-based storage. It might not be as fast as using docIds
directly, but it should be reasonable since it can be accessed with a
low-footprint iterator during scoring.
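
Roughly, the idea looks like the sketch below. This is plain Java and
not the Column Stride Fields API (which is still being worked on);
IdColumn and its methods are hypothetical names. It assumes one column
per segment where slot i holds your application ID for local doc i.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;

public class IdColumnSketch {

    /** Hypothetical per-segment column: slot i holds the application ID of local doc i. */
    static final class IdColumn {
        private long[] values = new long[16];
        private int size = 0;

        /** Stores the application ID for the next local doc and returns that doc's local index. */
        int append(long appId) {
            if (size == values.length) {
                values = Arrays.copyOf(values, size * 2); // grow the backing array
            }
            values[size] = appId;
            return size++;
        }

        /** Random access: application ID for a given local doc. */
        long get(int localDocId) {
            if (localDocId < 0 || localDocId >= size) {
                throw new IndexOutOfBoundsException("no such local doc: " + localDocId);
            }
            return values[localDocId];
        }

        /** Sequential access over the stored IDs without boxing or copying the column. */
        PrimitiveIterator.OfLong iterator() {
            return Arrays.stream(values, 0, size).iterator();
        }
    }

    public static void main(String[] args) {
        IdColumn column = new IdColumn();
        column.append(1001L);
        column.append(1002L);
        System.out.println(column.get(1)); // prints 1002
    }
}
```

The point is that the application ID travels with the segment, so it
survives merges, and a lookup costs one array read per hit.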

Maybe that helps once 4.0 is there.

> --
> Ian.
> On Thu, Mar 31, 2011 at 6:34 AM, Trejkaz wrote:
>> On Wed, Mar 30, 2011 at 8:21 PM, Simon Willnauer wrote:
>>> Before trunk (and I think
>>> it's in 3.1 also) merge only merged contiguous segments, so the actual
>>> per-segment ID might change but the global document ID doesn't if you
>>> only add documents. But this should not be considered a feature. In
>>> upcoming versions this does not work anymore since merges can now be
>>> non-contiguous.
>> This myth was busted some time ago:
>> Summary: selecting segments to merge is decided by MergePolicy, and a
>> MergePolicy which does not upset ordering will remain in existence.
>>> Anyway, I strongly discourage relying on Lucene document IDs; you
>>> should not do this at all. Can't you use your own ID mechanism?
>> This has pretty much already been covered in my reply to the previous
>> person that suggested that solution, not to mention in the initial
>> email which started the thread.
>> Summary: the overheads are simply not acceptable.
>> So far the only remotely helpful suggestion I have heard anywhere is
>> to keep two gigantic int[] arrays in memory, mapping the IDs in each
>> direction.  This would work if we had an infinite amount of memory to
>> play with, but unfortunately we don't.  1 billion item indexes are
>> expected to work, and we can't just tell everyone to buy 8 GB more RAM
>> when we update to the next version of our app.  If we were a
>> server-side app, *maybe* we could...
>> TX
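For reference, the 8 GB figure follows from two int[] arrays of one
billion entries at 4 bytes each. A small sketch of that mapping and the
arithmetic, assuming dense IDs on both sides (the class and method
names are made up for illustration):

```java
public class IdMapMemorySketch {

    /** Bytes needed for two int[] arrays of docCount entries, ignoring array headers. */
    static long bytesForTwoWayIntMapping(long docCount) {
        return 2L * docCount * Integer.BYTES;
    }

    public static void main(String[] args) {
        // Tiny two-way mapping: index IDs and application IDs are both dense, 0..n-1.
        int[] indexToApp = {2, 0, 1};
        int[] appToIndex = new int[indexToApp.length];
        for (int indexId = 0; indexId < indexToApp.length; indexId++) {
            appToIndex[indexToApp[indexId]] = indexId; // invert the forward mapping
        }
        System.out.println(appToIndex[2]); // prints 0

        // One billion documents.
        System.out.println(bytesForTwoWayIntMapping(1_000_000_000L)); // prints 8000000000
    }
}
```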
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: