lucene-java-user mailing list archives

From Ravikumar Govindarajan
Subject Re: App supplied docID in lucene possible?
Date Fri, 02 Nov 2012 11:25:25 GMT
I am aware of ExternalFileField, but the docID solution looks more elegant and

Our daily re-indexing rate is around 35-40% of index additions.

When a small int or boolean value in a Lucene document changes, I have to
re-index the entire 5-10MB of content again. This is why I am looking at
controlling Lucene's docIds.

In our case, sorting can be fully eliminated if Lucene supports
app-supplied docIds. Early query termination should also be possible with
such an approach.
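
To make the early-termination point concrete, here is a minimal sketch in
plain Java. It uses no Lucene APIs. It assumes the application hands out
docIds so that ascending docId order is already the desired result order,
and it models a posting list as an ascending int array. The class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyTermination {
    /**
     * Intersects two ascending posting lists (an AND query) and stops as soon
     * as k hits are collected. Because docId order is assumed to equal the
     * sort order, the first k hits are the top k and no sort step is needed.
     */
    static List<Integer> topK(int[] a, int[] b, int k) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length && hits.size() < k) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;   // advance whichever list is behind
            } else {
                j++;
            }
        }
        return hits;
    }
}
```

The loop never looks at postings past the k-th match, which is the saving
I am after. It only holds if the docId order really is the sort order.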

I know that IndexReader, SegmentMerger and IndexWriter will be affected. I
would like to know what other areas of Lucene would be affected by such
an approach.


On Thu, Oct 25, 2012 at 8:20 PM, Jack Krupansky wrote:

> Have you looked at or decided against an approach like Solr's
> ExternalFileField?
> See:
> solr/schema/ExternalFileField.html
> Is that at least the kind of issue you are trying to deal with?
> One final question: How much of a document's field values are stable vs.
> frequently changing? What are the numbers here - total field count, count
> of frequently changed fields, and percentage of documents being updated in
> some period of time?
> And, I don't quite follow why you can't just use a unique key for a
> document rather than the low-level Lucene document id.
> -- Jack Krupansky
> -----Original Message----- From: Ravikumar Govindarajan
> Sent: Thursday, October 25, 2012 6:10 AM
> To:
> Subject: App supplied docID in lucene possible?
> We have the need to re-index some fields in our application frequently.
> Our typical document consists of
> a) Many single-valued {long/int} re-indexable fields
> b) Few large-valued {text/string} static fields
> We have to re-index an entire document if a single smallish field changes
> and it is turning out to be a problem for us. I have gone through the
> where it tries
> to work-around this limitation using a secondary mapping of new-old docids.
> As I understand it, Lucene strictly maintains internal doc-id order so that
> the many queries that depend on it work correctly. Segment merges also
> maintain that order and reclaim deleted doc-ids.
> There must be many applications like ours that manage index shards,
> limiting a given shard based on doc-id limits or size. So reclaiming
> deleted doc-ids is mostly a non-issue for us.
> That leaves us with changing doc-ids. How about leaving the doc-ids
> themselves open to applications, at least as an option for those who need
> it? Such an approach might interleave doc-ids across segments, but within a
> segment the docIds would always be in increasing order. There are
> possibilities of ghost-deletes, duplicate docIds etc., but all should be
> solvable, I believe.
> Fronting these doc-ids during search from all segment readers and returning
> the correct value from one of them should be easy. Will it incur a heavy
> penalty during search? Another advantage is that cross-joining indexes
> becomes trivial when docIDs are fixed.
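
A rough sketch of what "fronting" could look like, in plain Java with no
Lucene APIs. The assumptions are mine: each segment reports its matching
docIds in ascending order, a higher segment index means a newer segment,
and a copy of a docId in a newer segment supersedes any older copy whether
or not the newer copy matches the query (this covers both duplicates and
ghost-deletes). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

final class SegmentFront {
    /**
     * hits[s]     : ascending docIds that match the query in segment s.
     * docs.get(s) : every docId stored in segment s, matching or not.
     * Returns {docId, segment} pairs in ascending docId order.
     */
    static List<int[]> merge(int[][] hits, List<Set<Integer>> docs) {
        // Each cursor is {segment, position}; the smallest current docId is polled first.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (x, y) -> Integer.compare(hits[x[0]][x[1]], hits[y[0]][y[1]]));
        for (int s = 0; s < hits.length; s++) {
            if (hits[s].length > 0) pq.add(new int[] {s, 0});
        }
        List<int[]> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] c = pq.poll();
            int seg = c[0];
            int docId = hits[seg][c[1]];
            // Drop the hit if any newer segment holds a copy of this docId.
            boolean superseded = false;
            for (int n = seg + 1; n < docs.size() && !superseded; n++) {
                superseded = docs.get(n).contains(docId);
            }
            if (!superseded) out.add(new int[] {docId, seg});
            // Advance this segment's cursor.
            if (c[1] + 1 < hits[seg].length) pq.add(new int[] {seg, c[1] + 1});
        }
        return out;
    }
}
```

The extra cost per hit is one membership check per newer segment, which is
the part of the search penalty this sketch makes visible. A real
implementation would have to do this inside the scorers rather than over
materialized arrays.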
> There must be many other places where an app-supplied docId might make
> Lucene behave oddly. I need some help identifying those areas, at least for
> understanding this problem correctly, if not solving it altogether.
> --
> Ravi
