lucene-java-user mailing list archives

From Ravikumar Govindarajan <ravikumar.govindara...@gmail.com>
Subject App supplied docID in lucene possible?
Date Thu, 25 Oct 2012 10:10:03 GMT
We need to re-index some fields in our application frequently.

Our typical document consists of:

a) Many single-valued {long/int} re-indexable fields
b) Few large-valued {text/string} static fields

We have to re-index an entire document whenever a single small field
changes, and this is turning out to be a problem for us. I have gone through
the https://issues.apache.org/jira/browse/LUCENE-3837 proposal, which tries
to work around this limitation using a secondary mapping of new-to-old
doc-ids.

As I understand it, Lucene strictly maintains internal doc-id order so that
the many queries that depend on it work correctly. Segment merges also
maintain this order, as well as reclaim deleted doc-ids.

There should be many applications like ours that manage index shards,
capping a given shard by doc-id count or by size. So reclaiming
deleted doc-ids is mostly a non-issue for us.

That leaves us with changing doc-ids. How about opening up the doc-ids
themselves to applications, at least as an option for those who need it?
Taking such an approach might interleave doc-ids across segments, but within
a segment the doc-ids would always be in increasing order. There are
possibilities of ghost deletes, duplicate doc-ids, etc., but all should be
solvable, I believe.

Fronting these doc-ids during search across all segment readers and
returning the correct value from one of them should be easy. Would it incur
a heavy penalty during search? Another advantage is that cross-joining
indexes becomes trivial when doc-ids are fixed.
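
To make the "fronting" idea concrete, here is a minimal sketch in plain Java.
It does not use any Lucene API: each segment is modeled as a sorted int[] of
app-supplied doc-ids, a higher segment index is assumed to mean a newer
segment, and the names AppDocIdMerger, Hit and merge are hypothetical. The
policy of letting the newest segment win on duplicate doc-ids is also an
assumption, not something Lucene does today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene API. Each segment is a sorted int[] of
// application-supplied doc-ids; a higher segment index means a newer segment.
final class AppDocIdMerger {

    /** One merged hit: the app-supplied doc-id and the segment that owns it. */
    static final class Hit {
        final int docId;
        final int segment;
        Hit(int docId, int segment) { this.docId = docId; this.segment = segment; }
    }

    /**
     * K-way merge of per-segment sorted doc-id lists. When the same doc-id
     * appears in several segments, only the newest segment's copy is kept
     * (assumed duplicate-resolution policy).
     */
    static List<Hit> merge(int[][] segments) {
        // Queue entries are {docId, segmentIndex, positionInSegment}.
        // Order by docId ascending, then by segment descending, so the
        // newest copy of a duplicate doc-id is popped first.
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) ->
                a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) {
                queue.add(new int[] {segments[s][0], s, 0});
            }
        }
        List<Hit> hits = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int docId = top[0], seg = top[1], pos = top[2];
            // Emit only the first (newest) copy of each doc-id.
            if (hits.isEmpty() || hits.get(hits.size() - 1).docId != docId) {
                hits.add(new Hit(docId, seg));
            }
            // Advance within the segment this entry came from.
            if (pos + 1 < segments[seg].length) {
                queue.add(new int[] {segments[seg][pos + 1], seg, pos + 1});
            }
        }
        return hits;
    }
}
```

In this sketch every doc-id costs one heap operation, so merging N doc-ids
from K segments is O(N log K). With Lucene's current scheme each segment owns
a disjoint doc-id range, so no such merge is needed; the extra heap work is
one place where a search-time penalty could come from.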

There must be many other places where an app-supplied doc-id might make
Lucene behave oddly. I need some help identifying those areas, at least for
understanding this problem correctly, if not solving it altogether.

--
Ravi
