lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: App supplied docID in lucene possible?
Date Thu, 25 Oct 2012 14:50:07 GMT
Have you looked at or decided against an approach like Solr's 
ExternalFileField?

See:
http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/schema/ExternalFileField.html

Is that at least the kind of issue you are trying to deal with?

One final question: How much of a document's field values are stable vs. 
frequently changing? What are the numbers here - total field count, count of 
frequently changed fields, and percentage of documents being updated in some 
period of time?

And, I don't quite follow why you can't just use a unique key for a document 
rather than the low-level Lucene document id.
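
To make the unique-key suggestion concrete, here is a minimal sketch in plain
Java. It is a toy model and not Lucene code: the class KeyedIndex and its
methods are hypothetical names made up for illustration. It mimics the
delete-then-add behaviour you get from IndexWriter.updateDocument(Term,
Document): the internal id changes on every update, while a lookup by the
application's own key keeps working.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not Lucene code: internal ids are handed out
// sequentially and change on every re-index; the application key stays fixed.
class KeyedIndex {
    private final Map<String, Integer> keyToInternalId = new HashMap<>();
    private final Map<Integer, Map<String, Long>> docsByInternalId = new HashMap<>();
    private int nextInternalId = 0;

    // Delete-then-add under the same key; returns the new internal id.
    int update(String key, Map<String, Long> fields) {
        Integer old = keyToInternalId.get(key);
        if (old != null) {
            docsByInternalId.remove(old); // the old copy is gone
        }
        int id = nextInternalId++;
        docsByInternalId.put(id, new HashMap<>(fields));
        keyToInternalId.put(key, id);
        return id;
    }

    // Look a document up by its stable application key; null if absent.
    Map<String, Long> lookup(String key) {
        Integer id = keyToInternalId.get(key);
        return id == null ? null : docsByInternalId.get(id);
    }

    // Current internal id for a key, or -1 if the key is unknown.
    int internalId(String key) {
        Integer id = keyToInternalId.get(key);
        return id == null ? -1 : id;
    }

    public static void main(String[] args) {
        KeyedIndex index = new KeyedIndex();
        Map<String, Long> fields = new HashMap<>();
        fields.put("flags", 1L);
        int before = index.update("msg-1", fields);
        fields.put("flags", 2L);
        int after = index.update("msg-1", fields);
        System.out.println("internal id before=" + before + " after=" + after);
        System.out.println("flags by key=" + index.lookup("msg-1").get("flags"));
    }
}
```

The point of the sketch is only that the application never needs to hold on
to the internal id. It does not address the cost of re-indexing the large
static fields, which is the part of the problem that remains.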

-- Jack Krupansky

-----Original Message----- 
From: Ravikumar Govindarajan
Sent: Thursday, October 25, 2012 6:10 AM
To: java-user@lucene.apache.org
Subject: App supplied docID in lucene possible?

We need to re-index some fields in our application frequently.

Our typical document consists of:

a) Many single-valued {long/int} re-indexable fields
b) A few large-valued {text/string} static fields

We have to re-index an entire document whenever a single smallish field
changes, and this is turning out to be a problem for us. I have gone through
the https://issues.apache.org/jira/browse/LUCENE-3837 proposal, which tries
to work around this limitation using a secondary mapping of new-old docids.

As I understand it, Lucene strictly maintains internal doc-id order so that
the many queries that depend on it work correctly. Segment merges also
maintain that order, as well as reclaiming deleted doc-ids.

There should be many applications like ours that manage index shards,
limiting a given shard based on doc-id limits or size. So reclaiming
deleted doc-ids is mostly a non-issue for us.

That leaves us with changing doc-ids. How about opening up the doc-ids
themselves to applications, at least as an option for those who need it?
Taking such an approach might interleave doc-ids across segments, but within
a segment the docIds would always be in increasing order. There are
possibilities of ghost-deletes, duplicate docIds, etc., but all should be
solvable, I believe.

Fronting these doc-ids during search from all segment readers and returning
the correct value from one of them should be easy. Will it incur a heavy
penalty during search? Another advantage gained is the triviality of
cross-joining indexes when docIDs are fixed.
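
A rough sketch of what "fronting" could look like, in plain Java. This is a
toy model under my own assumptions, not Lucene code: AppSegment and
AppDocIdResolver are hypothetical names, each segment keeps its
app-supplied docIds in increasing order, and a newer segment supersedes
older copies of the same docId.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy model of segments holding application-supplied docIds.
// None of these classes exist in Lucene.
class AppSegment {
    final int[] docIds;      // strictly increasing within a segment
    final boolean[] deleted; // parallel to docIds

    AppSegment(int[] sortedDocIds) {
        this.docIds = sortedDocIds.clone();
        this.deleted = new boolean[sortedDocIds.length];
    }

    // Slot of docId in this segment, or -1 if the segment does not hold it.
    int slotOf(int docId) {
        int slot = Arrays.binarySearch(docIds, docId);
        return slot >= 0 ? slot : -1;
    }
}

class AppDocIdResolver {
    private final List<AppSegment> segments = new ArrayList<>(); // oldest first

    // Adding a segment marks older copies of its docIds as deleted,
    // so at most one live copy of each docId remains (no duplicates).
    void addSegment(AppSegment seg) {
        for (int docId : seg.docIds) {
            for (AppSegment older : segments) {
                int slot = older.slotOf(docId);
                if (slot >= 0) {
                    older.deleted[slot] = true;
                }
            }
        }
        segments.add(seg);
    }

    // Index of the segment holding the live copy of docId, or -1 if none.
    int liveSegmentOf(int docId) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            AppSegment seg = segments.get(i);
            int slot = seg.slotOf(docId);
            if (slot >= 0 && !seg.deleted[slot]) {
                return i;
            }
        }
        return -1;
    }

    // Live docIds in globally increasing order, mapped to their segment index.
    TreeMap<Integer, Integer> liveDocIdsInOrder() {
        TreeMap<Integer, Integer> live = new TreeMap<>();
        for (int i = 0; i < segments.size(); i++) {
            AppSegment seg = segments.get(i);
            for (int slot = 0; slot < seg.docIds.length; slot++) {
                if (!seg.deleted[slot]) {
                    live.put(seg.docIds[slot], i);
                }
            }
        }
        return live;
    }

    public static void main(String[] args) {
        AppDocIdResolver resolver = new AppDocIdResolver();
        resolver.addSegment(new AppSegment(new int[] {1, 5, 9}));
        resolver.addSegment(new AppSegment(new int[] {3, 5, 12})); // 5 re-indexed
        System.out.println("segment of docId 5: " + resolver.liveSegmentOf(5));
        System.out.println("live docIds: " + resolver.liveDocIdsInOrder());
    }
}
```

In this model, resolving one docId costs a binary search per segment, and
producing hits in global docId order needs a merge across segments because
the ids interleave. Whether that amounts to a heavy penalty in a real index
is exactly the open question; the sketch does not measure it.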

There must be many other places where an app-supplied docId might make
Lucene behave funny. I need some help in identifying those areas, at least
for understanding this problem correctly, if not solving it altogether.

--
Ravi 



