lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Shane <sha...@LEXUM.UMontreal.CA>
Subject Re: Is there a way to check for field "uniqueness" when indexing?
Date Thu, 13 Aug 2009 14:52:37 GMT
Users can index really a lot of stuff, so I'd like not to keep things in 
memory for too long.

Even if I keep a set of things added, how do I know if something has 
been deleted via a delete? It seems rather difficult to keep this set of 
documents added in sync with the index reader on the index (before it 
has been written to).

What I'd like is to have an access to the stuff the index writer has 
written but not yet commited. Is there something that can access that data?

Daniel Shane

Shai Erera wrote:
> How many documents do you index between you refresh a reader? If it's not
> too much, I'd keep a Set of those terms and check every incoming document in
> the set and then the reader.
>
> Note that the set keeps only just the terms of those documents your reader
> doesn't see. You should clear() it after you've refreshed your reader.
>
> In 2.9, IndexWriter will expose a getReader(), so you might be able to use
> it, by checking on its reader and the on disk reader.
>
> If it's possible, I think I'd prefer the first approach.
>
> Shai
>
> On Thu, Aug 13, 2009 at 5:33 PM, Daniel Shane <shaned@lexum.umontreal.ca>wrote:
>
>   
>> Hi all!
>>
>> I'm currently running a big lucene index and one of my main concerns is the
>> integrity of the data entered. A few things come to mind, like enforcing
>> that certain fields be non-blank, forcing certain formats etc...
>>
>> All these validations are easy to do with lucene, since I can validate the
>> document before it is indexed or when it is retrieved.
>>
>> The thing however that I have a hard time with, is field uniquness.
>>
>> Lets say I have a field and I really want it to be unique. I can't seem to
>> find out how to do it during the indexation phase since everything that is
>> added to the index is not readable by an index reader until the index is
>> closed.
>>
>> Add to that the fact that items can be deleted from the index during the
>> indexation and the only way I have to figure uniquness is to check every
>> unique field values using termEnums and checking for docFreq.
>>
>> This has a major disadvantage that I cannot inform people who are using the
>> library of the unique conflit when it happens, only when the index is
>> closed.
>>
>> Does anyone have an idea on how I could check an index that is in the
>> process of being indexed (things added, things deleted) for the uniquess of
>> a given field *at the time I index a document* ?
>>
>> Daniel Shane
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message