lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Is there a way to check for field "uniqueness" when indexing?
Date Thu, 13 Aug 2009 15:00:35 GMT
In 2.9 there will be - IndexWriter#getReader().

BTW, note that even if someone deletes, your reader may not see this delete.
If you use IndexWriter to delete docs, the open reader won't see those
deletes. So you may still have a problem.

I don't know how much stuff users can index, and how often you plan to
commit. But I'd like to think humans can't generate that much traffic in a
short period of time (say every couple of minutes). If they do, you might
want to commit anyway, so they'll be able to search on their newly added
data?

Anyway, I'll give it some more thought.

Shai

On Thu, Aug 13, 2009 at 5:52 PM, Daniel Shane <shaned@lexum.umontreal.ca>wrote:

> Users can index really a lot of stuff, so I'd like not to keep things in
> memory for too long.
>
> Even if I keep a set of things added, how do I know if something has been
> deleted via a delete? It seems rather difficult to keep this set of
> documents added in sync with the index reader on the index (before it has
> been written to).
>
> What I'd like is to have an access to the stuff the index writer has
> written but not yet commited. Is there something that can access that data?
>
> Daniel Shane
>
>
> Shai Erera wrote:
>
>> How many documents do you index between you refresh a reader? If it's not
>> too much, I'd keep a Set of those terms and check every incoming document
>> in
>> the set and then the reader.
>>
>> Note that the set keeps only just the terms of those documents your reader
>> doesn't see. You should clear() it after you've refreshed your reader.
>>
>> In 2.9, IndexWriter will expose a getReader(), so you might be able to use
>> it, by checking on its reader and the on disk reader.
>>
>> If it's possible, I think I'd prefer the first approach.
>>
>> Shai
>>
>> On Thu, Aug 13, 2009 at 5:33 PM, Daniel Shane <shaned@lexum.umontreal.ca
>> >wrote:
>>
>>
>>
>>> Hi all!
>>>
>>> I'm currently running a big lucene index and one of my main concerns is
>>> the
>>> integrity of the data entered. A few things come to mind, like enforcing
>>> that certain fields be non-blank, forcing certain formats etc...
>>>
>>> All these validations are easy to do with lucene, since I can validate
>>> the
>>> document before it is indexed or when it is retrieved.
>>>
>>> The thing however that I have a hard time with, is field uniquness.
>>>
>>> Lets say I have a field and I really want it to be unique. I can't seem
>>> to
>>> find out how to do it during the indexation phase since everything that
>>> is
>>> added to the index is not readable by an index reader until the index is
>>> closed.
>>>
>>> Add to that the fact that items can be deleted from the index during the
>>> indexation and the only way I have to figure uniquness is to check every
>>> unique field values using termEnums and checking for docFreq.
>>>
>>> This has a major disadvantage that I cannot inform people who are using
>>> the
>>> library of the unique conflit when it happens, only when the index is
>>> closed.
>>>
>>> Does anyone have an idea on how I could check an index that is in the
>>> process of being indexed (things added, things deleted) for the uniquess
>>> of
>>> a given field *at the time I index a document* ?
>>>
>>> Daniel Shane
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message