lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ion Badita <ion.bad...@searchcapital.net>
Subject Re: Unique Fields
Date Thu, 13 Mar 2008 07:24:54 GMT
My unique is more like synonym. For instance: Brain cancer, Cancer of 
the brain, Brain neoplasm, are the same, so i need to tokenize the title 
remove the stop words etc.

I have a problem with the indexing... with a new title first i have to 
search in the index, if the title is not found write it to the index, 
flush the indexwriter and recreate the searcher (to reflect the 
changes). Is there a way to avoid recreating the Searcher/Reader for 
every new title inserted into the index?


Thank you
John




Erick Erickson wrote:
> So, you're tokenizing the title field? If so, I don't understand how you
> expect
> this to work. Would the title "this is one order" and "is one order this" be
> considered
> identical? Would capitalization matter? Punctuation? Throwing all the terms
> of a title into a tokenized field and expecting some magic to keep
> duplicates
> is beyond the scope of Lucene, you'll have to roll some customized solution.
>
> For instance, index your title UN_TOKENIZED in a duplicate field (after
> applying
> whatever massaging you want re: punctuation, spaces, etc.). Use
> TermDocs/TermEnum
> on that field to detect duplicates. You won't search on this field....
>
> Or create a hash of the title and index *that* in a separate field and check
> against
> the hash with termenum/terndocs. Or.....
>
> But no, there's no magic that makes Lucene DWIM (Do What I Mean)...
>
> Best
> Erick
>
> On Wed, Mar 12, 2008 at 2:01 AM, Ion Badita <ion.badita@searchcapital.net>
> wrote:
>
>   
>> The "problem" is that my unique field is a title, many terms per field.
>> I want to make an index with titles and i don't want to have duplicates.
>>
>> John
>>
>>
>> Erick Erickson wrote:
>>     
>>> You can easily find whether a term is in the index with
>>>       
>> TermEnum/TermDocs
>>     
>>> (I think TermEnum is all you really need).
>>>
>>> Except, you'll probably also have to keep an internal map of IDs added
>>>       
>> since
>>     
>>> the searcher was opened and check against that too.
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita <
>>>       
>> ion.badita@searchcapital.net>
>>     
>>> wrote:
>>>
>>>
>>>       
>>>> Hi,
>>>>
>>>> I want to create an index with one unique field.
>>>> Before inserting a document i must be sure that "unique field" is
>>>>         
>> unique.
>>     
>>>>
>>>> John
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>         
>>>       
>>     
>
>   


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message