lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "sonfon" <son...@gmail.com>
Subject Re: Search while indexing
Date Sun, 08 Mar 2009 04:35:09 GMT
Erick:
    Thanks for your suggestion. I think another solution would be keeping an list of keywords
that could uniquely identify a document in a database, and search for keywords before adding
a new document. As querying database is fast, this probaly wouldn't cost much time. But this
would request maintaining a database while indexing. I just wondered if lucene offers an interface
identifying duplicates. I think identifying duplicate URLs when indexing web would be common.

    Best
Wishes.
----- Original Message ----- 
From: "Erick Erickson" <erickerickson@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Saturday, March 07, 2009 10:58 PM
Subject: Re: Search while indexing


> First, you'll probably want to search the user list archive for this issue,
> as
> it's been discussed and you'll find more information than I can remember
> off the top of my head. That said:
> 
> 1> changes to an index are not visible until you reopen the reader. You
>     probably have to flush the writer in the meantime. And this will
>    be costly to do for every document.
> 
> 2> How do you identify duplicates? If it's a short enough signature,
>      you could consider keeping an in-memory list and check that
>      while indexing. If you needed to update your index you could
>      simply use TermEnum/TermDocs to read all the values into
>      memory before adding to it.
> 
> 3> You could consider using some kind of calculated signature of
>     the whole file for your key, but that may not suit your app.
> 
> Best
> Erick
> 
> 
> 
> On Sat, Mar 7, 2009 at 12:21 AM, sonfon <sonfon@gmail.com> wrote:
> 
>> Dear All,
>>    Now, I'm considering to build index for my application with lucene.
>> However, as the document sources I'm going to index has many duplications,
>> so before adding a document to an IndexWriter, I hope search in the index
>> database first to see if a same document copy has already been added. I used
>> IndexSearcher to search the same Dir while IndexWriter writing to it.
>> However, it seem that IndexSearcher returned no result though I'm sure there
>> are duplicate copies indexed already. And after the indexing procedure, I
>> can get the search results, so I'm sure I didn't write the wrong code.
>> Anyone could offer some help? Some example codes are appreciated.
>>    Best
>> Wishes.
>
Mime
View raw message