lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chuck Williams <ch...@allthingslocal.com>
Subject Re: globally unique field
Date Tue, 19 Apr 2005 23:56:14 GMT
Mike Baranczak wrote:

> First of all, a big thanks to all the Lucene hackers - I've only been 
> using your product for a couple of weeks, and I've been very impressed 
> by what I've seen.
>
> Here's my question: I have an index with a little over 3 million 
> documents in it, with more on the way. Each document has an "URL" 
> field (which is not indexed). I want to guarantee that each URL is 
> unique; that is, when I'm adding a new document, I have to check if 
> another existing document has the same value for the URL field. What's 
> the best way to do it? I can think of two possible approaches:
>
> 1 - Open an IndexReader and iterate over all the Documents that it 
> contains, checking the value of the "URL" field for each Document. 
> This seems a little inefficient, since I only care about one field, 
> and I don't want to have to retrieve all of the fields.
>
> 2 - Rebuild the index such that the URL field is indexed. Then, I 
> could just do a normal search for the value of the URL. But since the 
> URL field will never be searched under any other circumstances, this 
> seems like kind of a waste of disk space.
>
> I'm sure somebody else has had to do something like this before. Is 
> there a better way to do it than what I've described above? If not, 
> then which of the two approaches will give me the best results?

I'd suggest putting the URL into the index, untokenzied, and then use 
IndexReader.terms(<Term for you new URL>). This will seek to precisely 
the URL you want or the one just after it, and do it quickly. This will 
be orders of magnitude faster than approach 1, and much faster than 
using the search API if that is what you intended by approach 2. The 
disk space should not be noticeable unless your documents are very very 
small.

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message