lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Can I use Lucene to retrieve a list of duplicates
Date Tue, 27 Feb 2007 10:05:39 GMT

Thanks Eric/Chris for all your help

Yes the missing magic was:  

TermDocs termDocs = ir.termDocs(terms.term());

instead of 

TermDocs termDocs = ir.termDocs();

I will try and use FieldCache as well.

This is the first time Ive used Lucene, but it certainly looks like it will help me with a
number of areas in my application :)

paul

Chris Hostetter wrote:
> what you are doing below is iterating over every term in your index, and
> for each Term, recording if that term appears in more then one doc (using
> IndexReader.document which is a really bad idea ingeneral in a loop like
> this)
>
> your orriginal problem description was " 'Find Duplicate records for
> Column X' then I would filter the table to show only records where there
> is more than one with the same value i.e duplicate for that column." ...
>
> if we equate columns with fields, then what you are doing is certainly
> overkill -- you can request a TermEnum that starts at a particular field
> Or use the full TermEnum and "skipTo" the first possible temr in your
> field) and then stop iterating when you get to a new field -- you can aso
> get a lot better performance when compiling hte list of ROW_NUMBER field
> values if you use a FieldCache instead of fetching each document.
>
>
> : Date: Mon, 26 Feb 2007 16:25:11 +0000
> : From: Paul Taylor <paul_t100@fastmail.fm>
> : Reply-To: java-user@lucene.apache.org, paul_t100@fastmail.fm
> : To: Erick Erickson <erickerickson@gmail.com>
> : Cc: java-user@lucene.apache.org
> : Subject: Re: Can I use Lucene to retrieve a list of duplicates
> :
> : Hi
> :
> : I  got it working before I saw your latest mail, the only problem is
> : that it doesn't look very efficient. This is my duplicate method, the
> : problem is that I have to enumerate through *every* term. This was worse
> : before because I was only interested
> : in terms that matched a particular field (column) but had enumerate
> : through every term whatever field it was part of, so I recreated my
> : index so that each document only contained a row number field, and a
> : second field for the value of the column, however this means I am going
> : to end up with a number of different indexes each solving a particular
> : problem.
> :
> : paul
> :
> :  public List<Integer> getDuplicates()
> :     {
> :         List<Integer> matches = new ArrayList<Integer>();
> :         try
> :         {
> :             IndexReader ir = IndexReader.open(directory);
> :             TermEnum terms = ir.terms();
> :             while (terms.next())
> :             {
> :                 if (terms.docFreq() > 1)
> :                 {
> :                     TermDocs termDocs = ir.termDocs(terms.term());
> :                     while (termDocs.next())
> :                     {
> :                         Document d = ir.document(termDocs.doc());
> :                         matches.add(new
> : Integer(d.getField(ROW_NUMBER).stringValue()));
> :                     }
> :                 }
> :             }
> :
> :         }
> :         catch (IOException ioe)
> :         {
> :             ioe.printStackTrace();
> :         }
> :         return matches;
> :     }
> :
> : ---------------------------------------------------------------------
> : To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> : For additional commands, e-mail: java-user-help@lucene.apache.org
> :
>
>
>
> -Hoss
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message