lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Can I use Lucene to retrieve a list of duplicates
Date Mon, 26 Feb 2007 22:08:27 GMT

what you are doing below is iterating over every term in your index, and
for each Term, recording if that term appears in more than one doc (using
IndexReader.document, which is a really bad idea in general in a loop like
this)

your original problem description was " 'Find Duplicate records for
Column X' then I would filter the table to show only records where there
is more than one with the same value i.e duplicate for that column." ...

if we equate columns with fields, then what you are doing is certainly
overkill -- you can request a TermEnum that starts at a particular field
(or use the full TermEnum and "skipTo" the first possible term in your
field) and then stop iterating when you get to a new field -- you can also
get a lot better performance when compiling the list of ROW_NUMBER field
values if you use a FieldCache instead of fetching each document.
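
to make the shape of that concrete, here is a rough sketch that models the
idea with plain java.util collections rather than the Lucene API. everything
in it is a stand-in invented for illustration: the Term class, the
FieldDuplicates class, the termIndex map (playing the role of
TermEnum/TermDocs) and the rowNumbers array (playing the role of what a
FieldCache would hand back). in real code the seek would presumably be
something like IndexReader.terms(new Term(field, "")) and the array would
come from FieldCache.DEFAULT, but check both against the javadocs for your
version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in model only: none of these types are Lucene classes.
class FieldDuplicates {

    // Mimics a Lucene Term: sorted by field name first, then by term text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // termIndex  : sorted term -> ids of the docs containing it (role of TermEnum + TermDocs)
    // rowNumbers : doc id -> row number, loaded once up front (role of a FieldCache array)
    static List<Integer> duplicates(TreeMap<Term, int[]> termIndex, int[] rowNumbers, String field) {
        List<Integer> matches = new ArrayList<>();
        // Seek straight to the first possible term of the field instead of the first term overall.
        for (Map.Entry<Term, int[]> entry : termIndex.tailMap(new Term(field, ""), true).entrySet()) {
            if (!entry.getKey().field.equals(field)) {
                break; // terms are sorted by field, so a new field means we are done
            }
            int[] docs = entry.getValue();
            if (docs.length > 1) { // same check as docFreq() > 1
                for (int doc : docs) {
                    matches.add(rowNumbers[doc]); // array lookup, no stored document is loaded
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<Term, int[]> termIndex = new TreeMap<>();
        termIndex.put(new Term("album", "x"), new int[] {0, 1, 2, 3});
        termIndex.put(new Term("artist", "abba"), new int[] {0, 2});
        termIndex.put(new Term("artist", "beck"), new int[] {1});
        termIndex.put(new Term("title", "y"), new int[] {1, 3});
        int[] rowNumbers = {10, 11, 12, 13};
        System.out.println(duplicates(termIndex, rowNumbers, "artist")); // prints [10, 12]
    }
}
```

the "album" and "title" terms are never examined even though they also
appear in more than one doc, which is the point: the work is proportional to
the number of terms in the one field you care about, and one index can serve
every column.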


: Date: Mon, 26 Feb 2007 16:25:11 +0000
: From: Paul Taylor <paul_t100@fastmail.fm>
: Reply-To: java-user@lucene.apache.org, paul_t100@fastmail.fm
: To: Erick Erickson <erickerickson@gmail.com>
: Cc: java-user@lucene.apache.org
: Subject: Re: Can I use Lucene to retrieve a list of duplicates
:
: Hi
:
: I got it working before I saw your latest mail, the only problem is
: that it doesn't look very efficient. This is my duplicate method, the
: problem is that I have to enumerate through *every* term. This was worse
: before because I was only interested
: in terms that matched a particular field (column) but had to enumerate
: through every term whatever field it was part of, so I recreated my
: index so that each document only contained a row number field, and a
: second field for the value of the column, however this means I am going
: to end up with a number of different indexes each solving a particular
: problem.
:
: paul
:
:  public List<Integer> getDuplicates()
:     {
:         List<Integer> matches = new ArrayList<Integer>();
:         try
:         {
:             IndexReader ir = IndexReader.open(directory);
:             TermEnum terms = ir.terms();
:             while (terms.next())
:             {
:                 if (terms.docFreq() > 1)
:                 {
:                     TermDocs termDocs = ir.termDocs(terms.term());
:                     while (termDocs.next())
:                     {
:                         Document d = ir.document(termDocs.doc());
:                         matches.add(new Integer(d.getField(ROW_NUMBER).stringValue()));
:                     }
:                 }
:             }
:
:         }
:         catch (IOException ioe)
:         {
:             ioe.printStackTrace();
:         }
:         return matches;
:     }
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss

