lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Can I use Lucene to retrieve a list of duplicates
Date Mon, 26 Feb 2007 13:10:03 GMT
Here's an excerpt from something I wrote to enumerate all the terms for a
field. I hacked out some of my tracing, so it may not even compile <G>.

Basically, change the line "if (td.next())" to "while (td.next())". Each
pass through that loop is one document containing the current term
(td.doc() gives you its Lucene doc id), so whenever the loop runs more
than once for a term, you have duplicates for that particular term.

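  private void enumField(String field) throws Exception
    {
        long start = System.currentTimeMillis();
        // terms(Term) positions the enumeration at the first term >= (field, ""),
        // so term() is valid right away; don't call next() before using it.
        TermEnum termEnum = this.reader.getIndexReader().terms(new Term(field, ""));

        this.writer.println("");
        this.writer.println("");
        this.writer.println("");
        this.writer.println("Values for term " + field);

        TermDocs td = this.reader.getIndexReader().termDocs();
        Term term = termEnum.term();
        int idx = 0; // number of terms seen in this field
        int jdx = 0; // number of terms that have at least one document

        while ((term != null) && term.field().equals(field)) {

            // Look up the documents for the current term *before* advancing,
            // otherwise the first term is skipped and the last seek() sees
            // an exhausted enumeration.
            td.seek(termEnum);

            if (td.next()) {
                ++jdx;
            }

            termEnum.next();
            term = termEnum.term();
            ++idx;
        }

        td.close();
        termEnum.close();
    }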
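
For anyone who wants to see the idea without the Lucene plumbing, here is a
minimal plain-Java sketch of what the "while (td.next())" loop amounts to:
group the doc ids by term for one column, then keep only the terms that
have more than one doc id. This is only an illustration under stated
assumptions. The names DuplicateTerms, postings and duplicates are made up
for this sketch and are not Lucene API, a sorted map stands in for the
term dictionary, and the position in the input list stands in for the doc id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an index on one column; not Lucene API.
class DuplicateTerms {

    // Builds term -> doc ids. The doc id is assumed to be the row's
    // position in the list; terms come back in sorted order.
    static TreeMap<String, List<Integer>> postings(List<String> columnValues) {
        TreeMap<String, List<Integer>> byTerm = new TreeMap<>();
        for (int docId = 0; docId < columnValues.size(); docId++) {
            byTerm.computeIfAbsent(columnValues.get(docId), k -> new ArrayList<>())
                  .add(docId);
        }
        return byTerm;
    }

    // Keeps only the terms that occur in more than one document,
    // i.e. the terms for which the inner loop would run more than once.
    static Map<String, List<Integer>> duplicates(List<String> columnValues) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : postings(columnValues).entrySet()) {
            if (e.getValue().size() > 1) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("red", "blue", "red", "green", "blue", "red");
        // prints {blue=[1, 4], red=[0, 2, 5]}
        System.out.println(duplicates(column));
    }
}
```

In the real index you would not build this map yourself; the term
enumeration already gives you the terms in order, and the doc ids for each
term come from the TermDocs loop.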


Erick

On 2/26/07, Paul Taylor <paul_t100@fastmail.fm> wrote:
>
> Hi,
>
> Sorry, I don't see how I get access to TermEnums. So far I've created a
> document per row: the first field holds the row id, then I have one
> field per column. I've checked that the index was created OK with some
> search queries.
> I now want to pass a column to check, and receive a list of all the
> documents that contain a term in that column which is used by at
> least one other document for that column (a duplicate term).
>
> thanks paul
>
> Chris Hostetter wrote:
> > : Thanks, this might do it, but do I need to know the terms beforehand?
> > : I just want to return any terms with frequency more than one.
> >
> > no, TermEnum will let you iterate over all the terms ... you don't even
> > need TermDocs if you just want the docFreq for each term (which would be
> > 1 if there are no duplicates)
> >
> > : Erick Erickson wrote:
> > : > Sure, you can use the TermDocs/TermEnum classes. Basically, for a term
> > : > (probably column value in your app) these let you quickly answer the
> > : > question "which (and how many) documents does this term appear in".
> > : > What you get is the Lucene doc id, which lets you fetch all the
> > : > information about the documents you want.
> > : >
> > : > Erick
> > : >
> > : > On 2/23/07, Paul Taylor <paul_t100@fastmail.fm> wrote:
> > : >
> > : >     Hi, I have a Java Swing application with a table, and I was
> > : >     considering using Lucene to index the data in the table. One task
> > : >     I'd like to support is for the user to select 'Find Duplicate
> > : >     records for Column X'; I would then filter the table to show only
> > : >     records where more than one has the same value, i.e. a duplicate
> > : >     for that column. Is there a way to return all the duplicates from
> > : >     a Lucene index?
> > : >
> > : >     thanks paul Taylor
> > : >
> > : >
> > : >     ---------------------------------------------------------------------
> > : >     To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > : >     For additional commands, e-mail: java-user-help@lucene.apache.org
> > : >
> >
> >
> >
> > -Hoss
