lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Index Dedupe
Date Tue, 02 Oct 2007 13:32:00 GMT
Here's a couple of fragments, alter to suit....
    public void doRemove(Directory dir) throws Exception
    {

      this.reader = IndexReader.open(dir);

        TermEnum theTerms = this.reader.terms(new Term("unique_field", ""));

        Term term = null;

        do {
            term = theTerms.term();

            if ((term == null) || ! term.field().equalsIgnoreCase("doc_id"))
{
                break;
            }

            if (theTerms.docFreq() > 1) {
                this.removeDupsForTerm(term);
            }
        } while (theTerms.next());
}


   private void removeDupsForTerm(Term term) throws Exception
    {
        TermDocs td = this.reader.termDocs(term);
        for ( int idx = 0; td.next(); ++idx) {
            if (idx > 0) {
                this.reader.deleteDocument(td.doc());
           }
        }
    }
On 10/2/07, Johnny R. Ruiz III <jorzi@yahoo.com> wrote:
>
> Hi Daniel,
>
> Tnx, but forgive my ignorance..  can u give me a sample code to do it
> :).   I have never used termDocs() before.
>
> Tnx,
> Johnny
>
> ----- Original Message ----
> From: Daniel Noll <daniel@nuix.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, October 2, 2007 12:00:07 PM
> Subject: Re: Index Dedupe
>
> On Tuesday 02 October 2007 12:25:47 Johnny R. Ruiz III wrote:
> > Hi,
> >
> > I can't seem to find a way to delete duplicate in lucene index.  I
> hve  a
> > unique key so it seems to be straight forward.  But I can't find a
> simple
> > way  to do it except for putting  each record in the index into HashMap.
> > Are there any method in lucene package that I could use?
>
> I would use termDocs() to iterate all the terms in that field.  Then skip
> the
> first doc for each term and delete all subsequent ones.
>
> Daniel
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
>
>
>
>
>
> ____________________________________________________________________________________
> Need a vacation? Get great deals
> to amazing places on Yahoo! Travel.
> http://travel.yahoo.com/

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message