lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Can I use Lucene to retrieve a list of duplicates
Date Mon, 26 Feb 2007 16:25:11 GMT
Hi

I got it working before I saw your latest mail; the only problem is 
that it doesn't look very efficient. This is my duplicate method (below). 
The problem is that I have to enumerate through *every* term. It was 
worse before, because I was only interested in terms that matched a 
particular field (column) but had to enumerate through every term 
whatever field it was part of. So I recreated my index so that each 
document contains only a row number field and a second field for the 
value of the column. However, this means I am going to end up with a 
number of different indexes, each solving a particular problem.

paul

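 public List<Integer> getDuplicates()
    {
        List<Integer> matches = new ArrayList<Integer>();
        IndexReader ir = null;
        try
        {
            ir = IndexReader.open(directory);
            // Walks every term in the index, whatever field it belongs to
            TermEnum terms = ir.terms();
            while (terms.next())
            {
                // A term found in more than one document is a duplicated value
                if (terms.docFreq() > 1)
                {
                    TermDocs termDocs = ir.termDocs(terms.term());
                    while (termDocs.next())
                    {
                        Document d = ir.document(termDocs.doc());
                        matches.add(Integer.valueOf(d.getField(ROW_NUMBER).stringValue()));
                    }
                    termDocs.close();
                }
            }
            terms.close();
        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }
        finally
        {
            // Release the reader even if enumeration failed part way through
            if (ir != null)
            {
                try
                {
                    ir.close();
                }
                catch (IOException ioe)
                {
                    ioe.printStackTrace();
                }
            }
        }
        return matches;
    }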
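
One possible way around enumerating every term, sketched as a toy model 
rather than as Lucene code: Lucene keeps its terms ordered by field first 
and then by term text, and IndexReader.terms(Term) starts the enumeration 
at a given term, so a walk can begin at the first term of one field and 
stop as soon as the field changes. The class below imitates that ordering 
with a TreeMap. FieldDuplicates, add() and duplicates() are hypothetical 
names invented for this illustration, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy model of a term dictionary sorted by field, then by term text.
 * This is NOT the Lucene API; every name here is hypothetical.
 */
final class FieldDuplicates {

    // Key is field + '\0' + text, so all terms of one field sit next to each other.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    /** Records that the given row holds the given value in the given field. */
    void add(String field, String text, int row) {
        String key = field + '\0' + text;
        List<Integer> rows = postings.get(key);
        if (rows == null) {
            rows = new ArrayList<>();
            postings.put(key, rows);
        }
        rows.add(row);
    }

    /** Returns the rows whose value in this field also occurs in another row. */
    List<Integer> duplicates(String field) {
        List<Integer> matches = new ArrayList<>();
        String start = field + '\0';
        // tailMap plays the role of seeking to the first term of the field
        for (Map.Entry<String, List<Integer>> e : postings.tailMap(start, true).entrySet()) {
            if (!e.getKey().startsWith(start)) {
                break; // first term of the next field: nothing more to see
            }
            if (e.getValue().size() > 1) { // same idea as docFreq() > 1
                matches.addAll(e.getValue());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        FieldDuplicates index = new FieldDuplicates();
        index.add("artist", "Beatles", 1);
        index.add("artist", "Kinks", 2);
        index.add("artist", "Beatles", 3);
        index.add("title", "Help", 1);
        index.add("title", "Help", 2);
        System.out.println(index.duplicates("artist")); // prints [1, 3]
    }
}
```

With a walk like this, several columns could share one index, since the 
cost depends on the number of terms in the requested field rather than on 
the total number of terms. Whether that carries over cleanly to a real 
index would need checking against the Lucene version in use.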
