lucene-solr-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
Date Wed, 08 Oct 2008 19:45:44 GMT


Mark Miller commented on SOLR-799:

bq.    I agree that it is wise to separate the detection of duplication from the handling
of found duplicates

bq. Though in some implementations (like #2, which may be the default), detecting that duplicate
and handling it are truly coupled... forcing a decoupling would not be a good thing in that

Still looking at this. I was hoping to avoid any of the old 'if Solr crashes you can have 2
docs with the same id in the index' type stuff. Guess I won't easily get away with that <g>
Hopefully we can make it so the default implementation is still just as efficient and atomic.

bq. How should different "types" be handled (for example when we support binary fields). For
example, different base64 encoders might use different line lengths or different line endings
(CR/LF). Perhaps it's good enough to say that the string form must be identical, and leave
it at that for now? The alternative would be signatures based on the Lucene Document about
to be indexed.

Yeah, it may be best to worry about it when we support binary fields... it would be nice to
look forward though. I think returning a byte[] rather than a String will future-proof the
sig implementations a bit along those lines (though it doesn't address your point)... still
mulling - this shouldn't trip up fuzzy hashing implementations too much, and so how exact should MD5Signature
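
Along those lines, a minimal sketch of the byte[]-returning idea (class and method names are hypothetical here, not the patch's actual API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not the SOLR-799 patch API: an MD5 signature that
// returns the raw 16-byte digest instead of a hex String, so binary input
// could later be hashed without committing to a particular text encoding.
public class Md5Signature {
  public byte[] calculate(String content) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return md5.digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      // Every compliant JVM ships MD5, so this should never happen.
      throw new RuntimeException(e);
    }
  }
}
```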

bq.     *  It appears that if you put fields in a different order that the signature will
bq.     * It appears that documents with different field names but the same content will have
the same signature.

Two good points I have addressed.
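
For illustration, one way to handle both points (a sketch under assumed names, not the actual fix in the patch) is to digest fields in sorted-by-name order, mixing each field's name into the hash alongside its value:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: field order no longer matters because the digest sees a
// canonical (sorted-by-name) ordering, and hashing the field name means
// the same value under different field names yields different signatures.
public class FieldAwareSignature {
  public byte[] calculate(Map<String, String> fields) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
        md5.update(e.getKey().getBytes(StandardCharsets.UTF_8));
        md5.update((byte) 0); // separator so name/value boundaries are unambiguous
        md5.update(e.getValue().getBytes(StandardCharsets.UTF_8));
        md5.update((byte) 0);
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is always available
    }
  }
}
```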

bq. It would be nice to be able to calculate a signature for a document w/o having to catenate
all the fields together.
Perhaps change calculate(String content) to something like calculate(Iterable<CharSequence>

I like the idea of incremental as well.
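
For what it's worth, MessageDigest already digests incrementally, so a calculate over an Iterable along the lines suggested falls out naturally with no need to concatenate the fields first (sketch only, hypothetical names):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: feeds each field value into the digest as it arrives.
// Because MessageDigest.update() is streaming, the result is identical to
// hashing the concatenation of all parts, without building that String.
public class IncrementalSignature {
  public byte[] calculate(Iterable<? extends CharSequence> parts) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (CharSequence part : parts) {
        md5.update(part.toString().getBytes(StandardCharsets.UTF_8));
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is always available
    }
  }
}
```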

bq. I don't understand the dedup logic in DUH2... it seems like we want to delete by id and
by sig... unfortunately there is no
IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a separate non-atomic
delete on the sig for now, right?

Another one I was hoping to get away with. My current strategy was to say that setting an
update term means that updating by id is overridden and *only* the update Term is used - effectively,
the update Term (signature) becomes the update id - and you can control whether the id factors
into that update signature or not. Didn't get past the goalie I suppose <g> I guess
I'll give up on a clean atomic impl and perhaps investigate update(terms[], doc) for the future.
I wanted to deal with both signature and id, but figured it's best to start with the most efficient
bare bones and work out from there.
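
The two-step update can be modeled with a toy in-memory stand-in for the index (Lucene's real IndexWriter has updateDocument(Term, Document) but no Term[] overload, hence the separate, non-atomic delete on the signature; names below are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model sketching the non-atomic dedup update discussed
// above: a separate delete on the signature term first, then the usual
// atomic delete-by-id + add. Not real Lucene code.
public class DedupUpdateSketch {
  public static class Doc {
    public final String id;
    public final String sig;
    public Doc(String id, String sig) { this.id = id; this.sig = sig; }
  }

  private final List<Doc> index = new ArrayList<>();

  public void update(Doc doc) {
    // Step 1 (non-atomic): delete any existing doc with the same signature.
    index.removeIf(d -> d.sig.equals(doc.sig));
    // Step 2 (atomic in Lucene via updateDocument(Term, doc)):
    // delete by id, then add the new doc.
    index.removeIf(d -> d.id.equals(doc.id));
    index.add(doc);
  }

  public int size() { return index.size(); }
}
```

If a crash happens between the two steps, the signature delete has committed but the new doc has not been added - which is exactly the atomicity gap being discussed.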

bq. There's probably no need for a separate test solrconfig-deduplicate.xml if all it adds
is an update processor. Tests could just explicitly specify the update handler on updates.

It's mainly for me at the moment (testing config-settings loading and whatnot); I'll be sure
to pull it once the patch is done.

Thanks for all of the feedback.

> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>                 Key: SOLR-799
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
> Hash based duplicate document detection is efficient and allows for blocking as well
as field collapsing. Let's put it into Solr. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
