lucene-dev mailing list archives

From "Joel Nothman (JIRA)" <>
Subject [jira] [Created] (SOLR-4017) Signatures for deduplication should be Analyzers
Date Tue, 30 Oct 2012 11:28:12 GMT
Joel Nothman created SOLR-4017:

             Summary: Signatures for deduplication should be Analyzers
                 Key: SOLR-4017
             Project: Solr
          Issue Type: Improvement
          Components: update
    Affects Versions: 4.0
         Environment: N/A
            Reporter: Joel Nothman

At present, signatures for deduplication are constructed from the raw text of a specified
set of fields. This means they cannot take advantage of the normalization provided by Analyzers:
stripping of HTML, tokenization, diacritic normalization, stemming, stop-word removal, etc.
Building signatures from analyzed output would also allow a token-based signature like
TextProfileSignature to consider character or token ngrams where appropriate.

Instead of handling this task with a special SignatureUpdateProcessorFactory, it seems one
could do (almost) the same with CloneFieldUpdateProcessorFactory and an appropriate *SignatureAnalyzer
that outputs a single (or indeed, multiple!) Term: a hash. (I am not familiar enough with the
code to know whether the {{overwriteDupes}} option would require a further UpdateProcessor.)
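
To illustrate the idea, here is a rough sketch in plain Java, with no Lucene or Solr
classes involved. The class AnalyzedSignatureSketch and its methods analyze, rawSignature and
analyzedSignature are hypothetical names made up for this example; analyze is only a stand-in
for a real analysis chain, and MD5 is used simply because it is always available in the JDK.
It compares a hash of the raw text with a hash of the normalized tokens, which is the single
term a *SignatureAnalyzer might emit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: not part of Solr or Lucene.
public class AnalyzedSignatureSketch {

    // Stand-in for an analysis chain: strip tags, fold diacritics, lowercase, tokenize.
    public static List<String> analyze(String raw) {
        String noTags = raw.replaceAll("<[^>]*>", " ");
        // Decompose accented characters, then drop the combining marks.
        String folded = Normalizer.normalize(noTags, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        List<String> tokens = new ArrayList<>();
        for (String t : folded.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Hex-encoded MD5 of a string.
    public static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Signature over the raw field text, roughly what happens at present.
    public static String rawSignature(String raw) {
        return md5Hex(raw);
    }

    // Signature over the analyzed tokens: the single "term" of the proposal.
    public static String analyzedSignature(String raw) {
        return md5Hex(String.join(" ", analyze(raw)));
    }

    public static void main(String[] args) {
        String a = "<p>Caf\u00e9  Ol\u00e9!</p>";
        String b = "cafe ole";
        System.out.println("raw signatures equal:      "
                + rawSignature(a).equals(rawSignature(b)));
        System.out.println("analyzed signatures equal: "
                + analyzedSignature(a).equals(analyzedSignature(b)));
    }
}
```

In Solr itself, the analyze step would presumably be the analyzer chain configured on the
destination field type, with the hashing done by a final filter in that chain; how such a
filter would be written against the TokenStream API is not shown here.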

The current approach may be more efficient in most cases, so it could be retained as an option.
