lucene-dev mailing list archives

From "Joel Nothman (JIRA)" <>
Subject [jira] [Updated] (SOLR-4017) Signatures for deduplication should be Analyzers
Date Mon, 05 Nov 2012 14:25:02 GMT


Joel Nothman updated SOLR-4017:

    Priority: Minor  (was: Major)
> Signatures for deduplication should be Analyzers
> ------------------------------------------------
>                 Key: SOLR-4017
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 4.0
>         Environment: N/A
>            Reporter: Joel Nothman
>            Priority: Minor
> At present, signatures for deduplication are constructed from the raw text of a specified
set of fields. This means they cannot take advantage of the normalization provided by Analyzers:
stripping of HTML, tokenization, diacritic normalization, stemming, stop-word removal, etc.
Building signatures from analyzed tokens would also allow a token-based signature like the
TextProfileSignature to consider character or token ngrams where appropriate.
> Instead of handling this task with a special SignatureUpdateProcessorFactory, it seems
one could do (almost) the same with CloneFieldUpdateProcessorFactory and an appropriate
*SignatureAnalyzer that outputs a single (or indeed, multiple!) Term: a hash. (I am not familiar
enough with the code to know whether the {{overwriteDupes}} option would require a further UpdateProcessor.)
> The current approach may be more efficient for most cases, so could be retained for efficiency
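
To illustrate the idea in the description, here is a minimal, hedged sketch in plain Java of a signature computed from analyzed tokens rather than raw field text. It is not Lucene or Solr code: the class AnalyzedSignatureSketch and its methods analyze and signature are hypothetical names, the regex-based normalization is only a stand-in for a real Analyzer chain, and MD5 is an assumed choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; not a Lucene or Solr class. */
public class AnalyzedSignatureSketch {

    /** Stand-in for an analysis chain: strip tags, fold case and diacritics, tokenize. */
    public static List<String> analyze(String raw) {
        String noTags = raw.replaceAll("<[^>]*>", " "); // crude HTML stripping
        String decomposed = Normalizer.normalize(noTags, Normalizer.Form.NFD);
        String folded = decomposed.replaceAll("\\p{M}+", "") // drop combining marks
                                  .toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String t : folded.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Collapses the analyzed token stream into a single hex-encoded hash "term". */
    public static String signature(String raw) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String token : analyze(raw)) {
                md5.update(token.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ["ab","c"] differs from ["a","bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        String withMarkup = "<p>Caf\u00e9  Society</p>";
        String plain = "cafe society";
        // Both inputs analyze to the tokens [cafe, society], so the signatures match.
        System.out.println(signature(withMarkup).equals(signature(plain)));
    }
}
```

A signature built on raw text would treat the two inputs in main as different documents. Hashing after analysis makes them duplicates, which is the behaviour the issue asks for.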

