lucene-solr-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1869) RemoveDuplicatesTokenFilter doest have expected behaviour
Date Wed, 07 Apr 2010 20:16:33 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854676#action_12854676 ]

Robert Muir commented on SOLR-1869:
-----------------------------------

Joe, the initialization is the same. I simply prefer to initialize an attribute right where
it is declared rather than in the constructor (it's the same in Java!), so this is not a problem.
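To illustrate the point about initialization, here is a minimal sketch. The class names are hypothetical and are not taken from the Solr source. In Java, an instance field initializer runs as part of every constructor call, before the constructor body, so both styles below give each instance its own freshly created set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical classes for illustration only; not the actual filter code.
class InitAtDeclaration {
    // This initializer runs for every new instance, before the constructor body.
    final Set<String> seen = new HashSet<>();

    InitAtDeclaration() {
        // 'seen' is already assigned at this point.
        seen.add("ready");
    }
}

class InitInConstructor {
    final Set<String> seen;

    InitInConstructor() {
        seen = new HashSet<>();
        seen.add("ready");
    }
}

class FieldInitDemo {
    public static void main(String[] args) {
        InitAtDeclaration a = new InitAtDeclaration();
        InitAtDeclaration b = new InitAtDeclaration();
        // Each instance owns a separate set; nothing is shared between instances.
        System.out.println(a.seen != b.seen);
        // Both initialization styles end up with equal state.
        System.out.println(a.seen.equals(new InitInConstructor().seen));
    }
}
```

Only a static field would be shared across instances, and neither style above declares one.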

As far as the behavior goes, the filter is currently correct:
{noformat}
A TokenFilter which filters out Tokens at the same position and Term text as the previous
token in the stream.
{noformat}
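The documented behavior can be sketched with a simplified plain-Java model. This is not the Lucene API: the Tok class and the filter method are hypothetical stand-ins, and a token is reduced to its term text plus a position increment, where an increment of 0 means "same position as the previous token". A token is dropped only when the same term was already seen at the current position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token: term text plus position increment.
final class Tok {
    final String term;
    final int posInc; // 0 means "same position as the previous token"

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }

    @Override
    public String toString() {
        return term + "/" + posInc;
    }
}

final class SamePositionDedup {
    // Drops a token only if the same term was already seen at the current position.
    static List<Tok> filter(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Tok t : in) {
            if (t.posInc > 0) {
                seenAtPosition.clear(); // moved to a new position, start over
            }
            boolean duplicate = t.posInc == 0 && seenAtPosition.contains(t.term);
            seenAtPosition.add(t.term);
            if (!duplicate) {
                out.add(t);
            }
        }
        return out;
    }
}
```

With the input fast/1, quick/0, fast/0, fast/1, this model drops only the third token: the final "fast" sits at a new position and is kept.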

If you instead want a filter that removes duplicates across an entire field, that is really
a completely different filter, but it sounds like a useful one!

Can you instead create a patch for a separate filter with a different name?

I think you can start with this patch, though it has a number of issues:
* The map/set is never cleared, so it won't work across reusable token streams. The map/set
should be cleared in reset().
* I would use CharArraySet instead of this map, like the current RemoveDuplicatesTokenFilter.
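The reuse problem in the first point can be sketched with a plain-Java model. The FieldWideDedup class is hypothetical: a real filter would extend Lucene's TokenFilter, override reset(), and could use CharArraySet rather than a HashSet. The model also ignores position increments, which a real filter would need to handle.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java model of a field-wide duplicate remover.
final class FieldWideDedup {
    private final Set<String> seen = new HashSet<>();

    // Plays the role of TokenStream.reset(): called before the stream is reused.
    void reset() {
        seen.clear();
    }

    // Keeps only the first occurrence of each term since the last reset().
    List<String> filter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (seen.add(t)) { // add() returns false when the term was already present
                out.add(t);
            }
        }
        return out;
    }
}
```

If reset() is not called between two fields, terms from the first field are wrongly suppressed in the second. Clearing the set in reset() avoids that.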


> RemoveDuplicatesTokenFilter doest have expected behaviour
> ---------------------------------------------------------
>
>                 Key: SOLR-1869
>                 URL: https://issues.apache.org/jira/browse/SOLR-1869
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>            Reporter: Joe Calderon
>            Priority: Minor
>         Attachments: SOLR-1869.patch
>
>
> The RemoveDuplicatesTokenFilter seems broken, as it initializes its map and attributes
> at the class level and not within its constructor.
> In addition, I would think the expected behaviour would be to remove identical terms with
> the same offset positions; instead, it looks like it removes duplicates based on position
> increment, which won't work when using it after something like the EdgeNGram filter. When
> I posted this to the mailing list, even Erik Hatcher seemed to think that's what this
> filter was supposed to do...
> Attaching a patch that has the expected behaviour and initializes variables in the constructor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

