lucene-solr-dev mailing list archives

From "Richard \"Trey\" Hyde (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-11) DeDupTokenFilter{Factory}
Date Tue, 09 May 2006 19:32:05 GMT
    [ http://issues.apache.org/jira/browse/SOLR-11?page=comments#action_12378708 ] 

Richard "Trey" Hyde commented on SOLR-11:
-----------------------------------------

Re algorithm: Yes, it would not be advisable for large fields.
I'm not all that familiar with the subtleties of position, but for at least some of my data
I am getting duplicates in other positions.

> DeDupTokenFilter{Factory}
> -------------------------
>
>          Key: SOLR-11
>          URL: http://issues.apache.org/jira/browse/SOLR-11
>      Project: Solr
>         Type: Wish

>   Components: search
>     Reporter: Hoss Man
>  Attachments: solr.analysis.RemoveDuplicateTokensFilter.java
>
> I recently noticed a situation in which my Query analyzer was producing the same Token
more than once, resulting in it getting two equally boosted clauses in the resulting query.
 In my specific case, I was using the same synonym file for multiple fields (some stemmed,
some not) and two synonyms for a word stemmed to the same root, which meant that particular
word was worth twice as much as any of the other variations of the synonym -- but I can imagine
other situations where this might come up, both at index time and at query time, particularly
when using SynonymFilter in combination with the WordDelimiter filter.
> It occurred to me that a DeDupFilter would be handy.  In its simplest form it would drop
any Token it gets where the startOffset, endOffset, termText, and type are all identical to
the previous token and the positionIncrement is 0.  A more robust implementation might support
init options indicating that only certain combinations of those things should be used to determine
equality (i.e. just termText, just termText and positionIncrement=0, etc.), but in that case
an option might also be necessary to determine which of the Tokens should be propagated (the
first or the last).
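
The "simplest form" described above can be sketched roughly as follows. This is only an illustration of the stated rule, not the attached RemoveDuplicateTokensFilter: `SketchToken` and `DeDupSketch` are hypothetical names, and `SketchToken` is a plain stand-in for a Lucene Token so the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Token; NOT the real Lucene class.
final class SketchToken {
    final String termText;
    final int startOffset;
    final int endOffset;
    final String type;
    final int positionIncrement;

    SketchToken(String termText, int startOffset, int endOffset,
                String type, int positionIncrement) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.type = type;
        this.positionIncrement = positionIncrement;
    }

    // True when the four fields named in the issue all match.
    boolean sameAs(SketchToken other) {
        return other != null
            && termText.equals(other.termText)
            && startOffset == other.startOffset
            && endOffset == other.endOffset
            && type.equals(other.type);
    }
}

// Hypothetical helper: applies the "simplest form" rule to a token list.
final class DeDupSketch {
    static List<SketchToken> dedupe(List<SketchToken> tokens) {
        List<SketchToken> kept = new ArrayList<>();
        SketchToken previous = null;
        for (SketchToken t : tokens) {
            // Drop only if identical to the previous token AND stacked on it.
            boolean duplicate = t.positionIncrement == 0 && t.sameAs(previous);
            if (!duplicate) {
                kept.add(t);
            }
            previous = t;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SketchToken> in = List.of(
            new SketchToken("run", 0, 7, "word", 1),
            new SketchToken("run", 0, 7, "word", 0),  // repeat of previous: dropped
            new SketchToken("jog", 0, 7, "word", 0)); // different termText: kept
        for (SketchToken t : dedupe(in)) {
            System.out.println(t.termText);
        }
    }
}
```

Because this sketch only compares each token with the one immediately before it, it needs no extra memory, but it would not catch the case raised in the comment above, where the same term shows up again at a different position.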

