lucene-solr-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-41) PATCH: HyphenatedWordsFilter, Factory and test
Date Fri, 28 Jul 2006 15:45:14 GMT
    [ http://issues.apache.org/jira/browse/SOLR-41?page=comments#action_12424112 ] 
            
Yonik Seeley commented on SOLR-41:
----------------------------------

Thanks Boris!

A common problem when creating new tokens is losing existing position increments.
I recently changed Lucene's Token class so that it's cloneable and you can change the text with setTermText().

So you may want to just change the text of the first token rather than creating a new one.
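
A rough sketch of that idea, for illustration only. The Tok class below is a hypothetical stand-in that carries just a term text and a position increment; it is not Lucene's Token class, and joinHyphenated is an invented helper, not the code from the attached patch. It assumes a hyphenated word arrives as a token ending in "-" followed by its continuation, and it rewrites the text of the first token in place so that the token's position increment survives:

```java
import java.util.ArrayList;
import java.util.List;

class HyphenJoinSketch {

    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        private String text;
        private final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        String termText() { return text; }

        void setTermText(String text) { this.text = text; }

        int getPositionIncrement() { return positionIncrement; }
    }

    /**
     * Joins a token ending in '-' with the token that follows it.
     * The first token is reused and only its text is changed, so its
     * position increment is carried over unchanged. The continuation
     * token is dropped.
     */
    static List<Tok> joinHyphenated(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok pending = null; // first half of a hyphenated word, waiting for its continuation

        for (Tok t : in) {
            if (pending != null) {
                String head = pending.termText();
                // Strip the trailing hyphen and append the continuation text.
                pending.setTermText(head.substring(0, head.length() - 1) + t.termText());
                if (!pending.termText().endsWith("-")) {
                    out.add(pending);
                    pending = null;
                }
                // Otherwise the joined text is still hyphenated; keep waiting.
            } else if (t.termText().endsWith("-")) {
                pending = t;
            } else {
                out.add(t);
            }
        }
        if (pending != null) {
            out.add(pending); // trailing hyphenated token with no continuation
        }
        return out;
    }
}
```

Had the filter built a fresh token for the joined word instead, it would have needed to copy the position increment across explicitly; reusing the first token avoids that step.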

> PATCH: HyphenatedWordsFilter, Factory and test
> ----------------------------------------------
>
>                 Key: SOLR-41
>                 URL: http://issues.apache.org/jira/browse/SOLR-41
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Boris Vitez
>            Priority: Minor
>         Attachments: HyphenatedWordsFilter.java, hyphenatedwordsfilter.patch, HyphenatedWordsFilterFactory.java, TestHyphenatedWordsFilter.java
>
>
> When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines. This is often the case with documents where narrow text columns are used, such as newsletters.
> In order to increase searching efficiency, this filter unites hyphenated words broken in two lines.
> This filter has to be used together with the WordDelimiterFilter having catenateWords=1.


        
