lucene-dev mailing list archives

From "Ahmet Arslan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-5558) Add TruncateTokenFilter
Date Fri, 28 Mar 2014 01:03:43 GMT
Ahmet Arslan created LUCENE-5558:
------------------------------------

             Summary: Add TruncateTokenFilter
                 Key: LUCENE-5558
                 URL: https://issues.apache.org/jira/browse/LUCENE-5558
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
    Affects Versions: 4.7
            Reporter: Ahmet Arslan
            Priority: Minor
             Fix For: 4.8


I use this filter as a stemmer for Turkish. It appears in many academic studies (classification,
retrieval), where it is called the Fixed Prefix Stemmer, the Simple Truncation Method, or F5
for short.

Among F3 to F7, the F5 stemmer (length=5) was found to work well for Turkish in [Information
Retrieval on Turkish Texts|http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf].
This is the same work from which most of stopwords_tr.txt was taken.

ElasticSearch has a [truncate|http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html]
filter, but it does not respect the keyword attribute. It also has a use case similar to TruncateFieldUpdateProcessorFactory.
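
To make the intended behaviour concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name FixedPrefixSketch and the method truncate are hypothetical, and the boolean keyword flag only stands in for Lucene's KeywordAttribute; this is an illustration of fixed-prefix truncation that skips keyword tokens, not the proposed patch itself.

```java
public class FixedPrefixSketch {

    /**
     * Keeps the first prefixLength chars of a token.
     * Tokens flagged as keywords, and tokens that are already
     * short enough, are returned unchanged.
     */
    public static String truncate(String token, int prefixLength, boolean keyword) {
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be at least 1");
        }
        if (keyword || token.length() <= prefixLength) {
            return token;
        }
        return token.substring(0, prefixLength);
    }

    public static void main(String[] args) {
        // F5: an ordinary token is cut to its first five chars
        System.out.println(truncate("kitaplardan", 5, false));
        // a keyword token passes through untouched
        System.out.println(truncate("kitaplardan", 5, true));
        // a short token is left as it is
        System.out.println(truncate("ev", 5, false));
    }
}
```

With length=5 this would reduce "kitaplardan" to "kitap", while the same token marked as a keyword would stay intact.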

The main advantage of F5 stemming is that it is not affected by the meaning loss caused by ASCII
folding. It is a diacritics-insensitive stemmer and works well with ASCII folding; see [Effects
of diacritics on Turkish information retrieval|http://journals.tubitak.gov.tr/elektrik/issues/elk-12-20-5/elk-20-5-9-1010-819.pdf].

Here is the full field type I use for "diacritics-insensitive search" in Turkish:
{code:xml}
 <fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
{code}
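
The effect of this chain can be approximated with a small stand-alone Java sketch. Everything in it is an assumption made for illustration: the class TrAsciiF5Sketch and its methods are hypothetical, whitespace splitting stands in for StandardTokenizer, the folding table covers only six Turkish letters rather than everything ASCIIFoldingFilter handles, and a per-position set stands in for KeywordRepeatFilter followed by RemoveDuplicatesTokenFilter.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TrAsciiF5Sketch {

    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Simplified folding table: c-cedilla, soft g, dotless i, o-umlaut, s-cedilla, u-umlaut
    private static final String FROM = "\u00e7\u011f\u0131\u00f6\u015f\u00fc";
    private static final String TO = "cgiosu";

    /** Drops the apostrophe and everything after it. */
    static String stripApostropheSuffix(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\'' || c == '\u2019') {
                return token.substring(0, i);
            }
        }
        return token;
    }

    /** Replaces the Turkish letters listed in FROM with their ASCII counterparts. */
    static String foldTurkish(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            int idx = FROM.indexOf(c);
            sb.append(idx < 0 ? c : TO.charAt(idx));
        }
        return sb.toString();
    }

    /** Returns, for each token position, the distinct terms that would be indexed there. */
    public static List<List<String>> analyze(String text, int prefixLength) {
        List<List<String>> positions = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = foldTurkish(stripApostropheSuffix(raw).toLowerCase(TURKISH));
            // The set removes the duplicate when truncation changes nothing
            Set<String> atPosition = new LinkedHashSet<>();
            atPosition.add(term); // keyword copy, kept whole
            atPosition.add(term.length() > prefixLength ? term.substring(0, prefixLength) : term);
            positions.add(new ArrayList<>(atPosition));
        }
        return positions;
    }

    public static void main(String[] args) {
        // Input is "Türkiye'de kitapları ev", written with Unicode escapes
        System.out.println(analyze("T\u00fcrkiye'de kitaplar\u0131 ev", 5));
    }
}
```

Under these assumptions, each position carries the folded full form plus its five-character prefix, and a query typed without diacritics ("kitaplari") produces the same terms as one typed with them.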

I would like to get the community's opinion:

1) Is there any interest in this?
2) Should the keyword attribute be respected?
3) Package name: analysis.misc versus analysis.tr?
4) Class name: TruncateTokenFilter versus FixedPrefixStemFilter?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


