lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration
Date Sun, 05 Feb 2012 07:26:55 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200680#comment-13200680 ]

Christian Moen commented on LUCENE-3745:
----------------------------------------

Please find a patch attached.

I've made {{stoptags.txt}} lighter by not stopping all prefixes and also allowing auxiliary
verbs and interjections to pass.  I didn't come across any occurrences of unclassified symbols
(記号) in Wikipedia, but that tag is now stopped as this seems to align better with our overall
stopping approach for symbols.

Many of the most frequent terms that now pass have been re-introduced in {{stopwords.txt}}
so they are stopped using a {{StopFilter}} instead of {{KuromojiPartOfSpeechStopFilter}}.
I believe this configuration is more balanced.
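
To illustrate the split between the two lists, here is a toy Python model of the two stages. It is not Lucene code: the function names are hypothetical, and the tags and words are illustrative placeholders rather than the actual contents of the patch. It assumes tokens arrive as (surface, part-of-speech) pairs and that a tag is stopped only on an exact match.

```python
# Toy model of two-stage stopping; not the Lucene implementation.
# Tags and words below are illustrative placeholders, not the patch contents.

STOPTAGS = {"助詞-係助詞", "記号-句点", "記号"}  # hypothetical excerpt of stoptags.txt
STOPWORDS = {"これ", "です"}                      # hypothetical excerpt of stopwords.txt


def pos_stop_filter(tokens, stoptags):
    """Stage 1: drop tokens whose part-of-speech tag is listed exactly."""
    return [(surface, pos) for surface, pos in tokens if pos not in stoptags]


def stop_filter(tokens, stopwords):
    """Stage 2: drop tokens whose surface form is a listed stopword."""
    return [(surface, pos) for surface, pos in tokens if surface not in stopwords]


def analyze(tokens):
    """Run both stages and return the surviving surface forms."""
    kept = stop_filter(pos_stop_filter(tokens, STOPTAGS), STOPWORDS)
    return [surface for surface, _ in kept]


# Hand-tagged sample sentence (これは東京です。); the tagging is an assumption.
SAMPLE = [
    ("これ", "名詞-代名詞-一般"),
    ("は", "助詞-係助詞"),
    ("東京", "名詞-固有名詞-地域-一般"),
    ("です", "助動詞"),
    ("。", "記号-句点"),
]

result = analyze(SAMPLE)
```

In this sketch the auxiliary verb passes the part-of-speech stage because its tag is not listed, and is then removed by the stopword stage, which is the behaviour described above for frequent terms moved into {{stopwords.txt}}.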

Overall, I've used the attached term frequencies as a governing guideline for what to introduce
into {{stopwords.txt}}.  It mostly contains hiragana words and expressions and I've deliberately
left out common kanji as I'd like to keep the stopping fairly light.
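
As a rough sketch of that selection idea (not the procedure actually used for the patch, which involved manual judgement), the following Python takes a term-to-count mapping, looks at the most frequent terms, and keeps only those written entirely in hiragana. The function names, the sample counts, and the cut-off are all assumptions made for illustration; the format of the attached frequency files is not assumed here.

```python
import re

# Matches strings made up only of hiragana letters (U+3041 to U+3096).
HIRAGANA_ONLY = re.compile(r"[\u3041-\u3096]+")


def is_hiragana(term):
    """Return True if every character of term is a hiragana letter."""
    return HIRAGANA_ONLY.fullmatch(term) is not None


def stopword_candidates(term_counts, top_n):
    """Return hiragana-only terms among the top_n most frequent terms.

    Terms containing kanji are skipped, keeping the list light.
    """
    ranked = sorted(term_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_n] if is_hiragana(term)]


# Made-up counts for illustration only.
counts = {"の": 100, "する": 80, "年": 70, "こと": 60, "日本": 50}
candidates = stopword_candidates(counts, top_n=4)
```

A list produced this way would only be a starting point; each candidate still needs to be reviewed by hand before it goes into {{stopwords.txt}}.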

I'll create a separate JIRA for introducing stopwords and stoptags to Solr.
                
> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments: LUCENE-3745.patch, filter_stoptags.py, top-100000.txt, top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and integrated into Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

