lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration
Date Sat, 04 Feb 2012 05:47:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200339#comment-13200339 ]

Christian Moen commented on LUCENE-3745:
----------------------------------------

I'm attaching some lexical assets that are useful for building stopwords and stoptag lists.

The frequency lists are made from ~1.5 million segmented Japanese Wikipedia documents,
after some scrubbing and handling.  I'd prefer to use a more balanced corpus, but
I believe Wikipedia will be fine for this purpose.

The following files are attached in TSV format using UTF-8 encoding:

* {{top-pos.txt}} - Part-of-speech tag distribution
* {{top-100000.txt}} - Top 100,000 most frequent surface forms and their frequencies
* {{top-1000000-pos.txt}} - Top 1,000,000 most frequent surface form and part-of-speech tag
combinations and their frequencies

There's also a tool, {{filter_stoptags.py}}, attached that reads a set of stoptags and evaluates
it against {{top-1000000-pos.txt}} to give us an idea of what passes through any given stoptag set.

An example with my current stoptag set is given below.

{noformat}
filter_stoptags.py -s stoptags.txt top-1000000-pos.txt
stop: 、        freq: 14426806  pos: 記号-読点
stop: の        freq: 14212851  pos: 助詞-連体化
stop: 。        freq: 10553747  pos: 記号-句点
stop: は        freq: 8956177   pos: 助詞-係助詞
stop: に        freq: 8757138   pos: 助詞-格助詞-一般
stop: を        freq: 7723958   pos: 助詞-格助詞-一般
stop:           freq: 7417005   pos: 記号-空白
stop: た        freq: 7366368   pos: 助動詞
stop: が        freq: 5427730   pos: 助詞-格助詞-一般
stop: て        freq: 4874861   pos: 助詞-接続助詞
pass: し        freq: 4312613   pos: 動詞-自立
stop: で        freq: 3702106   pos: 助詞-格助詞-一般
stop:           freq: 3485125   pos: 記号-空白
stop: )        freq: 3049861   pos: 記号-括弧閉
stop: (        freq: 3045461   pos: 記号-括弧開
pass: れ        freq: 2722773   pos: 動詞-接尾
pass: さ        freq: 2441965   pos: 動詞-自立
stop: で        freq: 2403133   pos: 助動詞
stop: ・        freq: 2250725   pos: 記号-一般
stop: も        freq: 1962142   pos: 助詞-係助詞
pass: する      freq: 1959374   pos: 動詞-自立
pass: いる      freq: 1937789   pos: 動詞-非自立
stop: と        freq: 1927529   pos: 助詞-格助詞-引用
pass: 年        freq: 1796435   pos: 名詞-接尾-助数詞
stop: 「        freq: 1701848   pos: 記号-括弧開
stop: と        freq: 1697926   pos: 助詞-格助詞-一般
stop: 」        freq: 1672052   pos: 記号-括弧閉
stop: から      freq: 1414661   pos: 助詞-格助詞-一般
stop: ある      freq: 1400235   pos: 助動詞
stop:           freq: 1319235   pos: 記号-空白
pass: こと      freq: 1272503   pos: 名詞-非自立-一般
stop: な        freq: 1254673   pos: 助動詞
stop: が        freq: 1110771   pos: 助詞-接続助詞
pass: の        freq: 1037815   pos: 名詞-非自立-一般
stop: として    freq: 1002940   pos: 助詞-格助詞-連語
stop:           freq: 989166    pos: 記号-空白
pass: い        freq: 923836    pos: 動詞-非自立
(...)
{noformat}
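
For readers without the attachment at hand, the evaluation step can be approximated with a rough Python sketch. This is not the attached {{filter_stoptags.py}}; it only illustrates the idea under a few assumptions: each row of the frequency file is taken to hold surface form, part-of-speech tag and frequency in that order, the stoptags file is taken to hold one tag per line with {{#}} comments, and a tag is stopped only on an exact match. The function names are made up for illustration.

```python
import csv
import io


def load_stoptags(lines):
    """Return the set of stoptags, skipping blank lines and '#' comments."""
    tags = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def read_rows(handle):
    """Read tab-separated rows; QUOTE_NONE keeps quote characters as plain data."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def classify(rows, stoptags):
    """Yield (verdict, surface, freq, pos) for each (surface, pos, freq) row.

    Assumption: the column order is surface, pos, freq, and a row is
    stopped only when its full part-of-speech tag is in the stoptag set.
    """
    for surface, pos, freq in rows:
        verdict = "stop" if pos in stoptags else "pass"
        yield verdict, surface, int(freq), pos


if __name__ == "__main__":
    # Tiny in-memory stand-ins for stoptags.txt and top-1000000-pos.txt.
    stoptags = load_stoptags(io.StringIO("# punctuation\n記号-読点\n助詞-連体化\n"))
    sample = io.StringIO("、\t記号-読点\t14426806\nし\t動詞-自立\t4312613\n")
    for verdict, surface, freq, pos in classify(read_rows(sample), stoptags):
        print(f"{verdict}: {surface}\tfreq: {freq}\tpos: {pos}")
```

Keeping the verdict as a plain string makes it easy to grep the output for {{pass:}} lines and see which high-frequency tokens a candidate stoptag set lets through.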

                
> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
>                 Key: LUCENE-3745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments: filter_stoptags.py, top-100000.txt, top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and integrated into Lucene.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       


