lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3745) Need stopwords and stoptags lists for default Japanese configuration
Date Sat, 04 Feb 2012 05:47:53 GMT


Christian Moen commented on LUCENE-3745:

I'm attaching some lexical assets that are useful for building stopword and stoptag lists.

The frequency lists are made from ~1.5 million segmented Japanese Wikipedia documents, after
some scrubbing and handling.  I'd prefer a more balanced corpus, but I believe Wikipedia
will be fine for this purpose.

The following files are attached in TSV format using UTF-8 encoding:

* {{top-pos.txt}} - Part-of-speech tag distribution
* {{top-100000.txt}} - Top 100,000 most frequent surface forms and their frequencies
* {{top-1000000-pos.txt}} - Top 1,000,000 most frequent surface form and part-of-speech tag
combinations and their frequencies

There's also a tool {{}} attached that reads a set of stoptags and evaluates
it on {{top-1000000-pos.txt}} to give us an idea of what passes through any given stoptag set.
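
For readers without the attached tool, the evaluation described above can be approximated with a short script. The following is a minimal sketch, not the attached tool itself. It is written in Python, and the function names ({{load_stoptags}}, {{read_rows}}, {{evaluate}}) are hypothetical. It assumes that each TSV row holds surface form, part-of-speech tag and frequency in that order, that the stoptag file lists one tag per line with {{#}} comments, and that a token is stopped only when its full tag matches exactly.

```python
import csv
import io


def load_stoptags(lines):
    """Collect stoptags, one per line; blank lines and '#' comments are skipped (assumed format)."""
    tags = set()
    for line in lines:
        tag = line.strip()
        if tag and not tag.startswith("#"):
            tags.add(tag)
    return tags


def read_rows(handle):
    """Read tab-separated rows, assuming the column order surface, pos, freq."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 3:  # silently skip malformed rows
            yield row


def evaluate(rows, stoptags):
    """Yield (verdict, surface, freq, pos) for each row.

    A row is stopped only when its full tag is in the stoptag set
    (exact matching is an assumption of this sketch).
    """
    for surface, pos, freq in rows:
        verdict = "stop" if pos in stoptags else "pass"
        yield verdict, surface, int(freq), pos


if __name__ == "__main__":
    # Two rows taken from the frequencies quoted in this comment.
    sample = "の\t助詞-連体化\t14212851\nし\t動詞-自立\t4312613\n"
    stoptags = load_stoptags(["# particles", "助詞-連体化"])
    for verdict, surface, freq, pos in evaluate(read_rows(io.StringIO(sample)), stoptags):
        print(f"{verdict}: {surface}\tfreq: {freq}\tpos: {pos}")
```

Printing one line per row with a {{stop:}} or {{pass:}} prefix makes it easy to scan the highest-frequency entries that still pass a given stoptag set.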

An example with my current stoptag set is given below.

{noformat} -s stoptags.txt top-1000000-pos.txt
stop: 、        freq: 14426806  pos: 記号-読点
stop: の        freq: 14212851  pos: 助詞-連体化
stop: 。        freq: 10553747  pos: 記号-句点
stop: は        freq: 8956177   pos: 助詞-係助詞
stop: に        freq: 8757138   pos: 助詞-格助詞-一般
stop: を        freq: 7723958   pos: 助詞-格助詞-一般
stop:           freq: 7417005   pos: 記号-空白
stop: た        freq: 7366368   pos: 助動詞
stop: が        freq: 5427730   pos: 助詞-格助詞-一般
stop: て        freq: 4874861   pos: 助詞-接続助詞
pass: し        freq: 4312613   pos: 動詞-自立
stop: で        freq: 3702106   pos: 助詞-格助詞-一般
stop:           freq: 3485125   pos: 記号-空白
stop: )        freq: 3049861   pos: 記号-括弧閉
stop: (        freq: 3045461   pos: 記号-括弧開
pass: れ        freq: 2722773   pos: 動詞-接尾
pass: さ        freq: 2441965   pos: 動詞-自立
stop: で        freq: 2403133   pos: 助動詞
stop: ・        freq: 2250725   pos: 記号-一般
stop: も        freq: 1962142   pos: 助詞-係助詞
pass: する      freq: 1959374   pos: 動詞-自立
pass: いる      freq: 1937789   pos: 動詞-非自立
stop: と        freq: 1927529   pos: 助詞-格助詞-引用
pass: 年        freq: 1796435   pos: 名詞-接尾-助数詞
stop: 「        freq: 1701848   pos: 記号-括弧開
stop: と        freq: 1697926   pos: 助詞-格助詞-一般
stop: 」        freq: 1672052   pos: 記号-括弧閉
stop: から      freq: 1414661   pos: 助詞-格助詞-一般
stop: ある      freq: 1400235   pos: 助動詞
stop:           freq: 1319235   pos: 記号-空白
pass: こと      freq: 1272503   pos: 名詞-非自立-一般
stop: な        freq: 1254673   pos: 助動詞
stop: が        freq: 1110771   pos: 助詞-接続助詞
pass: の        freq: 1037815   pos: 名詞-非自立-一般
stop: として    freq: 1002940   pos: 助詞-格助詞-連語
stop:           freq: 989166    pos: 記号-空白
pass: い        freq: 923836    pos: 動詞-非自立
{noformat}
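
To gauge how much of the corpus a stoptag set removes, the stop and pass frequencies in output like the above can be totalled. A small Python sketch follows. The function name {{summarize}} and the regular expression are assumptions based only on the line layout shown in this comment, including entries whose surface form is whitespace.

```python
import re
from collections import Counter

# Matches lines such as "stop: の        freq: 14212851  pos: 助詞-連体化".
# The surface form may be empty when the token itself is whitespace.
LINE = re.compile(r"^(stop|pass):\s?(.*?)\s*freq:\s*(\d+)\s+pos:\s*(\S+)\s*$")


def summarize(lines):
    """Return total frequency per verdict ('stop' or 'pass'); unparseable lines are ignored."""
    totals = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            totals[match.group(1)] += int(match.group(3))
    return totals


if __name__ == "__main__":
    output = [
        "stop: 、        freq: 14426806  pos: 記号-読点",
        "stop:           freq: 7417005   pos: 記号-空白",
        "pass: し        freq: 4312613   pos: 動詞-自立",
    ]
    totals = summarize(output)
    stopped = totals["stop"] / (totals["stop"] + totals["pass"])
    print(f"stopped share of listed frequency: {stopped:.1%}")
```

The share computed this way covers only the entries present in the list, so it is an indication rather than a corpus-wide figure.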

> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>                 Key: LUCENE-3745
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments:, top-100000.txt, top-1000000-pos.txt, top-pos.txt
> Stopword and stoptag lists for Japanese need to be developed, tested and integrated
> into Lucene.
