lucene-java-user mailing list archives

From Rahul Ratnakar <rahul.ratna...@gmail.com>
Subject Re: Need help "teaching" Japanese tokenizer to pick up slangs
Date Mon, 10 Mar 2014 19:08:32 GMT
Thanks Furkan. That is exactly the tool I am using; in my own code I have
already tried all of the search modes, e.g.

new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
    JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())

new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
    JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())

new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
    JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())



and none of them tokenizes these words the way I want, so I was wondering
whether there is some way for me to "update" the dictionary/corpus so that
these slang terms are caught by the tokenizer as single words.
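
What I was hoping for is something along these lines: pass a user dictionary
instead of null as the second constructor argument. This is only a rough
sketch, assuming the Lucene 4.6 UserDictionary(Reader) constructor and
Kuromoji's CSV entry format of "surface,segmentation,readings,part-of-speech";
the readings and the part-of-speech tag below are just placeholders:

import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.util.Version;

// One CSV entry per line; keeping surface == segmentation should keep each
// phrase as a single token. Readings and POS tags here are placeholders.
String entries =
      "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
    + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
    + "中出し,中出し,ナカダシ,カスタム名詞\n";
UserDictionary userDict = new UserDictionary(new StringReader(entries));

// Same constructor as above, but with the user dictionary instead of null.
JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
    Version.LUCENE_46,
    userDict,
    JapaneseTokenizer.Mode.SEARCH,
    JapaneseAnalyzer.getDefaultStopSet(),
    JapaneseAnalyzer.getDefaultStopTags());

If that is the right direction, is editing the bundled dictionary even
necessary, or is the user dictionary the intended mechanism for this?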


My example text has been scraped from an "adult" website, so it might be
offensive, and I apologize for that. A small excerpt from that website:


"裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"



On tokenizing this I get the list of tokens below (the snippet I use to dump
the tokens is shown after the list). My problem is that, as per my in-house
Japanese language expert, this breaks the word 無臭正 up into 無臭 and 正,
whereas it should be caught as a single word:

裏 びでお 無料 無臭 正 動画 無料 無料 a 動画
裏 びでお 無料 無臭 正 動画 無料 無料 a 動画
se く くすい 動画 無料
裏 ビデオ ヘンリ 塚本 ウラビデライフ 無料 動画
セッ く 動画 無料
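
For completeness, this is roughly how I dump the tokens; a minimal sketch in
which the field name "body" is arbitrary, and "analyzer" and "excerpt" stand
for the analyzer constructed above and the text being analyzed:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

TokenStream ts = analyzer.tokenStream("body", excerpt);
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // one token per line, as listed above
}
ts.end();
ts.close();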


Thanks,

Rahul





On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:

> Hi;
>
> Here is its page, which has an online Kuromoji tokenizer and more
> information: http://www.atilika.org/ It may help you.
>
> Thanks;
> Furkan KAMACI
>
>
> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratnakar@gmail.com>:
>
> > I am trying to analyze some Japanese web pages for the presence of
> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
> > tokenizer breaks the text up into proper words, I am more interested in
> > catching the slang terms, which seem to result from combining various
> > "safe" words.
> >
> > A few examples of words that, per our in-house Japanese language expert
> > (I have no knowledge of Japanese whatsoever), are slang and should be
> > caught "unbroken":
> >
> > 無臭正 - a bad word that we want to catch as is, but the tokenizer breaks
> > it up into 無臭 and 正, which are both apparently safe.
> >
> > ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
> > when combined.
> >
> > 中出し - broken into 中 and 出し, but it should have been left as is, as it
> > represents a bad phrase.
> >
> > Any help on how I can use the Kuromoji tokenizer, or any alternatives,
> > would be greatly appreciated.
> >
> > Thanks.
> >
>
