lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Need help "teaching" Japanese tokenizer to pick up slangs
Date Mon, 10 Mar 2014 19:13:13 GMT
You can pass a UserDictionary with your own entries to JapaneseAnalyzer (in
place of the null argument in your constructor calls) to do this.
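
As a rough illustration, the sketch below builds user-dictionary text in the
CSV layout that Kuromoji's sample user dictionary uses
(surface,segmentation,readings,part-of-speech). The class SlangUserDictionary
and its methods are hypothetical helpers, not part of Lucene. The katakana
readings and the part-of-speech label are assumptions that a Japanese speaker
should check. The Lucene wiring appears only in a comment, and the
UserDictionary(Reader) constructor named there is my understanding of the 4.6
API, so please verify it against the javadocs for your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Lucene): builds user-dictionary text in
// the layout surface,segmentation,readings,part-of-speech.
class SlangUserDictionary {

    // Custom part-of-speech label; the name is an arbitrary assumption.
    static final String POS = "カスタム名詞";

    // Emits one CSV line per entry. Repeating the surface form as its own
    // segmentation asks the tokenizer to keep the word as a single token.
    static String toCsv(Map<String, String> surfaceToReading) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : surfaceToReading.entrySet()) {
            String surface = e.getKey();
            sb.append(surface).append(',')
              .append(surface).append(',')
              .append(e.getValue()).append(',')
              .append(POS).append('\n');
        }
        return sb.toString();
    }

    // The three words from the thread; the readings are assumed, not verified.
    static Map<String, String> exampleEntries() {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("無臭正", "ムシュウセイ");
        entries.put("ハメ撮り", "ハメドリ");
        entries.put("中出し", "ナカダシ");
        return entries;
    }

    public static void main(String[] args) {
        String csv = toCsv(exampleEntries());
        // With lucene-analyzers-kuromoji-4.6.0.jar on the classpath, the text
        // would be handed over roughly like this (assumed 4.6 API):
        //   UserDictionary dict = new UserDictionary(new StringReader(csv));
        //   new JapaneseAnalyzer(Version.LUCENE_46, dict,
        //       JapaneseTokenizer.Mode.NORMAL,
        //       JapaneseAnalyzer.getDefaultStopSet(),
        //       JapaneseAnalyzer.getDefaultStopTags());
        System.out.print(csv);
    }
}
```

Keeping the entries in a plain text file instead of code works just as well;
the point is only that each slang word is listed once with itself as its
segmentation.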

On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar
<rahul.ratnakar@gmail.com> wrote:
> Thanks, Furkan. This is the exact tool that I am using, although in my code I
> have tried all of the tokenizer modes, e.g.:
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>
>
>
> and none of them tokenizes the words the way I want, so I was wondering whether
> there is some way for me to actually "update" the dictionary/corpus so that
> these slang terms are caught by the tokenizer as single words.
>
>
> My example text was scraped from an "adult" website, so it might be
> offensive, and I apologize for that. A small excerpt from that website:
>
>
> "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
> 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>
>
>
> On tokenizing, I get the list of tokens below. My problem is that, according to
> my in-house Japanese language expert, this list breaks up the word "無臭正"
> into 無臭 and 正, whereas it should be caught as a single word:
>
> 裏
> びでお
> 無料
> 無臭
> 正
> 動画
> 無料
> 無料
> a
> 動画
> 裏
> びでお
> 無料
> 無臭
> 正
> 動画
> 無料
> 無料
> a
> 動画
> se
> く
> くすい
> 動画
> 無料
> 裏
> ビデオ
> ヘンリ
> 塚本
> ウラビデライフ
> 無料
> 動画
> セッ
> く
> 動画
> 無料
>
>
> Thanks,
>
> Rahul
>
>
>
>
>
> On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
>
>> Hi;
>>
>> Here is a page with an online Kuromoji tokenizer and more
>> information: http://www.atilika.org/ It may help you.
>>
>> Thanks;
>> Furkan KAMACI
>>
>>
>> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratnakar@gmail.com>:
>>
>> > I am trying to analyze some Japanese web pages for the presence of
>> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
>> > tokenizer breaks the text up into proper words, I am more interested in
>> > catching the slang terms, which seem to result from combining various
>> > "safe" words.
>> >
>> > A few examples of words that, according to our in-house Japanese language
>> > expert (I have no knowledge of Japanese whatsoever), are slang and should
>> > be caught "unbroken":
>> >
>> > 無臭正 - is a bad word and we want to catch it as is, but the tokenizer
>> > breaks it up into 無臭 and 正, which are both apparently safe.
>> >
>> > ハメ撮り - it was broken into ハメ and 撮り, again both safe on their own
>> > but bad when combined.
>> >
>> > 中出し - broken into 中 and 出し, but it should have been left as is, as it
>> > represents a bad phrase.
>> >
>> > Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>> > would be greatly appreciated.
>> >
>> > Thanks.
>> >
>>


