lucene-java-user mailing list archives

From Rahul Ratnakar <rahul.ratna...@gmail.com>
Subject Re: Need help "teaching" Japanese tokenizer to pick up slangs
Date Mon, 10 Mar 2014 23:10:48 GMT
Worked perfectly for Japanese.

I have the same issue with the Chinese analyzer. I am using SmartChinese
(lucene-analyzers-smartcn-4.6.0.jar), but I don't see an interface similar to
the one the Japanese analyzer offers. Is there an easy way to implement the
same for Chinese?
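
For reference, a minimal sketch of the user-dictionary approach that worked on
the Japanese side, assuming the Lucene 4.6 Kuromoji API; the dictionary entry,
reading, POS tag, and class name below are illustrative rather than taken from
this thread:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.ja.dict.UserDictionary;
    import org.apache.lucene.util.Version;

    public class UserDictExample {
        public static void main(String[] args) throws IOException {
            // One entry per line: surface,segmentation,reading,part-of-speech.
            // Keeping the segmentation identical to the surface form tells the
            // tokenizer to emit the term as a single token. The reading and the
            // POS tag here are placeholders.
            String entries = "無臭正,無臭正,ムシュウセイ,カスタム名詞\n";
            UserDictionary userDict = new UserDictionary(new StringReader(entries));

            Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_46, userDict,
                    JapaneseTokenizer.Mode.SEARCH,
                    JapaneseAnalyzer.getDefaultStopSet(),
                    JapaneseAnalyzer.getDefaultStopTags());

            // ... use the analyzer for indexing or analysis ...
            analyzer.close();
        }
    }

Because the segmentation column matches the surface form, the tokenizer should
emit 無臭正 as one token instead of splitting it into 無臭 and 正.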


On Mon, Mar 10, 2014 at 3:26 PM, Rahul Ratnakar <rahul.ratnakar@gmail.com> wrote:

> Thanks Robert. This was exactly what I was looking for; I will try it.
>
>
> On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> You can pass a UserDictionary with your own entries to do this.
>>
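
(For context: the Kuromoji user dictionary is a plain CSV, one entry per line
in the form surface,segmentation,reading,part-of-speech. A hypothetical file
covering the terms discussed later in this thread might look like the lines
below; the readings and POS tags are placeholders only.)

    無臭正,無臭正,ムシュウセイ,カスタム名詞
    ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞
    中出し,中出し,ナカダシ,カスタム名詞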
>> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar
>> <rahul.ratnakar@gmail.com> wrote:
>> > Thanks Furkan. This is exactly the tool I am using, although in my code I
>> > have tried all of the search modes, e.g.
>> >
>> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
>> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>> >
>> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
>> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>> >
>> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
>> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>> >
>> > and none of them seem to tokenize the words as I want, so I was wondering
>> > whether there is some way for me to actually "update" the dictionary/corpus
>> > so that these slang terms are caught by the tokenizer as a single word.
>> >
>> >
>> > My example text has been scraped from an "adult" website, so it might be
>> > offensive, and I apologize for that. A small excerpt from that website:
>> >
>> >
>> > "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
>> > 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>> >
>> >
>> >
>> > On tokenizing I get the list of tokens below. My problem is that, as per my
>> > in-house Japanese language expert, this list breaks up the word "無臭正"
>> > into 無臭 and 正, whereas it should be caught as a single word:
>> >
>> > 裏
>> > びでお
>> > 無料
>> > 無臭
>> > 正
>> > 動画
>> > 無料
>> > 無料
>> > a
>> > 動画
>> > 裏
>> > びでお
>> > 無料
>> > 無臭
>> > 正
>> > 動画
>> > 無料
>> > 無料
>> > a
>> > 動画
>> > se
>> > く
>> > くすい
>> > 動画
>> > 無料
>> > 裏
>> > ビデオ
>> > ヘンリ
>> > 塚本
>> > ウラビデライフ
>> > 無料
>> > 動画
>> > セッ
>> > く
>> > 動画
>> > 無料
>> >
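
(A token listing like the one above can be produced with the standard Lucene
TokenStream API; a minimal sketch follows, with the field name and helper
class purely illustrative.)

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    class TokenDump {
        // Prints one token per line for the given text, using any Analyzer
        // (for example one of the JapaneseAnalyzer configurations quoted above).
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }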
>> > Thanks,
>> >
>> > Rahul
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
>> >
>> >> Hi;
>> >>
>> >> Here is its page, which has an online Kuromoji tokenizer and
>> >> information: http://www.atilika.org/ It may help you.
>> >>
>> >> Thanks;
>> >> Furkan KAMACI
>> >>
>> >>
>> >> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratnakar@gmail.com>:
>> >>
>> >> > I am trying to analyze some Japanese web pages for the presence of
>> >> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
>> >> > tokenizer breaks the text up into proper words, I am more interested in
>> >> > catching the slang terms, which seem to result from combining various
>> >> > "safe" words.
>> >> >
>> >> > A few examples of words that, as per our in-house Japanese language
>> >> > expert (I have no knowledge of Japanese whatsoever), are slang and should
>> >> > be caught "unbroken" are:
>> >> >
>> >> > 無臭正 - a bad word that we want to catch as is, but the tokenizer breaks
>> >> > it up into 無臭 and 正, which are both apparently safe.
>> >> >
>> >> > ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
>> >> > when combined.
>> >> >
>> >> > 中出し - broken into 中 and 出し, but it should have been left as is, as it
>> >> > represents a bad phrase.
>> >> >
>> >> > Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>> >> > would be greatly appreciated.
>> >> >
>> >> > Thanks.
>> >> >
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
