lucene-java-user mailing list archives

From Rahul Ratnakar <rahul.ratna...@gmail.com>
Subject Re: Need help "teaching" Japanese tokenizer to pick up slangs
Date Mon, 10 Mar 2014 19:26:14 GMT
Thanks Robert. This is exactly what I was looking for; I will try it.
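
A minimal sketch of what Robert suggests below: build a UserDictionary from your
own entries and pass it to JapaneseAnalyzer in place of the null argument. This
assumes the Lucene 4.6 Kuromoji API; the readings and the カスタム名詞
part-of-speech tag in the entries are placeholder guesses for illustration, not
vetted values.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class UserDictExample {
    public static void main(String[] args) throws IOException {
        // User dictionary format: surface,segmentation,readings,part-of-speech.
        // Listing each phrase as a single segment keeps the tokenizer from
        // splitting it. Readings/POS below are placeholders, not vetted values.
        String entries =
              "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
            + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
            + "中出し,中出し,ナカダシ,カスタム名詞\n";

        // Lucene 4.6 exposes a public Reader constructor; newer releases use
        // UserDictionary.open(Reader) instead.
        UserDictionary userDict = new UserDictionary(new StringReader(entries));

        Analyzer analyzer = new JapaneseAnalyzer(Version.LUCENE_46, userDict,
                JapaneseTokenizer.Mode.SEARCH,
                JapaneseAnalyzer.getDefaultStopSet(),
                JapaneseAnalyzer.getDefaultStopTags());

        String text = "裏びでお無料・無臭正動画無料";
        try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // With the entry above, 無臭正 should surface as a single token.
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}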


On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <rcmuir@gmail.com> wrote:

> You can pass a UserDictionary with your own entries to do this.
>
> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar
> <rahul.ratnakar@gmail.com> wrote:
> > Thanks Furkan. That is exactly the tool I am using; in my code I have
> > tried all of the tokenizer modes, e.g.
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
> >     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
> >
> > and none of them tokenizes the words the way I want, so I was wondering
> > whether there is some way for me to "update" the dictionary/corpus so that
> > these slang terms are caught by the tokenizer as single words.
> >
> >
> > My example text has been scraped from an "adult" website, so it might be
> > offensive, and I apologize for that. A small excerpt from that website:
> >
> >
> > "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
> > 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
> >
> >
> >
> > On tokenizing it I get the list of tokens below. My problem is that, per my
> > in-house Japanese language expert, this breaks up the word 無臭正
> > into 無臭 and 正, whereas it should be caught as a single word:
> >
> > 裏
> > びでお
> > 無料
> > 無臭
> > 正
> > 動画
> > 無料
> > 無料
> > a
> > 動画
> > 裏
> > びでお
> > 無料
> > 無臭
> > 正
> > 動画
> > 無料
> > 無料
> > a
> > 動画
> > se
> > く
> > くすい
> > 動画
> > 無料
> > 裏
> > ビデオ
> > ヘンリ
> > 塚本
> > ウラビデライフ
> > 無料
> > 動画
> > セッ
> > く
> > 動画
> > 無料
> >
> >
> > Thanks,
> >
> > Rahul
> >
> > On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
> >
> >> Hi;
> >>
> >> Here is the page that has an online Kuromoji tokenizer and documentation:
> >> http://www.atilika.org/ It may help you.
> >>
> >> Thanks;
> >> Furkan KAMACI
> >>
> >>
> >> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratnakar@gmail.com>:
> >>
> >> > I am trying to analyze some Japanese web pages for the presence of
> >> > slang/adult phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
> >> > tokenizer breaks the text up into proper words, I am more interested in
> >> > catching the slang terms, which seem to result from combining various
> >> > "safe" words.
> >> >
> >> > A few examples of words that, per our in-house Japanese language expert
> >> > (I have no knowledge of Japanese whatsoever), are slang and should be
> >> > caught "unbroken":
> >> >
> >> > 無臭正 - a bad word that we want to catch as is, but the tokenizer breaks
> >> > it up into 無臭 and 正, which are both apparently safe.
> >> >
> >> > ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
> >> > when combined.
> >> >
> >> > 中出し - broken into 中 and 出し, but it should have been left as is, as it
> >> > represents a bad phrase.
> >> >
> >> > Any help on how I can use the Kuromoji tokenizer, or any alternative,
> >> > would be greatly appreciated.
> >> >
> >> > Thanks.
> >> >
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
