lucene-java-user mailing list archives

From Me <stone54321...@mac.com>
Subject Re: Need help "teaching" Japanese tokenizer to pick up slangs
Date Tue, 11 Mar 2014 01:24:50 GMT
Hi everybody

UserDictionary is right.
I am using the Yahoo! Japanese morphological analysis API (日本語形態素解析) to build my own user dictionary.
http://developer.yahoo.co.jp/webapi/jlp/
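For reference, passing a UserDictionary to JapaneseAnalyzer in Lucene 4.6 looks roughly like the sketch below. The surface forms are the ones from this thread; the readings and the カスタム名詞 ("custom noun") part-of-speech tag are illustrative guesses, not verified dictionary entries:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.util.Version;

public class UserDictExample {
    public static void main(String[] args) throws IOException {
        // Kuromoji user-dictionary CSV format:
        //   surface,segmentation,readings,part-of-speech
        // The readings and POS tags below are guesses for illustration only.
        String entries =
              "無臭正,無臭正,ムシュウセイ,カスタム名詞\n"
            + "ハメ撮り,ハメ撮り,ハメドリ,カスタム名詞\n"
            + "中出し,中出し,ナカダシ,カスタム名詞\n";
        UserDictionary userDict = new UserDictionary(new StringReader(entries));

        // Same constructor as used elsewhere in this thread, but with the
        // user dictionary in place of the null second argument.
        JapaneseAnalyzer analyzer = new JapaneseAnalyzer(
            Version.LUCENE_46,
            userDict,
            JapaneseTokenizer.Mode.SEARCH,
            JapaneseAnalyzer.getDefaultStopSet(),
            JapaneseAnalyzer.getDefaultStopTags());
        // Text run through this analyzer should now emit 無臭正, ハメ撮り
        // and 中出し as single tokens instead of their split components.
    }
}
```

The key point is the second constructor argument: every call quoted below in this thread passes null there, which is why the default dictionary's segmentation always wins.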

On 2014/03/11, at 8:10, Rahul Ratnakar wrote:

> Worked perfectly for Japanese.
> 
> I have the same issue with the Chinese analyzer. I am using SmartChinese
> (lucene-analyzers-smartcn-4.6.0.jar), but I don't see an interface similar
> to the Japanese analyzer's. Is there an easy way to implement the same for
> Chinese?
> 
> 
> On Mon, Mar 10, 2014 at 3:26 PM, Rahul Ratnakar <rahul.ratnakar@gmail.com> wrote:
> 
>> Thanks Robert. This was exactly what I was looking for, will try this.
>> 
>> 
>> On Mon, Mar 10, 2014 at 3:13 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> 
>>> You can pass UserDictionary with your own entries to do this.
>>> 
>>> On Mon, Mar 10, 2014 at 3:08 PM, Rahul Ratnakar
>>> <rahul.ratnakar@gmail.com> wrote:
>>>> Thanks Furkan, this is the exact tool I am using, albeit in my code I
>>>> have tried all the search modes, e.g.
>>>> 
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.NORMAL,
>>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>> 
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.EXTENDED,
>>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>> 
>>>> new JapaneseAnalyzer(Version.LUCENE_46, null, JapaneseTokenizer.Mode.SEARCH,
>>>>     JapaneseAnalyzer.getDefaultStopSet(), JapaneseAnalyzer.getDefaultStopTags())
>>>> 
>>>> 
>>>> 
>>>> and none of them seem to tokenize the words as I want, so I was wondering
>>>> if there is some way for me to actually "update" the dictionary/corpus so
>>>> that these slangs are caught by the tokenizer as single words.
>>>> 
>>>> 
>>>> My example text has been scraped from an "adult" website, so it might be
>>>> offensive, and I apologize for that. A small excerpt from that website:
>>>> 
>>>> 
>>>> "裏びでお無料・無臭正 動画無料・無料aだると動画 裏びでお無料・無臭正
>>>> 動画無料・無料aだると動画・seっくす動画無料・裏ビデオ・ヘンリー塚本・ウラビデライフ無料動画・セッく動画無料・"
>>>> 
>>>> 
>>>> 
>>>> On tokenizing, I get the list of tokens below. My problem is that, as per
>>>> my in-house Japanese language expert, this list breaks up the word 無臭正
>>>> into 無臭 and 正, whereas it should be caught as a single word:
>>>> 
>>>> 裏
>>>> 
>>>> びでお
>>>> 
>>>> 無料
>>>> 
>>>> 無臭
>>>> 
>>>> 正
>>>> 
>>>> 動画
>>>> 
>>>> 無料
>>>> 
>>>> 無料
>>>> 
>>>> a
>>>> 
>>>> 動画
>>>> 
>>>> 裏
>>>> 
>>>> びでお
>>>> 
>>>> 無料
>>>> 
>>>> 無臭
>>>> 
>>>> 正
>>>> 
>>>> 動画
>>>> 
>>>> 無料
>>>> 
>>>> 無料
>>>> 
>>>> a
>>>> 
>>>> 動画
>>>> 
>>>> se
>>>> 
>>>> く
>>>> 
>>>> くすい
>>>> 
>>>> 動画
>>>> 
>>>> 無料
>>>> 
>>>> 裏
>>>> 
>>>> ビデオ
>>>> 
>>>> ヘンリ
>>>> 
>>>> 塚本
>>>> 
>>>> ウラビデライフ
>>>> 
>>>> 無料
>>>> 
>>>> 動画
>>>> 
>>>> セッ
>>>> 
>>>> く
>>>> 
>>>> 動画
>>>> 
>>>> 無料
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Rahul
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Mar 10, 2014 at 2:09 PM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
>>>> 
>>>>> Hi;
>>>>> 
>>>>> Here is the Kuromoji page, which has an online tokenizer demo and more
>>>>> information: http://www.atilika.org/ It may help you.
>>>>> 
>>>>> Thanks;
>>>>> Furkan KAMACI
>>>>> 
>>>>> 
>>>>> 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar <rahul.ratnakar@gmail.com>:
>>>>> 
>>>>>> I am trying to analyze some Japanese web pages for the presence of
>>>>>> slang/adult phrases in them using lucene-analyzers-kuromoji-4.6.0.jar.
>>>>>> While the tokenizer breaks the text up into proper words, I am more
>>>>>> interested in catching the slangs, which seem to result from combining
>>>>>> various "safe" words.
>>>>>> 
>>>>>> A few examples of words that, as per our in-house Japanese language
>>>>>> expert (I have no knowledge of Japanese whatsoever), are slangs and
>>>>>> should be caught "unbroken" are:
>>>>>> 
>>>>>> 無臭正 - is a bad word and we want to catch it as is, but the tokenizer
>>>>>> breaks it up into 無臭 and 正, which are both apparently safe.
>>>>>> 
>>>>>> ハメ撮り - was broken into ハメ and 撮り, again both safe on their own
>>>>>> but bad when combined.
>>>>>> 
>>>>>> 中出し - broken into 中 and 出し, but should have been left as is, as
>>>>>> it represents a bad phrase.
>>>>>> 
>>>>>> Any help on how I can use the Kuromoji tokenizer, or any alternatives,
>>>>>> would be greatly appreciated.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> 
>>> 
>> 

