lucene-java-user mailing list archives

From Rahul Ratnakar
Subject Need help "teaching" Japanese tokenizer to pick up slangs
Date Mon, 10 Mar 2014 17:57:33 GMT
I am trying to analyze some Japanese web pages for the presence of slang/adult
phrases using lucene-analyzers-kuromoji-4.6.0.jar. While the
tokenizer breaks the text up into proper words, I am more interested in
catching the slang terms that seem to result from combining various "safe"
words.

A few examples of words that, according to our in-house Japanese language expert (I
have no knowledge of Japanese whatsoever), are slang and should be caught
"unbroken":

無臭正 - a bad word that we want to catch as is, but the tokenizer breaks
it up into 無臭 and 正, which are both apparently safe.

ハメ撮り - broken into ハメ and 撮り, again both safe on their own but bad
when combined.

中出し - broken into 中 and 出し, but it should have been left as is because it
represents a bad phrase.
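
To make the goal concrete, here is a minimal sketch of the behaviour we are after. It makes no Lucene calls: the token list is hard-coded from the splits reported above as a stand-in for real tokenizer output, and the class name PhraseRejoiner, the method findPhrases and the maxTokens window are hypothetical names used only for illustration. It re-joins adjacent tokens and checks the result against a phrase list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseRejoiner {

    // Hypothetical helper: returns every listed phrase that can be formed by
    // concatenating up to maxTokens consecutive tokens, in order of appearance.
    static List<String> findPhrases(List<String> tokens, Set<String> phrases, int maxTokens) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            int end = Math.min(tokens.size(), start + maxTokens);
            for (int i = start; i < end; i++) {
                // Grow the candidate one token at a time and test each prefix.
                joined.append(tokens.get(i));
                if (phrases.contains(joined.toString())) {
                    hits.add(joined.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Phrases that should be caught "unbroken".
        Set<String> phrases = new HashSet<>(Arrays.asList("無臭正", "ハメ撮り", "中出し"));
        // Assumption: stand-in for tokenizer output, using the splits described above.
        List<String> tokens = Arrays.asList("無臭", "正", "ハメ", "撮り", "中", "出し");
        System.out.println(findPhrases(tokens, phrases, 3));
    }
}
```

This only shows the matching we want after tokenization. Ideally the tokenizer itself would be able to emit these phrases as single tokens.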

Any help on how I can use the Kuromoji tokenizer for this, or any alternatives,
would be greatly appreciated.

