Date: Wed, 4 Apr 2018 09:06:00 +0000 (UTC)
From: "Jim Ferenczi (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.13148713.1522272070000.179034.1522832760118@Atlassian.JIRA>
In-Reply-To: <JIRA.13148713.1522272070000@Atlassian.JIRA>
References: <JIRA.13148713.1522272070000@Atlassian.JIRA> <JIRA.13148713.1522272070588@jira-lw-us.apache.org>
Subject: [jira] [Commented] (LUCENE-8231) Nori, a Korean analyzer based on
 mecab-ko-dic


    [ https://issues.apache.org/jira/browse/LUCENE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425218#comment-16425218 ] 

Jim Ferenczi commented on LUCENE-8231:
--------------------------------------

Hi Robert,
I pushed another iteration that moves the decompounding and the POS filtering into the tokenizer. I think it's simpler to do both directly in the tokenizer, and this also makes it possible to keep the compound token: I added a decompound mode option that either disables decompounding (none), decompounds and discards the compound token (discard), or decompounds and keeps the original token (mixed). By default the compound token is discarded, but it can be kept using the mixed mode.
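
To make the three modes concrete, here is a minimal sketch in Python rather than the actual tokenizer code. The `Token` class, the `emit` function and the sample decomposition are hypothetical; they only illustrate which tokens each mode would emit.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Mode names taken from the comment above; everything else is illustrative.
MODES = ("none", "discard", "mixed")


@dataclass(frozen=True)
class Token:
    surface: str
    parts: Tuple[str, ...] = ()  # empty when the token is not a compound


def emit(tokens: List[Token], mode: str = "discard") -> List[str]:
    """Return the surface forms emitted for the given decompound mode."""
    if mode not in MODES:
        raise ValueError(f"unknown decompound mode: {mode}")
    out: List[str] = []
    for tok in tokens:
        is_compound = len(tok.parts) > 0
        # Simple tokens are always kept; compounds are kept in none and mixed.
        if not is_compound or mode in ("none", "mixed"):
            out.append(tok.surface)
        # The parts of a compound are emitted in discard and mixed.
        if is_compound and mode in ("discard", "mixed"):
            out.extend(tok.parts)
    return out


# Toy input: one compound with an assumed decomposition, one simple token.
tokens = [Token("가곡역", ("가곡", "역")), Token("에서")]
for mode in MODES:
    print(mode, emit(tokens, mode))
```

In this sketch the default is discard, matching the default described above.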
I also changed the normalization option used when building the dictionary: instead of adding both the normalized form and the original form, the builder now replaces the original form with the normalized one. Normalization is disabled by default, but it can be useful for other Korean dictionaries that use a decomposed form for Hangul, such as Handic:
https://ja.osdn.net/projects/handic/
I added more tests and javadocs; I think it's getting closer ;)
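
For readers unfamiliar with decomposed Hangul, the following Python sketch shows what replacing an entry with its normalized form can look like. It assumes NFKC as the normalization form, which is my choice for the example and not something stated in this message; `normalize_entry` and `build_entries` are hypothetical helpers, not the dictionary builder's API.

```python
import unicodedata
from typing import Iterable, List


def normalize_entry(surface: str) -> str:
    # NFKC composes conjoining jamo into precomposed Hangul syllables.
    return unicodedata.normalize("NFKC", surface)


def build_entries(surfaces: Iterable[str], normalize: bool = False) -> List[str]:
    # The normalized form replaces the original; it is not added next to it.
    return [normalize_entry(s) if normalize else s for s in surfaces]


# "한글" written as six conjoining jamo code points (decomposed form).
decomposed = "\u1112\u1161\u11ab\u1100\u1173\u11af"

kept = build_entries([decomposed])                       # normalization off
replaced = build_entries([decomposed], normalize=True)   # normalization on
print(len(kept[0]), len(replaced[0]))
```

With normalization off the entry keeps its six jamo code points; with it on, the entry becomes the two precomposed syllables.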


> Nori, a Korean analyzer based on mecab-ko-dic
> ---------------------------------------------
>
>                 Key: LUCENE-8231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8231
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8231-remap-hangul.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch, LUCENE-8231.patch
>
>
> There is a dictionary similar to IPADIC but for Korean, called mecab-ko-dic.
> It is available under an Apache license here:
> https://bitbucket.org/eunjeon/mecab-ko-dic
> This dictionary was built with MeCab and defines a feature format adapted to the Korean language.
> Since the Kuromoji tokenizer uses the same format for the morphological analysis (left cost + right cost + word cost) I tried to adapt the module to handle Korean with the mecab-ko-dic. I've started with a POC that copies the Kuromoji module and adapts it for the mecab-ko-dic.
> I used the same classes to build and read the dictionary but I had to make some modifications to handle the differences with the IPADIC and Japanese. 
> The resulting binary dictionary takes 28MB on disk. It's bigger than the IPADIC one, mainly because the source is bigger and there are a lot of
> compound and inflected terms that define a group of terms and the segmentation that can be applied.
> I attached the patch that contains this new Korean module called -godori- nori. It is an adaptation of the Kuromoji module so currently
> the two modules don't share any code. I wanted to validate the approach first and check the relevancy of the results. I don't speak Korean so I used the relevancy
> tests that were added for another Korean tokenizer (https://issues.apache.org/jira/browse/LUCENE-4956) and tested the output against mecab-ko, which is the official fork of MeCab for use with the mecab-ko-dic.
> I had to simplify the JapaneseTokenizer: my version removes the nBest output and the decomposition of overly long tokens. I also
> modified the handling of whitespaces since they are important in Korean. Whitespaces that appear before a term are attached to that term and this
> information is used to compute a penalty based on the Part of Speech of the token. The penalty cost is a feature added to mecab-ko to handle 
> morphemes that should not appear after a morpheme and is described on the mecab-ko page:
> https://bitbucket.org/eunjeon/mecab-ko
> Ignoring whitespace is also more in line with the official MeCab library, which attaches whitespace to the term that follows.
> I also added a decompounder filter that expands the compounds and inflected forms defined in the dictionary and a part of speech filter similar to the Japanese one
> that removes the morphemes that are not useful for relevance (suffix, prefix, interjection, ...). These filters don't play well with the tokenizer if it can
> output multiple paths (nBest output for instance) so for simplicity I removed this ability and the Korean tokenizer only outputs the best path.
> I compared the result with mecab-ko to confirm that the analyzer is working and ran the relevancy test that is defined in HantecRel.java included
> in the patch (written by Robert for another Korean analyzer). Here are the results:
> ||Analyzer||Index Time||Index Size||MAP(CLASSIC)||MAP(BM25)||MAP(GL2)||
> |Standard|35s|131MB|.007|.1044|.1053|
> |CJK|36s|164MB|.1418|.1924|.1916|
> |Korean|212s|90MB|.1628|.2094|.2078|
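
For reference, the MAP columns above report mean average precision. The Python sketch below shows the usual definition on made-up document ids; it is not the code used by HantecRel.java or the benchmark module, and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries: (ranked result list, set of relevant documents).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```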
> I find the results very promising so I plan to continue to work on this project. I started to extract the part of the code that could be shared with the
> Kuromoji module but I wanted to share the status and this POC first to confirm that this approach is viable. The advantages of using the same model as
> the Japanese analyzer are multiple: we don't have a Korean analyzer at the moment ;), the resulting dictionary is small compared to other libraries that
> use the mecab-ko-dic (the FST takes only 5.4MB) and the Tokenizer prunes the lattice on the fly to select the best path efficiently.
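
As an illustration of the cost model mentioned above (a word cost per entry plus a connection cost between the right id of one entry and the left id of the next), here is a small best-path search in Python. It is a simplified sketch, not the tokenizer's implementation: it builds the whole lattice in memory instead of pruning it on the fly, and the `Entry` and `Node` types, the ids and the costs are all invented for the example.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    surface: str    # assumed non-empty
    left_id: int
    right_id: int
    word_cost: int


class Node(NamedTuple):
    entry: Optional[Entry]   # None for the begin-of-sentence node
    total: int               # cheapest cost of any path ending in this node
    prev: Optional["Node"]


def best_path(text: str, entries: List[Entry],
              conn: Dict[Tuple[int, int], int]) -> List[str]:
    """Return the cheapest segmentation of text (missing connections cost 0)."""
    ends: List[List[Node]] = [[] for _ in range(len(text) + 1)]
    ends[0].append(Node(None, 0, None))  # begin-of-sentence, right id 0
    for pos in range(len(text)):
        if not ends[pos]:
            continue  # no path reaches this offset
        for e in entries:
            if not text.startswith(e.surface, pos):
                continue
            # Keep only the cheapest predecessor for this lattice node.
            cost, prev = min(
                ((p.total
                  + conn.get((p.entry.right_id if p.entry else 0, e.left_id), 0)
                  + e.word_cost, p) for p in ends[pos]),
                key=lambda c: c[0])
            ends[pos + len(e.surface)].append(Node(e, cost, prev))
    if not ends[-1]:
        raise ValueError("no segmentation covers the whole input")
    node = min(ends[-1], key=lambda n: n.total)
    path: List[str] = []
    while node.entry is not None:  # walk the back pointers
        path.append(node.entry.surface)
        node = node.prev
    return path[::-1]


entries = [
    Entry("ab", 1, 1, 10), Entry("cd", 2, 2, 10), Entry("abcd", 3, 3, 25),
    Entry("a", 4, 4, 3), Entry("bcd", 5, 5, 4),
]
print(best_path("abcd", entries, {}))            # word costs only
print(best_path("abcd", entries, {(4, 5): 50}))  # penalize "a" followed by "bcd"
```

With word costs alone the cheapest path is "a" + "bcd"; adding a high connection cost between those two entries makes "ab" + "cd" win, which is the kind of trade-off the left cost + right cost + word cost format encodes.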
> The dictionary can be built directly from the godori module with the following command:
> ant regenerate (you need to create the resource directory (mkdir lucene/analysis/godori/src/resources/org/apache/lucene/analysis/ko/dict) first since the dictionary is not included in the patch).
> I've also added some minimal tests in the module to play with the analysis.

