From java-user-return-64351-archive-asf-public=cust-asf.ponee.io@lucene.apache.org  Sun May 26 00:03:14 2019
MIME-Version: 1.0
References: <CAGUSZHA3U_vWpRfxQb4jttT7sAOu+uaU8MfvXSYgNP9s9JNsXw@mail.gmail.com>
 <CAB_JB7c3UBhQgcdXXDh8LCU0bwK0Hj9rFyw53v_CjSoR=aVHYg@mail.gmail.com> <CAHpHujkgLu8t4KgYZzeRJNtcS-FOVAAtRFWgsBSueVZokJ=9hg@mail.gmail.com>
In-Reply-To: <CAHpHujkgLu8t4KgYZzeRJNtcS-FOVAAtRFWgsBSueVZokJ=9hg@mail.gmail.com>
From: Michael Sokolov <msokolov@gmail.com>
Date: Sat, 25 May 2019 20:02:56 -0400
Message-ID: <CAGUSZHB49GEGXagiMXmuDbdhKiUh__xjO_Q+TNi8ruvPQ=YtSg@mail.gmail.com>
Subject: Re: JapaneseAnalyzer's system vs user dict
To: java-user@lucene.apache.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thank you for the detailed responses! What Tomoko says is consistent
with my cursory reading of the code. I asked because I have a customer
who thinks they want to replace the system dictionary, and I am trying
to work out whether that is necessary. It seems that, for the most
part, we can supply a comprehensive user dictionary that effectively
takes the place of the system dictionary, provided it is a superset
(covers at least the original system dict tokens). There is probably
no way to "remove" a token that is present in the system dictionary,
though (or maybe it can effectively be removed by adding it to the
user dictionary with a high penalty?). I'm not sure why one would want
to remove a token; I'm just trying to understand the design
parameters.
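
To check my understanding of the lookup order Tomoko describes, here
is a minimal sketch in plain Java. It is a toy model, not Lucene code:
the class and method names (DictPriority, matchesAt, candidatesAt) are
made up for illustration, plain sets stand in for the two FSTs, costs
and Viterbi are left out, and the entries are romanized stand-ins for
Japanese text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of "user dictionary first, system dictionary only as a fallback".
// Hypothetical names throughout; this is not the JapaneseTokenizer implementation.
class DictPriority {

    /** Collects every dictionary entry that starts at position pos in text, shortest first. */
    static List<String> matchesAt(Set<String> dict, String text, int pos) {
        List<String> out = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                out.add(candidate);
            }
        }
        return out;
    }

    /**
     * Returns the candidates considered at pos: if the user dictionary
     * matches anything there, the system dictionary is not consulted at all.
     */
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> user = matchesAt(userDict, text, pos);
        if (!user.isEmpty()) {
            return user; // system candidates at this position are suppressed
        }
        return matchesAt(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> system = Set.of("kansai", "kokusai", "kuko", "kansaikokusaikuko");
        Set<String> user = Set.of("kansai");
        String text = "kansaikokusaikuko";

        // Position 0: the user entry matches, so the system's longer compound is never seen.
        System.out.println(candidatesAt(user, system, text, 0)); // [kansai]
        // Position 6: no user entry starts here, so the system dictionary is used.
        System.out.println(candidatesAt(user, system, text, 6)); // [kokusai]
    }
}
```

If this model is right, a user entry does not delete a system entry; it
only hides the system candidates that start at the same position.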

On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
<tomoko.uchida.1111@gmail.com> wrote:
>
> Hi,
>
> > If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?
>
> The user dictionary is independent of the system dictionary. If you
> supply user entries, JapaneseTokenizer builds two FSTs, one for the
> built-in dictionary and one for the user dictionary, and they are
> looked up separately.
>
> The user dictionary is looked up first, and only if no entries match
> is the system dictionary looked up. So if any entry is found in the
> user dictionary, all possible candidates in the system dictionary are
> ignored (suppressed).
>
> (I think this is kuromoji-specific behaviour; the original MeCab POS
> tagger looks up both the system dictionary and the user dictionary and
> compares their weights by running Viterbi. In fact this behaviour,
> always giving priority to the entries in the user dictionary, is a bit
> too aggressive from the point of view of tokenization consistency. I
> do not know why it was done this way, but there may be some
> performance reasons?)
>
> I think you can easily find the lookup logic I described here in the
> JapaneseTokenizer#parse() method. (Let me know if my understanding is
> not correct.)
>
> Regards,
> Tomoko
>
> On Sun, May 26, 2019 at 5:08, 김남규 <kng0828@gmail.com> wrote:
> >
> > Hi, Mike :D
> >
> > JapaneseAnalyzer does not load a user dictionary by default.
> > If you look at the constructor, you can see that the user dictionary
> > is null unless you pass one as a parameter.
> > (See testUserDict3() in TestJapaneseAnalyzer.java.)
> >
> > In JapaneseTokenizer:
> > =============================================
> > if (userDictionary != null) {
> >   userFST = userDictionary.getFST();
> >   userFSTReader = userFST.getBytesReader();
> > } else {
> >   userFST = null;
> >   userFSTReader = null;
> > }
> > =============================================
> > Because the user dictionary is created as a separate UserDictionary
> > object and passed in, there is no conflict between the user dictionary
> > and the system dictionary.
> > (You may choose only one of them! -> meaning the userFST instance in
> > JapaneseTokenizer)
> >
> > About the dictionary:
> > Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> > You can find it in org.apache.lucene.analysis.ja.dict.
> > It is derived from MeCab, which uses the Viterbi algorithm.
> > Lucene converts the MeCab dictionary (some .dat files in Lucene) to an
> > FST and uses that.
> > But it can't satisfy all users.
> > Depending on the situation, some users may need a custom dictionary.
> > The same is true for Nori (the Korean analyzer), available since
> > Lucene 7.4. (The basic logic (MeCab + FST) is similar to the Japanese
> > analyzer.)
> > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > dictionary is 24MB.
> > If a user needs a 100MB dictionary, the user must build and use it
> > themselves.
> > (Modify MeCab dictionary -> Training -> Porting to Lucene)
> >
> > If anyone finds wrong information in my reply, please reply with a
> > correction.
> >
> > Thank you,
> > Namgyu Kim
> >
> >
> > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > > I'm trying to understand the relationship between the system and user
> > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > provide a user dictionary; the system one is built in. Are they
> > > otherwise the same kind of thing? If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?  I hunted for documentation, but
> > > didn't find answers to these questions, and the code is pretty
> > > involved, so any pointers would be greatly appreciated.
> > >
> > > -Mike
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>