From: Michael Sokolov
Date: Sat, 25 May 2019 20:02:56 -0400
Subject: Re: JapaneseAnalyzer's system vs user dict
To: java-user@lucene.apache.org

Thank you for the detailed responses!
What Tomoko is saying seems consistent with my cursory reading of the
code. The reason I asked is I have a customer that thinks they want to
replace the system dictionary, and I am trying to see if that is
necessary. It seems as if for the most part, we can supply a
comprehensive user dictionary and it would pretty much take the place
of the system dictionary, assuming it is a superset (covers at least
the original system dict tokens), but there is probably no way to
"remove" a token that is present in the system dictionary (or maybe it
can effectively be removed by adding it to user dictionary with a high
penalty?). I'm not sure why one would want to do this removal, just
trying to understand the design parameters.

On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida wrote:
>
> Hi,
>
> > If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> First the user dictionary is retrieved, and if there are no entries
> matched then the system dictionary is retrieved. So if any entry is
> found in the user dictionary, all possible candidates in the system
> dictionary are ignored (suppressed).
>
> (I think this is kuromoji specific behaviour, the original mecab pos
> tagger retrieves both of the system dictionary and user dictionary and
> compares their weights by performing Viterbi. In fact the behaviour -
> always gives priority to the entries in the user dictionary - is a bit
> too aggressive from the point of view of the consistency of
> tokenization. I do not know why, but there may be some performance
> reasons?)
>
> I think you can easily find the retrieval logic I described here in
> JapaneseTokenizer#parse() method. (Let me know if my understanding is
> not correct.)
>
> Regards,
> Tomoko
>
> On Sun, May 26, 2019 at 5:08, 김남규 (Namgyu Kim) wrote:
> >
> > Hi, Mike :D
> >
> > Japanese Analyzer does not load dictionaries by default.
> > If you look at the constructor, you can see that it is created as null if
> > not set parameters.
> > (check testUserDict3() in TestJapaneseAnalyzer.java)
> >
> > In JapaneseTokenizer,
> > =============================================
> > if (userDictionary != null) {
> >   userFST = userDictionary.getFST();
> >   userFSTReader = userFST.getBytesReader();
> > } else {
> >   userFST = null;
> >   userFSTReader = null;
> > }
> > =============================================
> > Since it is a way to create and pass the UserDictionary object, there is no
> > conflict between user dictionary and system dictionary.
> > (You may choose only one of them! -> means userFST instance in
> > JapaneseTokenizer)
> >
> > About dictionary,
> > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > You can check it in org.apache.lucene.analysis.ja.dict.
> > It called MeCab which uses the Viterbi algorithm.
> > In Lucene, Convert MeCab dictionary(in Lucene, some dat files) to FST and
> > use
> > But it can't satisfy all users.
> > Depending on the situation, some user may need a custom dictionary.
> > It is also same for Nori(Korean Analyzer) since Lucene 7.4. (The basic
> > logic(MeCab + FST) is similar to Japanese Analyzer)
> > The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> > dictionary size is 24MB.
> > If the user needs a dictionary of 100MB size, the user must build and use
> > it.
> > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> >
> > If anyone find some wrong information in my reply, please send a reply with
> > the correction.
> >
> > Thank you,
> > Namgyu Kim
> >
> >
> > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov wrote:
> >
> > > I'm trying to understand the relationship between the system and user
> > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > provide a user dictionary; the system one is built in. Are they
> > > otherwise the same kind of thing? If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict? I hunted for documentation, but
> > > didn't find answers to these questions, and the code is pretty
> > > involved, so any pointers would be greatly appreciated.
> > >
> > > -Mike
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
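
For reference, the lookup order Tomoko describes (user dictionary first,
system dictionary only when the user dictionary has no match at that
position) can be sketched roughly as below. This is a toy model and not
Lucene code: the class UserFirstLookup, the method candidatesAt, and the
plain maps standing in for the two FSTs are all hypothetical, and the
romanized entries are made up for illustration. It only shows the
"any user match suppresses all system candidates" rule from the thread,
not the Viterbi cost comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the user-first lookup described in this thread.
// Hypothetical names throughout; plain maps stand in for Lucene's FSTs.
final class UserFirstLookup {
    private final Map<String, Integer> userDict;   // surface form -> word cost
    private final Map<String, Integer> systemDict; // surface form -> word cost

    UserFirstLookup(Map<String, Integer> userDict, Map<String, Integer> systemDict) {
        this.userDict = userDict;
        this.systemDict = systemDict;
    }

    // Candidates that start at pos, tagged with the dictionary they came from.
    List<String> candidatesAt(String text, int pos) {
        List<String> fromUser = prefixMatches(userDict, "user", text, pos);
        if (!fromUser.isEmpty()) {
            // Any user match hides every system candidate at this position.
            return fromUser;
        }
        return prefixMatches(systemDict, "system", text, pos);
    }

    // All dictionary entries that are prefixes of text.substring(pos), shortest first.
    private static List<String> prefixMatches(
            Map<String, Integer> dict, String source, String text, int pos) {
        List<String> out = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String surface = text.substring(pos, end);
            if (dict.containsKey(surface)) {
                out.add(source + ":" + surface);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> user = Map.of("kansai", 0);
        Map<String, Integer> system =
            Map.of("kansai", 1, "kokusai", 1, "kuukou", 1, "kansaikokusaikuukou", 5);
        UserFirstLookup lookup = new UserFirstLookup(user, system);
        String text = "kansaikokusaikuukou";
        System.out.println(lookup.candidatesAt(text, 0)); // [user:kansai]
        System.out.println(lookup.candidatesAt(text, 6)); // [system:kokusai]
    }
}
```

Note that at position 0 the longer system entry never becomes a
candidate once the user dictionary matches, which is the "a bit too
aggressive" behaviour mentioned above. Whether a high-cost user entry
would effectively remove a token is not something this sketch answers.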