Subject: Re: Support multiple language tokens in same field
To: solr-user@lucene.apache.org
From: Shawn Heisey
Date: Fri, 3 Aug 2018 07:21:42 -0600

On 8/3/2018 1:10 AM, Nitesh Kumar wrote:
> As I discussed above, in some special case, we have a situation where
> these fields ( field1, field2 etc..) value can be in *CJK *pattern. That
> means field1, field2 store plain *English *text or *CJK *text. Hence, in
> case of choosing *StandardTokenizer, *while indexing/query it works fine
> when it has to deal with plain *English text*, whereas in the case of *CJK
> text *it doesn't work appropriately.

We have one index where fields can contain both English and CJK. The
customer is in Japan. I designed it to work properly with all CJK
characters, not just Japanese.

This is the fieldType I came up with after a LOT of research:

https://apaste.info/Vfwf

I used a paste website because line wrapping within an email would have
made it difficult to copy. The paste expires in one month.

This analysis chain uses the ICU classes that are included as a contrib
module with Solr, as well as one custom jar:

https://github.com/sul-dlss/CJKFoldingFilter/blob/master/src/edu/stanford/lucene/analysis/CJKFoldingFilterFactory.java

Most of the information that was useful came from a series of blog
posts, which I used to create my schema. They can be found here:

http://discovery-grindstone.blogspot.com/2014/

Some people might find the ICUFoldingFilterFactory too aggressive. If
so, replace it with ASCIIFoldingFilterFactory and
ICUNormalizer2FilterFactory. That replacement is what we're actually
using -- the customer didn't want the kinds of matches that the ICU
folding class allowed.

Using edismax with an unusual value for the "mm" parameter might solve
some of your other issues. This is discussed in parts 8 and 12 of the
blog series.

I have one note for you about your analysis chain. I notice you have a
filter listed before the tokenizer. The order in which you list them
does not change the order in which they run: Solr will always run
CharFilter entries first, then the tokenizer, then Filter entries. The
ASCIIFoldingFilterFactory that you have listed first is in fact being
run second, after the tokenizer.

Thanks,
Shawn
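
For reference, here is a minimal sketch of the execution order described
above: CharFilter entries, then the tokenizer, then Filter entries. It is
not the fieldType from the paste. The name "text_cjk_sketch" and the
particular components chosen are assumptions made only for illustration,
and the ICU classes need the contrib module mentioned earlier.

```xml
<!-- Hypothetical fieldType, for illustration only. The name and the
     component choices are assumptions, not the contents of the paste. -->
<fieldType name="text_cjk_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. CharFilter entries run first, on the raw character stream.
         HTMLStripCharFilterFactory is here only to show the slot. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>

    <!-- 2. The tokenizer runs next, wherever it appears in this list. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>

    <!-- 3. Filter entries run last, in the order they are listed. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- If this folding is too aggressive, the alternative named in the
         message is solr.ASCIIFoldingFilterFactory plus
         solr.ICUNormalizer2FilterFactory in place of the next line. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Listing the elements in the same order Solr runs them makes the schema
easier to read, and the Analysis screen in the admin UI shows the output
of each stage for a sample value.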