From: Emir Arnautović
To: solr-user@lucene.apache.org
Subject: Re: Advice on Stemming in Solr
Date: Thu, 2 Nov 2017 10:46:35 +0100

Hi Edwin,

It seems that it would be best not to apply the *ing stemming rule at
all. The first idea is to trick the stemmer by replacing the trailing
"ing" of any word with some nonexistent character combination, e.g.
‘wqx’. You can use solr.PatternReplaceFilterFactory to do that. You can
switch it back after stemming if you want to have the proper token in
the index.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 2 Nov 2017, at 03:23, Zheng Lin Edwin Yeo wrote:
>
> Hi Emir,
>
> We do have quite a lot of words that should not be stemmed. Currently,
> the KStemFilterFactory is stemming all the non-English words that end
> with "ing" as well. There are quite a lot of places and names which end
> in "ing", and all these are being stemmed as well, which leads to an
> inaccurate search.
>
> Regards,
> Edwin
>
>
> On 1 November 2017 at 18:20, Emir Arnautović wrote:
>
>> Hi Edwin,
>> If the number of words that should not be stemmed is not high, you
>> could use KeywordMarkerFilterFactory to flag those words as keywords,
>> and it should prevent the stemmer from changing them.
>> Depending on what you want to achieve, you might not be able to avoid
>> using a stemmer at indexing time. If you want to find documents that
>> contain only “walking” with the search term “walk”, then you have to
>> stem at index time. Cases where you use stemming at query time only
>> are rare and specific.
>> If you want to prefer exact matches over stemmed matches, you have to
>> index the same content with and without stemming and boost matches on
>> the field without stemming.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 1 Nov 2017, at 10:11, Zheng Lin Edwin Yeo wrote:
>>>
>>> Hi,
>>>
>>> We are currently using KStemFilterFactory in Solr, but we found that
>>> it is actually doing stemming on non-English words like "ximenting",
>>> which it stems to "ximent". This is not what we wanted.
>>>
>>> Another option is to use the HunspellStemFilterFactory, but there are
>>> some English words like "running" and "walking" that are not being
>>> stemmed.
>>>
>>> I would like to check: is it advisable to use stemming at index time?
>>> Or should we not stem at index time but, at query time, search for
>>> the stemmed words as well? For example, if the user searches for
>>> "walking", we would also search for "walk", with the actual word
>>> "walking" given a higher weight.
>>>
>>> I'm currently using Solr 6.5.1.
>>>
>>> Regards,
>>> Edwin
>>
>>
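
A rough illustration of the placeholder trick from the most recent reply in this thread (hide a trailing "ing" before stemming, restore it afterwards) might look like the Python sketch below. It only simulates the idea outside Solr. `toy_stem` is a made-up stand-in for a real stemmer such as KStem, and the `protect`, `restore` and `analyze` helpers are hypothetical names. In Solr itself the two regex steps would presumably be solr.PatternReplaceFilterFactory filters placed before and after the stemmer in the analyzer chain.

```python
import re

# Marker suggested in the thread; any character sequence that does not
# occur at the end of real tokens would do (assumption).
PLACEHOLDER = "wqx"


def toy_stem(token: str) -> str:
    """Toy stand-in for a real stemmer (NOT KStem): strips 'ing' or a plural 's'."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def protect(token: str) -> str:
    """Hide a trailing 'ing' from the stemmer by swapping in the placeholder."""
    return re.sub(r"ing$", PLACEHOLDER, token)


def restore(token: str) -> str:
    """Put the original 'ing' ending back after stemming."""
    return re.sub(PLACEHOLDER + r"$", "ing", token)


def analyze(text: str) -> list[str]:
    """Lowercase, split on whitespace, then protect -> stem -> restore each token."""
    return [restore(toy_stem(protect(t))) for t in text.lower().split()]


if __name__ == "__main__":
    print(analyze("Ximenting walking cats"))
```

With this chain the other stemming rules still apply (the toy stemmer still reduces "cats" to "cat"), but no token ending in "ing" is altered. The cost is that a query for "walk" no longer matches "walking", which is what not applying the *ing rule at all implies.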