From: Walter Underwood
To: solr-user@lucene.apache.org
Subject: Re: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
Date: Wed, 13 Mar 2013 11:09:15 -0700

Yeah, the Ultraseek highlighter did not highlight standalone stopwords. It did highlight stopwords in phrases. That is the "vitamin a" test.

wunder

On Mar 13, 2013, at 8:55 AM, Burgmans, Tom wrote:

> The main reason for using stopwords is to speed up query performance, since we see that a huge part of the time is consumed by highlighting stopwords. Also, when reading the full highlighted document, we think the document is more readable when only meaningful words are highlighted.
>
> For searching, in fact, I like to keep stopwords...
>
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Wednesday 13 March 2013 04:43
> To: solr-user@lucene.apache.org
> Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
> Importance: Low
>
> Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.
>
> Removing stopwords was a hack developed for 16-bit computers and 40 megabyte disks. We don't need to do that any more.
>
> wunder
>
> On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:
>
>> I would merge stop_en.txt and stop_fr.txt. Use the same set of stop words for all fields that you search on.
>>
>> You might find this useful: http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> --- On Wed, 3/13/13, Burgmans, Tom wrote:
>>
>>> From: Burgmans, Tom
>>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>>> To: "solr-user@lucene.apache.org"
>>> Date: Wednesday, March 13, 2013, 5:22 PM
>>> Hi group,
>>>
>>> Background:
>>> I have a collection containing English and French documents.
>>> I made sure to index the English content in field "body" (fieldType=text_en) and the French content in field "body_fr" (fieldType=text_fr).
>>>
>>> The user could be either English or French, so the goal is to execute the queries against both fields simultaneously without knowing the query language upfront. The query is analyzed differently for each field. For both fields a stopFilter is configured, each with its own list of stopwords (different per language).
>>>
>>> The issue:
>>> When I search for 'a result' (without single quotes) in fields "body" and "body_fr" at the same time, "a" is considered a stopword in English and removed for field "body", but not in French, so both terms are still searched inside "body_fr". What happens is that the query is parsed (edismax) into this construction:
>>>
>>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>>
>>> This query returns only French documents, although there are many English documents in the index that contain the term 'result' as well. How can that happen? I think it is related to the way my query is parsed: there seems to be an AND-relationship between (body_fr:a) and (body:result | body_fr:result). There is no English document that has (body_fr:a), so that's why they don't show up. For me a much more logical parsed query would be:
>>>
>>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>>
>>> How should I interpret this? Is it a bug in edismax? Is it intended and if yes: why?
>>>
>>> Thanks for any hint,
>>> Tom
>>>
>>> This email and any attachments may contain confidential or privileged information and is intended for the addressee only.
>>> If you are not the intended recipient, please immediately notify us by email or telephone and delete the original email and attachments without using, disseminating or reproducing its contents to anyone other than the intended recipient. Wolters Kluwer shall not be liable for the incorrect or incomplete transmission of this email or any attachments, nor for unauthorized use by its employees.
>>>
>>> Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The Netherlands, and is registered with the Trade Registry of the Dutch Chamber of Commerce under number 33202517.
>>>
>
> --
> Walter Underwood
> wunder@wunderwood.org

--
Walter Underwood
wunder@wunderwood.org
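
For anyone reading this thread in the archive, here is a rough Python sketch of the behaviour Tom describes in the quoted message. It is a toy model, not Solr code. The stopword lists, the two documents, and the rule that every clause must match (Tom's "AND-relationship") are all assumptions made for illustration. It builds one clause per query term, keeps only the fields whose stopword list does not drop that term, and then shows why an English-only document cannot satisfy the clause for "a". It also shows what changes with a merged stopword list, as Ahmet suggested.

```python
# Toy model of the parse discussed in this thread. NOT Solr code: the
# stopword lists, documents and "every clause must match" rule are assumptions.

PER_LANGUAGE = {
    "body": {"a", "the", "of"},      # hypothetical stand-in for stop_en.txt
    "body_fr": {"le", "la", "de"},   # hypothetical stand-in for stop_fr.txt
}

# Ahmet's suggestion: one merged stopword set for every searched field.
_merged = set().union(*PER_LANGUAGE.values())
MERGED = {field: _merged for field in PER_LANGUAGE}

# Hypothetical documents: one indexed only in "body", one only in "body_fr".
ENGLISH_DOC = {"body": "the result of a test"}
FRENCH_DOC = {"body_fr": "il y a un result set"}


def parse(query, stopwords):
    """Build one clause per query term. A clause lists (field, term) pairs
    for the fields whose stopword set keeps the term."""
    clauses = []
    for term in query.lower().split():
        alternatives = [
            (field, term)
            for field, stops in stopwords.items()
            if term not in stops
        ]
        if alternatives:  # a term dropped by every field vanishes entirely
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """A document matches only if every clause hits in at least one field."""
    return all(
        any(term in doc.get(field, "").lower().split() for field, term in clause)
        for clause in clauses
    )


def render(clauses):
    """Format clauses in the same style as the parsed query in the thread."""
    parts = [
        "(" + " | ".join(f"{field}:{term}" for field, term in clause) + ")~1.0"
        for clause in clauses
    ]
    return "(" + " ".join(parts) + ")"


if __name__ == "__main__":
    for label, stopwords in (("per-language", PER_LANGUAGE), ("merged", MERGED)):
        clauses = parse("a result", stopwords)
        print(label, render(clauses))
        print("  English doc matches:", matches(ENGLISH_DOC, clauses))
        print("  French doc matches: ", matches(FRENCH_DOC, clauses))
```

With per-language lists, the clause for "a" survives only for body_fr, so a document with no body_fr content cannot match. With a merged list, or with no stopwords at all, the clause for each term has the same shape in both fields and the problem goes away in this model.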