Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: Walter Underwood <wunder@wunderwood.org>
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_CD821159-042F-4FC4-8DC1-2DC907F798B8"
Subject: Re: [SPAM]  Re: strange edismax parsing when searching in multiple
 fields (#TB)
Date: Wed, 13 Mar 2013 11:09:15 -0700
In-Reply-To: 
 <E189ABF0E31C424E98DF49EDFD92D25D046850A64E@EUSRVWK01005.eu.wkeurope.com>
To: solr-user@lucene.apache.org
References: <1363188487.98046.YahooMailClassic@web125304.mail.ne1.yahoo.com>
 <6BBC644A-E920-4411-B298-72737FDDAE6B@wunderwood.org>
 <E189ABF0E31C424E98DF49EDFD92D25D046850A64E@EUSRVWK01005.eu.wkeurope.com>
Message-Id: <F8E21573-684F-4CBD-962F-A846FAD1837C@wunderwood.org>

--Apple-Mail=_CD821159-042F-4FC4-8DC1-2DC907F798B8
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

Yeah, the Ultraseek highlighter did not highlight standalone stopwords.
It did highlight stopwords in phrases. That is the "vitamin a" test.

wunder
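
The highlighting rule described above can be sketched roughly as follows.
This is not Ultraseek's actual code, only a minimal illustration under
assumptions: the stopword list, the function name highlight_positions, and
the token-list inputs are all hypothetical.

```python
# Illustrative stopword list (an assumption, not Ultraseek's real list).
STOPWORDS = {"a", "an", "the", "of", "in"}


def highlight_positions(tokens, query_terms, query_phrases, stopwords=STOPWORDS):
    """Return the sorted indices of document tokens to highlight.

    Standalone query terms are highlighted only when they are not stopwords.
    Every token of an exact phrase match is highlighted, stopwords included.
    """
    tokens = [t.lower() for t in tokens]
    hits = set()

    # Standalone terms: drop stopwords before matching.
    wanted = {t.lower() for t in query_terms} - set(stopwords)
    for i, tok in enumerate(tokens):
        if tok in wanted:
            hits.add(i)

    # Phrases: a stopword inside a matched phrase is still highlighted.
    for phrase in query_phrases:
        words = [w.lower() for w in phrase]
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                hits.update(range(i, i + n))

    return sorted(hits)


doc = "a diet rich in vitamin a".split()
standalone = highlight_positions(doc, ["a", "diet"], [])
phrase = highlight_positions(doc, [], [["vitamin", "a"]])
```

With the standalone query, only "diet" is marked; with the phrase query
"vitamin a", both "vitamin" and the trailing "a" are marked.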

On Mar 13, 2013, at 8:55 AM, Burgmans, Tom wrote:

> The main reason for using stopwords is to speed up query performance,
> since we see that a huge part of the query time is consumed by
> highlighting stopwords. Also, when reading the full highlighted
> document, we think it is more readable when only meaningful words are
> highlighted.
>
> For searching, in fact, I would like to keep stopwords...
>
>
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Wednesday 13 March 2013 04:43
> To: solr-user@lucene.apache.org
> Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
> Importance: Low
>
> Or don't use stopwords. I haven't used stopwords for, oh, a dozen
> years or so.
>
> Removing stopwords was a hack developed for 16-bit computers and
> 40-megabyte disks. We don't need to do that any more.
>
> wunder
>
> On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:
>
>> I would merge stop_en.txt and stop_fr.txt. Use the same set of stop
>> words for all fields that you search on.
>>
>> You might find this useful:
>> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>>
>> --- On Wed, 3/13/13, Burgmans, Tom <tom.burgmans@wolterskluwer.com> wrote:
>>
>>> From: Burgmans, Tom <tom.burgmans@wolterskluwer.com>
>>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> Date: Wednesday, March 13, 2013, 5:22 PM
>>> Hi group,
>>>
>>> Background:
>>> I have a collection containing English and French documents.
>>> I made sure to index the English content in field "body"
>>> (fieldType=text_en) and the French content in field
>>> "body_fr" (fieldType=text_fr).
>>>
>>> The user could be either English or French, so the goal is to
>>> execute the queries against both fields simultaneously
>>> without knowing the query language up front. The query is
>>> analyzed differently for each field. Both fields have a
>>> stopFilter configured, each with its own list of stopwords
>>> (different per language).
>>>
>>> The issue:
>>> When I search for 'a result' (without single quotes) in
>>> field "body" and "body_fr" at the same time, then "a" is
>>> considered a stopword in English and removed for field
>>> "body", but not in French so both terms are still searched
>>> inside "body_fr". What happens is that the query is parsed
>>> (edismax) into this construction:
>>>
>>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>>
>>> This query returns only French documents, although there are
>>> many English documents in the index that contain the term
>>> 'result' as well. How can that happen? I think it is related
>>> to the way my query is parsed: there seems to be an
>>> AND-relationship between (body_fr:a) and (body:result |
>>> body_fr:result). No English document matches
>>> (body_fr:a), so that's why they don't show up. To me, a much
>>> more logical parsed query would be:
>>>
>>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>>
>>> How should I interpret this? Is it a bug in edismax? Or is it
>>> intended, and if so, why?
>>>
>>> Thanks for any hint,
>>> Tom
>>>
>>> This email and any attachments may contain confidential or
>>> privileged information
>>> and is intended for the addressee only. If you are not the
>>> intended recipient, please
>>> immediately notify us by email or telephone and delete the
>>> original email and attachments
>>> without using, disseminating or reproducing its contents to
>>> anyone other than the intended
>>> recipient. Wolters Kluwer shall not be liable for the
>>> incorrect or incomplete transmission of
>>> this email or any attachments, nor for unauthorized use
>>> by its employees.
>>>
>>> Wolters Kluwer nv has its registered address in Alphen aan
>>> den Rijn, The Netherlands, and is registered
>>> with the Trade Registry of the Dutch Chamber of Commerce
>>> under number 33202517.
>>>
>
> --
> Walter Underwood
> wunder@wunderwood.org

--
Walter Underwood
wunder@wunderwood.org
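
The parsing behaviour reported in the original question can be reproduced
with a small model. This is only a sketch, not Solr's edismax code: it
assumes one clause per query term, built from the fields whose stop filter
keeps the term, and it assumes every clause is required (as with mm=100% or
a default AND operator, which the original post suggests but does not
confirm). The stopword lists, the sample documents, and the names parse and
matches are hypothetical.

```python
# Hypothetical per-field stopword lists: "a" is a stopword in English only.
STOPWORDS = {
    "body": {"a", "the"},
    "body_fr": {"le", "la"},
}


def parse(query, fields=("body", "body_fr")):
    """Build one clause per query term.

    Each clause lists the (field, term) pairs that survive that field's
    stop filter. A term removed in every field produces no clause at all.
    """
    clauses = []
    for term in query.lower().split():
        alternatives = [(f, term) for f in fields if term not in STOPWORDS[f]]
        if alternatives:
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """A document matches if every clause is satisfied in at least one field."""
    return all(
        any(term in doc.get(field, ()) for field, term in clause)
        for clause in clauses
    )


clauses = parse("a result")
# The English document is indexed without "a" because of the stop filter.
english_doc = {"body": {"good", "result"}}
french_doc = {"body_fr": {"il", "a", "result"}}
english_hit = matches(english_doc, clauses)
french_hit = matches(french_doc, clauses)
```

In this model the first clause contains only body_fr:a, so an English
document can never satisfy it, while the French document satisfies both
clauses. Using the same stopword list for every searched field, or no
stopwords at all, makes each clause cover both fields again.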


--Apple-Mail=_CD821159-042F-4FC4-8DC1-2DC907F798B8--