lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
Date Wed, 13 Mar 2013 18:09:15 GMT
Yeah, the Ultraseek highlighter did not highlight standalone stopwords. It did highlight stopwords
in phrases. That is the "vitamin a" test.

wunder
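
[Editorial note] The rule described above (skip standalone stopwords, but highlight them when they are part of a matched phrase) can be sketched in a few lines of Python. This is only a toy illustration, not Ultraseek or Solr code: the stopword list, the `highlight_positions` helper, and the sample sentence are all assumptions made for the example.

```python
# Toy highlighter: hypothetical helper, not Ultraseek or Solr code.
STOPWORDS = {"a", "an", "the", "of"}  # assumed stopword list


def highlight_positions(doc_tokens, query_terms, query_phrases):
    """Return the sorted token positions that should be highlighted.

    Standalone query terms are highlighted only if they are not stopwords;
    every token of a matched phrase is highlighted, stopwords included.
    """
    tokens = [t.lower() for t in doc_tokens]
    hits = set()

    # Standalone terms: drop stopwords before matching.
    wanted = {t.lower() for t in query_terms} - STOPWORDS
    for i, tok in enumerate(tokens):
        if tok in wanted:
            hits.add(i)

    # Phrases: match the exact token sequence and keep every position in it.
    for phrase in query_phrases:
        words = phrase.lower().split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                hits.update(range(i, i + n))

    return sorted(hits)


doc = "Take a vitamin a supplement".split()
standalone = highlight_positions(doc, ["vitamin", "a"], [])   # query: vitamin a
phrase = highlight_positions(doc, [], ["vitamin a"])          # query: "vitamin a"
print(standalone, phrase)
```

With the unquoted query only the position of "vitamin" is returned; with the quoted phrase the "a" that follows it is returned as well, while the earlier standalone "a" is still skipped.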

On Mar 13, 2013, at 8:55 AM, Burgmans, Tom wrote:

> The main reason for using stopwords is to speed up query performance, since we see that
> a huge part of the query time is consumed by highlighting stopwords. Also, when reading the
> full highlighted document, we think the document is more readable when only meaningful
> words are highlighted.
> 
> For searching, in fact, I would like to keep stopwords...
> 
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wunder@wunderwood.org]
> Sent: Wednesday 13 March 2013 04:43
> To: solr-user@lucene.apache.org
> Subject: [SPAM] Re: strange edismax parsing when searching in multiple fields (#TB)
> Importance: Low
> 
> Or don't use stopwords. I haven't used stopwords for, oh, a dozen years or so.
> 
> Removing stopwords was a hack developed for 16-bit computers and 40 megabyte disks. We
> don't need to do that any more.
> 
> wunder
> 
> On Mar 13, 2013, at 8:28 AM, Ahmet Arslan wrote:
> 
>> I would merge stop_en.txt and stop_fr.txt. Use the same set of stop words for all fields
>> that you search on.
>> 
>> You might find this useful: http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>> 
>> --- On Wed, 3/13/13, Burgmans, Tom <tom.burgmans@wolterskluwer.com> wrote:
>> 
>>> From: Burgmans, Tom <tom.burgmans@wolterskluwer.com>
>>> Subject: strange edismax parsing when searching in multiple fields (#TB)
>>> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> Date: Wednesday, March 13, 2013, 5:22 PM
>>> Hi group,
>>> 
>>> Background:
>>> I have a collection containing English and French documents.
>>> I made sure to index the English content in field "body"
>>> (fieldType=text_en) and the French content in field
>>> "body_fr" (fieldType=text_fr).
>>> 
>>> The user could be either English or French, so the goal is to
>>> execute the queries against both fields simultaneously
>>> without knowing the query language up front. The query is
>>> analyzed differently for each field. For both fields a
>>> stopFilter is configured, each with its own list of stopwords
>>> (different per language).
>>> 
>>> The issue:
>>> When I search for 'a result' (without single quotes) in
>>> field "body" and "body_fr" at the same time, then "a" is
>>> considered a stopword in English and removed for field
>>> "body", but not in French so both terms are still searched
>>> inside "body_fr". What happens is that the query is parsed
>>> (edismax) into this construction:
>>> 
>>> ((body_fr:a)~1.0 (body:result | body_fr:result)~1.0)
>>> 
>>> This query returns only French documents, although there are
>>> many English documents in the index that contain the term
>>> 'result' as well. How can that happen? I think it is related
>>> to the way my query is parsed: there seems to be an
>>> AND-relationship between (body_fr:a) and (body:result |
>>> body_fr:result). There is no English document that has
>>> (body_fr:a), so that's why they don't show up. For me a much
>>> more logical parsed query would be:
>>> 
>>> ((body:result)~1.0 | (body_fr:a body_fr:result)~1.0)
>>> 
>>> How should I interpret this? Is it a bug in edismax? Is it
>>> intended, and if so, why?
>>> 
>>> Thanks for any hint,
>>> Tom
>>> 
>>> This email and any attachments may contain confidential or
>>> privileged information
>>> and is intended for the addressee only. If you are not the
>>> intended recipient, please
>>> immediately notify us by email or telephone and delete the
>>> original email and attachments
>>> without using, disseminating or reproducing its contents to
>>> anyone other than the intended
>>> recipient. Wolters Kluwer shall not be liable for the
>>> incorrect or incomplete transmission
>>> of this email or any attachments, nor for unauthorized use
>>> by its employees.
>>> 
>>> Wolters Kluwer nv has its registered address in Alphen aan
>>> den Rijn, The Netherlands, and is registered
>>> with the Trade Registry of the Dutch Chamber of Commerce
>>> under number 33202517.
>>> 
> 
> --
> Walter Underwood
> wunder@wunderwood.org
> 

--
Walter Underwood
wunder@wunderwood.org
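
[Editorial note] The parsed query quoted in the original question can be read as one clause per query term, where each clause is a disjunction over the fields in which that term survived analysis. The Python sketch below is a toy model of that reading, not Solr code. The stopword sets, the `parse` and `matches` helpers, and the sample documents are hypothetical, and the model assumes that every clause must match, which is what the behaviour reported in the question suggests.

```python
# Toy model of the per-term query structure described in the question.
# Not Solr code: stopword lists, helper names, and documents are hypothetical.

STOPWORDS = {
    "body": {"a", "the", "of"},      # assumed English stopwords
    "body_fr": {"le", "la", "de"},   # assumed French stopwords ("a" is not one)
}


def parse(query, fields):
    """Build one clause per query term. Each clause lists the (field, term)
    pairs for the fields whose stopword filter kept the term."""
    clauses = []
    for term in query.lower().split():
        alternatives = [(f, term) for f in fields if term not in STOPWORDS[f]]
        if alternatives:  # a term removed in every field yields no clause
            clauses.append(alternatives)
    return clauses


def matches(doc, clauses):
    """Assume every clause is required: each needs a hit in at least one field."""
    return all(
        any(term in doc.get(field, "").lower().split() for field, term in clause)
        for clause in clauses
    )


clauses = parse("a result", ["body", "body_fr"])
english_doc = {"body": "the result of the test"}
french_doc = {"body_fr": "il a un result"}  # contrived text containing both terms

print(clauses)
print(matches(english_doc, clauses), matches(french_doc, clauses))
```

In this model the English document fails on the first clause, because "a" is only ever looked up in body_fr, even though it satisfies the second clause. Using the same stopword set for both fields, or none at all, makes the "a" clause either disappear everywhere or appear in both fields, which is the effect of the two suggestions made in the thread.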



