lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Differentiating user search term in Solr
Date Tue, 21 Apr 2015 01:01:06 GMT
I’ve been wanting a “free text” query parser for a while. We could build some cool stuff
on that: auto-phrasing, entity extraction and weighting, CJK tokenization, …

For reference, here are some real-world user queries I have needed to deal with. These have
exactly matched content.

* +/-
* .hack//Roots
* p=mv

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

On Apr 20, 2015, at 5:52 PM, Steven White <swhite4141@gmail.com> wrote:

> Hi Erick,
> 
> I think you missed my point.  My request is that Solr support a new URL
> parameter.  If this parameter is set, then EVERYTHING in q is treated as
> raw text (i.e., Solr will do the escaping instead of the client).
> 
> Thanks
> 
> Steve
> 
> On Mon, Apr 20, 2015 at 1:08 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
> 
>> How does that address the example query I gave?
>> 
>> q=field1:whatever AND (a AND field:b) OR (field2:c AND "d: is a letter
>> followed by a colon (:)").
>> 
>> bq: "Solr will treat everything in the search string by first passing
>> it to ClientUtils.escapeQueryChars()."
>> 
>> would incorrectly escape the colons after field1, field, field2 and
>> correctly escape the colon after d and in parens. And parens are a
>> reserved character too, so it would incorrectly escape _all_ the
>> parens except the ones surrounding the colon.
>> 
>> The list of reserved characters is pretty unchanging, so I don't think
>> it's too much to ask the app layer to do the escaping. The app layer
>> knows (at least it better know) which bits of the query were user
>> entered and what rules apply, such as whether the user can enter
>> field-qualified searches. Only armed with that knowledge can the right
>> thing be done, and Solr has no knowledge of those rules.
>> 
>> If you insist that the client shouldn't deal with that, you could
>> always write a custom component that enforces the rules that are
>> particular to your setup. For instance, you may have a rule that you
>> can never field-qualify any term, in which case escaping on the Solr
>> side would work in _your_ situation. But the general case just doesn't
>> fit into the "escape on the Solr side" paradigm.
>> 
>> Best,
>> Erick
>> 
>> 
>> On Mon, Apr 20, 2015 at 9:55 AM, Steven White <swhite4141@gmail.com>
>> wrote:
>>> Hi Erick,
>>> 
>>> I didn't know about ClientUtils.escapeQueryChars(); this is good to know.
>>> Unfortunately I cannot use it because it means I have to import Solr
>>> classes into my client application.  I want to avoid that and keep a
>>> loose coupling between my application and Solr (just rely on REST).
>>> 
>>> My suggestion is to add a new URL parameter to Solr, such as
>>> "q.ignoreOperators=[true | false]" (or some other name).  If this
>>> parameter is set to "false" or is missing, then the current behavior
>>> takes effect; if it is set to "true", then Solr will treat everything in
>>> the search string by first passing it to ClientUtils.escapeQueryChars().
>>> This way, the client application doesn't have to: a) be tightly coupled
>>> with Solr (required to link with Solr JARs to use escapeQueryChars), and
>>> b) keep up with Solr when new operators are added.
>>> 
>>> What do you think?
>>> 
>>> Steve
>>> 
>>> On Mon, Apr 20, 2015 at 12:41 PM, Erick Erickson <erickerickson@gmail.com>
>>> wrote:
>>> 
>>>> Steve:
>>>> 
>>>> In short, no. There's no good way for Solr to solve this problem in
>>>> the _general_ case. Well, actually we could create parsers with rules
>>>> like "if the colon is inside a paren, escape it", which would
>>>> completely break someone who wants to form queries like
>>>> 
>>>> q=field1:whatever AND (a AND field:b) OR (field2:c AND "d: is a letter
>>>> followed by a colon (:)").
>>>> 
>>>> You say: " A better solution would be to have Solr support a new
>>>> parameter that I can pass to Solr as part of the URL."
>>>> 
>>>> How would Solr know _which_ parts of the URL to escape in the case
>>>> above?
>>>> 
>>>> You have to do this at the app layer as that's the only place that has
>>>> a clue what the peculiarities of the situation are.
>>>> 
>>>> But if you're using SolrJ in your app layer, you can use
>>>> ClientUtils.escapeQueryChars() for user-entered data to do the
>>>> escaping without you having to maintain a separate list.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Mon, Apr 20, 2015 at 8:39 AM, Steven White <swhite4141@gmail.com>
>>>> wrote:
>>>>> Hi Shawn,
>>>>> 
>>>>> If the user types "title:(Apache: Solr Notes)" (without quotes) then I
>>>>> want Solr to treat the whole string as a raw text string, as if I had
>>>>> escaped ":", "(" and ")" and any other reserved Solr keywords / tokens.
>>>>> Using dismax it worked for the ":" case, but I still get a SyntaxError
>>>>> if I pass it the following "title:(Apache: Solr Notes) AND" (here is
>>>>> the full URL):
>>>>>
>>>>> http://localhost:8983/solr/db/select?q=title:(Apache:%20Solr%20Notes)%20AND&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND&defType=dismax&qf=title
>>>>> 
>>>>> So far, the only solution I can find is for my application to escape
>>>>> all Solr operators before sending the string to Solr.  This is fine,
>>>>> but it means my application will have to adapt to Solr's reserved
>>>>> operators as Solr grows (if Solr 5.x / 6.x adds a new operator, I have
>>>>> to add that to my application's escape list).  A better solution would
>>>>> be to have Solr support a new parameter that I can pass to Solr as part
>>>>> of the URL.  This parameter will tell Solr whether or not to do the
>>>>> escaping for me (missing means the same as don't do the escaping).
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Steve
>>>>> 
>>>>> On Mon, Apr 20, 2015 at 10:05 AM, Shawn Heisey <apache@elyograg.org>
>>>> wrote:
>>>>> 
>>>>>> On 4/20/2015 7:41 AM, Steven White wrote:
>>>>>>> In my application, a user types "Apache Solr Notes".  I take that
>>>>>>> text and send it over to Solr like so:
>>>>>>>
>>>>>>> http://localhost:8983/solr/db/select?q=title:(Apache%20Solr%20Notes)&fl=id%2Cscore%2Ctitle&wt=xml&indent=true&q.op=AND
>>>>>>>
>>>>>>> And I get a hit on "Apache Solr Release Notes".  This is all good.
>>>>>>>
>>>>>>> Now if the same user types "Apache: Solr Notes" (notice the ":" after
>>>>>>> "Apache") I will get a SyntaxError.  The fix is to escape ":" before I
>>>>>>> send it to Solr.  What I want to figure out is how I can tell Solr /
>>>>>>> Lucene to ignore ":" and escape it for me.  In this example, I used
>>>>>>> ":" but my need is for all other operators and reserved Solr / Lucene
>>>>>>> characters.
>>>>>> 
>>>>>> If we assume that what you did for the first query is what you will
>>>>>> do for the second query, then this is what you would have sent:
>>>>>>
>>>>>> q=title:(Apache: Solr Notes)
>>>>>>
>>>>>> How is the parser supposed to know that only the second colon should
>>>>>> be escaped, and not the first one?  If you escape them both (or treat
>>>>>> the entire query string as query text), then the fact that you are
>>>>>> searching the "title" field is lost.  The text "title" becomes an
>>>>>> actual part of the query, and may not match, depending on what you
>>>>>> have done with other parameters, such as the default operator.
>>>>>> 
>>>>>> If you use the dismax parser (*NOT* the edismax parser, which parses
>>>>>> field:value queries and boolean operator syntax just like the lucene
>>>>>> parser), you may be able to achieve what you're after.
>>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
>>>>>> https://wiki.apache.org/solr/DisMaxQParserPlugin
>>>>>> 
>>>>>> With dismax, you would use the qf and possibly the pf parameter to
>>>>>> tell it which fields to search and send this as the query:
>>>>>> 
>>>>>> q=Apache: Solr Notes
>>>>>> 
>>>>>> Thanks,
>>>>>> Shawn
>>>>>> 
>>>>>> 
>>>> 
>> 
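
For readers following the dismax suggestion quoted above, the request could be assembled roughly as follows. This is only a sketch: the base URL and the "title" field come from the examples in this thread, `dismax_select_url` is a hypothetical helper name, and the thread itself reports that dismax still raised a SyntaxError for input ending in "AND", so it should not be read as a guarantee that every input is accepted.

```python
from urllib.parse import urlencode

# Assumption: same host, core, and handler as the URLs quoted in this thread.
BASE_URL = "http://localhost:8983/solr/db/select"


def dismax_select_url(user_text: str, qf: str = "title") -> str:
    """Hypothetical helper: send the raw user text as q and name the
    fields to search through qf rather than a field prefix inside q."""
    params = {
        "q": user_text,        # user text, not field-qualified
        "defType": "dismax",   # select the dismax query parser
        "qf": qf,              # fields to search
        "fl": "id,score,title",
    }
    # urlencode percent-encodes reserved URL characters such as ':' and ','
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(dismax_select_url("Apache: Solr Notes"))
```

Note that URL encoding and query-syntax escaping are separate concerns: `urlencode` only makes the text safe to put in a URL, and says nothing about how the query parser will interpret it.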

