lucene-solr-user mailing list archives

From Andrea Gazzarini <andrea.gazzar...@gmail.com>
Subject Re: Tokenization at query time
Date Tue, 13 Aug 2013 14:26:58 GMT
Trying...thank you very much!

I'll let you know

Best,
Andrea

On 08/13/2013 04:18 PM, Erick Erickson wrote:
> I think you can get what you want by escaping the space with a backslash....
>
> YMMV of course.
> Erick
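
[Editor's sketch] The escaping Erick suggests can be done client-side before the query string is sent; a minimal Python illustration (not Solr code):

```python
# Escape every space with a backslash so the dismax query parser keeps
# "mag 778 G 69" as a single token and hands it to the field's analysis
# chain in one piece, instead of splitting it on whitespace first.
def escape_spaces(query: str) -> str:
    return query.replace(" ", "\\ ")

print(escape_spaces("mag 778 G 69"))  # mag\ 778\ G\ 69
```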
>
>
> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
> andrea.gazzarini@gmail.com> wrote:
>
>> Hi Erick,
>> sorry if that wasn't clear: this is what I'm actually observing in my
>> application.
>>
>> I wrote the first post after looking at the explain (debugQuery=true): the
>> query
>>
>> q=mag 778 G 69
>>
>> is translated as follows:
>>
>>
>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>        DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>        DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>
>> It seems that although I declare myfield with this type
>>
>> <fieldtype name="type1" class="solr.TextField" >
>>      <analyzer>
>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>          <filter class="solr.LowerCaseFilterFactory" />
>>          <filter class="solr.WordDelimiterFilterFactory"
>>              generateWordParts="0" generateNumberParts="0"
>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>>              splitOnCaseChange="0" />
>>      </analyzer>
>> </fieldtype>
>>
>> Solr is therefore tokenizing it, producing several tokens
>> (mag, 778, g, 69).
>>
>> And I can't put double quotes around the query (q="mag 778 G 69") because the
>> request handler also searches other fields (with different analysis
>> chains).
>>
>> As I understand it, the query parser does a whitespace tokenization on
>> its own before invoking my query-time chain. The same doesn't happen at
>> index time... and this is my problem, because at index time the field is
>> analyzed exactly as I want, but unfortunately I can't say the same at
>> query time.
>>
>> Sorry for my English; did you get the point?
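
[Editor's sketch] The mismatch described above can be mimicked in Python. This is not Solr code, just an illustration of the assumed behavior: at index time the whole value passes through the keyword-based chain, while at query time the parser splits on unescaped whitespace first.

```python
import re

def analyze(text: str) -> str:
    # Rough stand-in for the chain above: KeywordTokenizer keeps the input
    # as one token, LowerCaseFilter lowercases it, and WordDelimiterFilter
    # with catenateAll="1" glues the alphanumeric subwords back together.
    return "".join(re.findall(r"[a-z0-9]+", text.lower()))

# Index time: the whole field value goes through the chain at once.
print(analyze("Mag. 778 G 69"))                      # mag778g69

# Query time: the parser splits on unescaped whitespace *before* the
# chain runs, so each fragment is analyzed separately and no fragment
# can ever equal the indexed token.
print([analyze(t) for t in "mag 778 G 69".split()])  # ['mag', '778', 'g', '69']
```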
>>
>>
>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>
>>> On a quick scan I don't see a problem here. Attach
>>> &debug=query to your url and that'll show you the
>>> parsed query, which will in turn show you what's been
>>> pushed through the analysis chain you've defined.
>>>
>>> You haven't stated whether you've tried this and it's
>>> not working or you're looking for guidance as to how
>>> to accomplish this so it's a little unclear how to
>>> respond.
>>>
>>> BTW, the admin/analysis page is your friend here....
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>> andrea.gazzarini@gmail.com> wrote:
>>>
Clear, thanks for the response.
>>>> So, if I have two fields
>>>>
>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>               generateWordParts="0" generateNumberParts="0"
>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>               splitOnCaseChange="0" />
>>>>       </analyzer>
>>>> </fieldtype>
>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>       <analyzer>
>>>>           <charFilter class="solr.MappingCharFilterFactory"
>>>>               mapping="mapping-FoldToASCII.txt"/>
>>>>           <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>           <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>       </analyzer>
>>>> </fieldtype>
>>>>
>>>> (with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while with
>>>> the second it ends up as several tokens)
>>>>
>>>> And I want to use the same request handler to query against both of them.
>>>> I mean, I want the user to search for something like
>>>>
>>>> http://..../search?q=Mag 78 D 99
>>>>
>>>> and this search should match against both the first field (with type1) and
>>>> the second (with type2), finding
>>>>
>>>> - a document whose field_with_type1 equals *mag78d99*, or
>>>> - a document whose field_with_type2 contains text like "go to
>>>> *mag 78*, class *d* and subclass *99*"
>>>>
>>>>
>>>> <requestHandler ....>
>>>>       ...
>>>>       <str name="defType">dismax</str>
>>>>       ...
>>>>       <str name="mm">100%</str>
>>>>       <str name="qf">
>>>>           field_with_type1
>>>>           field_with_type_2
>>>>       </str>
>>>>       ...
>>>> </requestHandler>
>>>>
>>>> Is this not possible? If not, is it possible to do it in some other way?
>>>>
>>>> Sorry for the long email and thanks again
>>>> Andrea
>>>>
>>>>
>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>
  Quoted phrases will be passed to the analyzer as one string, so a
>>>>> whitespace tokenizer is needed there.
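
[Editor's sketch] The contrast Jack describes can be illustrated in Python (assumed parser behavior, not actual Solr internals): a quoted phrase reaches the field analyzer as one string, while bare terms are split by the query parser first.

```python
def parser_units(query: str) -> list:
    # Quoted phrase: the whole content is handed to the analyzer once,
    # so a whitespace tokenizer inside the chain still has work to do.
    if query.startswith('"') and query.endswith('"') and len(query) > 1:
        return [query[1:-1]]
    # Bare terms: the parser splits on whitespace, one analysis call each.
    return query.split()

print(parser_units('mag 778 g 69'))    # ['mag', '778', 'g', '69']
print(parser_units('"mag 778 g 69"'))  # ['mag 778 g 69']
```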
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Tokenization at query time
>>>>>
>>>>> Hi Tanguy,
>>>>> thanks for the fast response. What you are saying corresponds perfectly with
>>>>> the behaviour I'm observing.
>>>>> Now, other than having a big problem (I have several other fields, both
>>>>> in pf and qf, where spaces don't matter: field types like the
>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>
>>>>> "The query parser splits the input query on white spaces, and then each
>>>>> token is analysed according to your configuration"
>>>>>
>>>>> Is there a valid reason to declare a WhitespaceTokenizer in a query
>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>> tokenized), what is its effect?
>>>>>
>>>>> Thank you very much for the help
>>>>> Andrea
>>>>>
>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>
>>>>>   Hello Andrea,
>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>> and query parsing in Lucene:
>>>>>> The query parser splits the input query on white spaces, and then each
>>>>>> token is analysed according to your configuration.
>>>>>> So those queries with a whitespace won't behave as expected, because each
>>>>>> token is analysed separately. Consequently, the catenated version of the
>>>>>> reference cannot be generated.
>>>>>> I think you could try surrounding your query with double quotes, or
>>>>>> escaping the space characters in your query using a backslash, so that the
>>>>>> whole sequence is analysed by the same analyser and the catenation occurs.
>>>>>> You should be aware that this approach has a drawback: you will probably
>>>>>> not be able to combine the search for Mag. 778 G 69 with other words in
>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>> escaped:
>>>>>> For example, if the input query is:
>>>>>> Awesome Mag. 778 G 69
>>>>>> you would want to transform it to:
>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>> or
>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>> query
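
[Editor's sketch] One way to automate the first transformation is to escape only the spaces inside a recognized reference. The pattern below is purely hypothetical; a real one would depend on what your references actually look like:

```python
import re

# Hypothetical pattern for shelf-mark-like references such as
# "Mag. 778 G 69": the literal "Mag." followed by space-separated parts.
REF = re.compile(r"Mag\.(?: \S+)+")

def escape_reference(query: str) -> str:
    # Backslash-escape the spaces inside the matched reference only,
    # leaving the rest of the query untouched.
    return REF.sub(lambda m: m.group(0).replace(" ", "\\ "), query)

print(escape_reference("Awesome Mag. 778 G 69"))  # Awesome Mag.\ 778\ G\ 69
```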
>>>>>>
>>>>>> Do you get the point?
>>>>>>
>>>>>> Look at the differences between what you tried and the following
>>>>>> examples which should all do what you want:
>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>> OR
>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>
>>>>>>
>>>>>>
>>>>>> I hope this helps
>>>>>>
>>>>>> Tanguy
>>>>>>
>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>> andrea.gazzarini@gmail.com> wrote:
>>>>>>
>>>>>>    Hi all,
>>>>>>
>>>>>>> I have a field (among others)in my schema defined like this:
>>>>>>>
>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>> positionIncrementGap="100">
>>>>>>>       <analyzer>
>>>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>               generateWordParts="0"
>>>>>>>               generateNumberParts="0"
>>>>>>>               catenateWords="0"
>>>>>>>               catenateNumbers="0"
>>>>>>>               catenateAll="1"
>>>>>>>               splitOnCaseChange="0" />
>>>>>>>       </analyzer>
>>>>>>> </fieldtype>
>>>>>>>
>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>
>>>>>>> Basically, both at index and query time the field value is normalized
>>>>>>> like this.
>>>>>>>
>>>>>>> Mag. 778 G 69 => mag778g69
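
[Editor's sketch] Stage by stage, the normalization above can be mimicked in Python (a sketch of the assumed filter behavior, not real Solr code):

```python
import re

value = "Mag. 778 G 69"

# 1. KeywordTokenizerFactory: the entire value becomes a single token.
token = value

# 2. LowerCaseFilterFactory: lowercase the token.
token = token.lower()                           # "mag. 778 g 69"

# 3. WordDelimiterFilterFactory with catenateAll="1" and all the
#    generate*/catenate-parts options off: split on delimiters and emit
#    only the catenated run of alphanumeric subwords.
token = "".join(re.findall(r"[a-z0-9]+", token))

print(token)  # mag778g69
```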
>>>>>>>
>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>
>>>>>>> <requestHandler ....>
>>>>>>>       ...
>>>>>>>       <str name="defType">dismax</str>
>>>>>>>       ...
>>>>>>>       <str name="mm">100%</str>
>>>>>>>       <str name="qf">myfield^3000</str>
>>>>>>>       <str name="pf">myfield^30000</str>
>>>>>>>
>>>>>>> </requestHandler>
>>>>>>>
>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>> field "Mag. 778 G 69", I will be able to get this document by querying
>>>>>>>
>>>>>>> 1. Mag. 778 G 69
>>>>>>> 2. mag 778 g69
>>>>>>> 3. mag778g69
>>>>>>>
>>>>>>> But that doesn't work: I'm able to get the document only if I
>>>>>>> use the "normalized" form: mag778g69
>>>>>>>
>>>>>>> After doing a little bit of debugging, I see that, even though I used a
>>>>>>> KeywordTokenizer in my field type declaration, Solr is doing something
>>>>>>> like this:
>>>>>>>
>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>
>>>>>>>
>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>> 69), and obviously querying the field for separate tokens doesn't match
>>>>>>> anything (at least this is what I think).
>>>>>>>
>>>>>>> Could anybody please explain this to me?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Andrea
>>>>>>>
>>>>>>>

