lucene-solr-user mailing list archives

From Sandeep Khanzode <sandeep_khanz...@yahoo.com.INVALID>
Subject Re: Wildcard searches with space in TextField/StrField
Date Fri, 25 Nov 2016 13:59:00 GMT
Hi All,

Can someone please assist with this query?

My data consists of:
1.] John Doe
2.] John V. Doe
3.] Johnson Doe
4.] Johnson V. Doe
5.] John Smith
6.] Johnson V. Smith
7.] Matt Doe
8.] Matt V. Doe
9.] Matt Doe
10.] Matthew V. Doe
11.] Matthew Smith
12.] Matthew V. Smith

Querying ...
(a) Matt/Matt* should return records 7-12
(b) John/John* should return records 1-6
(c) Doe/Doe* should return records 1-4, 7-10
(d) Smith/Smith* should return records 5,6,11,12
(e) V/V./V.*/V* should return records 2,4,6,8,10,12
(f) V. Doe/V. Doe* should return records 2,4,8,10
(g) John V/John V./John V*/John V.* should return record 2
(h) V. Smith/V. Smith* should return records 6,12
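
One way to pin down the requirements in (a) through (h) is a small reference model that is independent of Solr: split each name on whitespace, lowercase it, treat every query token except the last as an exact match, and treat the last token as a prefix (a trailing * is optional). The sketch below is only an assumption about the intended semantics, not a description of how any Solr query parser behaves, and the function names are made up for illustration.

```python
# Reference model of the desired matching, not Solr behaviour.
NAMES = [
    "John Doe", "John V. Doe", "Johnson Doe", "Johnson V. Doe",
    "John Smith", "Johnson V. Smith", "Matt Doe", "Matt V. Doe",
    "Matt Doe", "Matthew V. Doe", "Matthew Smith", "Matthew V. Smith",
]


def tokens(text):
    """Lowercase and split on whitespace; punctuation stays attached."""
    return text.lower().split()


def matches(query, name):
    """True if the query tokens occur contiguously in the name, with the
    last query token treated as a prefix and the others as exact tokens."""
    q = [t.rstrip("*") for t in tokens(query)]
    if not q:
        return False
    *exact, prefix = q
    doc = tokens(name)
    for start in range(len(doc) - len(q) + 1):
        window = doc[start:start + len(q)]
        if window[:-1] == exact and window[-1].startswith(prefix):
            return True
    return False


def expected_records(query, names=NAMES):
    """Return the 1-based record numbers that the query should match."""
    return [i for i, name in enumerate(names, start=1) if matches(query, name)]


for query in ["Matt*", "John", "Doe", "V.", "V. Doe", "John V", "V. Smith*"]:
    print(query, "->", expected_records(query))
```

A model like this can serve as a checklist when comparing what a given field type and query parser actually return.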

Any guidance would be appreciated!
I have tried ComplexPhraseQueryParser, but with a single token like Doe*, there is an error
that indicates that the query is being identified as a prefix query. I may be missing something
in the syntax.
 SRK 

    On Thursday, November 24, 2016 11:16 PM, Sandeep Khanzode <sandeep_khanzode@yahoo.com.INVALID>
wrote:
 

 Hi All, Erick,
Please suggest. Would like to use the ComplexPhraseQueryParser for searching text (with wildcard)
that may contain special characters.
For example ...
John* should match John V. Doe
John* should match Johnson Smith
Bruce-Willis* should match Bruce-Willis
V.* should match John V. F. Doe
SRK 

    On Thursday, November 24, 2016 5:57 PM, Sandeep Khanzode <sandeep_khanzode@yahoo.com.INVALID>
wrote:
 

 Hi,
This is the typical TextField with ...
  <fieldType name="text123" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
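
As a rough illustration of what this analyzer chain puts in the index, the sketch below approximates StandardTokenizerFactory plus LowerCaseFilterFactory for plain ASCII names with a regular expression. It is an approximation only (the real tokenizer follows Unicode word-break rules), and `analyze` is a hypothetical helper name. The point is that punctuation such as the period in "V." and the hyphen in "Bruce-Willis" is not expected to survive as part of any indexed term.

```python
import re


def analyze(text):
    """Rough stand-in for StandardTokenizer + LowerCaseFilter.

    Assumption: for simple ASCII names, splitting on anything that is not
    a letter or digit is close enough to the real tokenizer.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


for name in ["John V. Doe", "Bruce-Willis", "John V. F. Doe"]:
    print(name, "->", analyze(name))
```

Under this assumption, a query term that still contains "." or "-" has nothing in the index to match against unless it is analyzed the same way.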



SRK 

    On Thursday, November 24, 2016 1:38 AM, Reth RM <reth.iksam@gmail.com> wrote:
 

 what is the fieldType of those records?  
On Tue, Nov 22, 2016 at 4:18 AM, Sandeep Khanzode <sandeep_khanzode@yahoo.com.invalid>
wrote:

Hi Erick,
I gave this a try. 
These are my results. There is a record with "John D. Smith", and another named "John Doe".

1.] {!complexphrase inOrder=true}name:"John D.*" ... does not fetch any results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches both results. 
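
A possible explanation for these two results, under two assumptions: the field is tokenized so that the period is dropped at index time, and a wildcard term is lowercased but not tokenized at query time. In that case "D.*" becomes the prefix "d.", which no indexed term starts with, while "D*" becomes the prefix "d", which matches both "d" and "doe". The sketch below models the term dictionary as a sorted list; `analyze`, `build_terms`, and `terms_with_prefix` are hypothetical names, and the regular expression is only an approximation of the real tokenizer.

```python
import re
from bisect import bisect_left


def analyze(text):
    """Approximate index-time analysis: drop punctuation, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_terms(docs):
    """Sorted, de-duplicated terms, standing in for a term dictionary."""
    return sorted({term for doc in docs for term in analyze(doc)})


def terms_with_prefix(terms, prefix):
    """Return indexed terms starting with the lowercased prefix.

    The prefix is deliberately NOT tokenized, so a '.' in it is kept.
    """
    prefix = prefix.lower()
    i = bisect_left(terms, prefix)
    found = []
    while i < len(terms) and terms[i].startswith(prefix):
        found.append(terms[i])
        i += 1
    return found


terms = build_terms(["John D. Smith", "John Doe"])
print(terms)
print("D.* ->", terms_with_prefix(terms, "D."))
print("D*  ->", terms_with_prefix(terms, "D"))
```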



Second observation: There is a record with "John D Smith"
1.] {!complexphrase inOrder=true}name:"John*" ... does not fetch any results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches that record. 

3.] {!complexphrase inOrder=true}name:"John D S*" ... fetches that record. 
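
To try queries like these outside the admin UI, the request can be built by hand. The sketch below URL-encodes the local-params query and turns on debugQuery so that the parsed query is visible in the response. The host, port, and core name (`localhost:8983`, `mycore`) are placeholders, not values from this thread, and the helper names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: host, port, and core name are assumptions.
SELECT_URL = "http://localhost:8983/solr/mycore/select"


def complexphrase_url(field, phrase, in_order=True, base_url=SELECT_URL):
    """Build a select URL for a complexphrase query on one field.

    The phrase is inserted as-is, so it must not contain double quotes.
    """
    in_order_flag = "true" if in_order else "false"
    q = '{!complexphrase inOrder=%s}%s:"%s"' % (in_order_flag, field, phrase)
    return base_url + "?" + urlencode({"q": q, "debugQuery": "true"})


def decoded_q(url):
    """Decode the q parameter again, to check what will be sent."""
    return parse_qs(urlsplit(url).query)["q"][0]


url = complexphrase_url("name", "John D*")
print(url)
print(decoded_q(url))
```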

SRK

    On Sunday, November 13, 2016 7:43 AM, Erick Erickson <erickerickson@gmail.com>
wrote:


 Right, for that kind of use case you want complexPhraseQueryParser,
see: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

Best,
Erick

On Sat, Nov 12, 2016 at 9:39 AM, Sandeep Khanzode
<sandeep_khanzode@yahoo.com> wrote:
> Thanks, Erick.
>
> I am actually not trying to use the String field (prefer a TextField here).
> But, in my comparisons with TextField, it seems that something like phrase
> matching with whitespace and wildcard (like, 'my do*' or say, 'my dog*', or
> say, 'my dog has*') can only be accomplished with a string type field,
> especially because, with a WhitespaceTokenizer in TextField, the space will
> be lost, and all tokens will be individually considered. Am I missing
> something?
>
> SRK
>
>
> On Friday, November 11, 2016 10:05 PM, Erick Erickson
> <erickerickson@gmail.com> wrote:
>
>
> You have to query text and string fields differently, that's just the
> way it works. The problem is getting the query string through the
> parser as a _single_ token or as multiple tokens.
>
> Let's say you have a string field with the "a b" example. You have a
> single token
> a b that starts at offset 0.
>
> But with a text field, you have two tokens,
> a at position 0
> b at position 1
>
> But when the query parser sees "a b" (without quotes) it splits it
> into two tokens, and only the text field has both tokens so the string
> field won't match.
>
> OTOH, when the query parser sees "a\ b" it passes this through as a
> single token, which only matches the string field as there's no
> _single_ token "a b" in the text field.
>
> But a more interesting question is why you want to search this way.
> String fields are intended for keywords, machine-generated IDs and the
> like. They're pretty useless for searching anything except
> 1> exact tokens
> 2> prefixes
>
> While if you have "my dog has fleas" in a string field, you _can_
> search "*dog*" and get a hit but the performance is poor when you get
> a large corpus. Performance for "my*" will be pretty good though.
>
> All in all, this sounds like an XY problem; what's the use case you're
> trying to solve?
>
> Best,
> Erick
>
>
>
> On Thu, Nov 10, 2016 at 10:11 PM, Sandeep Khanzode
> <sandeep_khanzode@yahoo.com.invalid> wrote:
>> Hi Erick, Reth,
>>
>> The 'a\ b*' as well as the q.op=AND approach worked (successfully) only
>> for StrField for me.
>>
>> Any attempt at creating a 'a\ b*' for a TextField does not match any
>> documents. The parsedQuery in debug mode does show 'field:a b*'. I am sure
>> there are documents that should match.
>> Another (maybe unrelated) observation is if I have 'field:a\ b', then the
>> parsedQuery is field:a field:b. Which does not match as expected (matches
>> individually).
>>
>> Can you please provide an example that I can use in Solr Query dashboard?
>> That will be helpful.
>>
>> I have also seen that wildcard queries work irrespective of field type,
>> i.e. StrField as well as TextField. That makes sense because a
>> WhitespaceTokenizer only creates word boundaries when we do not use an
>> EdgeNGramFilter. If I am not wrong, that is.
>> SRK
>>
>>    On Friday, November 11, 2016 5:00 AM, Erick Erickson
>> <erickerickson@gmail.com> wrote:
>>
>>
>>  You can escape the space with a backslash as  'a\ b*'
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 10, 2016 at 2:37 PM, Reth RM <reth.iksam@gmail.com> wrote:
>>> I don't think you can do wildcard on StrField. For text field, if your
>>> query is "category:(test m*)" the parsed query will be "category:test OR
>>> category:m*"
>>> You can add q.op=AND to make an AND between those terms.
>>>
>>> For phrase type wild card query support, as per docs, it
>>> is ComplexPhraseQueryParser that supports it. (I haven't tested it
>>> myself)
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
>>>
>>> On Thu, Nov 10, 2016 at 11:40 AM, Sandeep Khanzode
>>> <sandeep_khanzode@yahoo.com.invalid> wrote:
>>>
>>>> Hi,
>>>> How does a search like abc* work in a StrField? Since the entire value is
>>>> stored as a single token, is it a type of trie structure that allows such
>>>> wildcard matching?
>>>> How can searches with a space like 'a b*' be executed for text fields
>>>> (tokenized on whitespace)? If we specify this type of query, it is broken
>>>> down into two queries with field:a and field:b*. I would like them to be
>>>> contiguous, sort of like a phrase search with a wildcard.
>>>> SRK
>>
>>
>>
>
>


   



  

  

   