lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject Re: WordDelimiter filter, expanding to multiple words, unexpected results
Date Tue, 30 Dec 2014 17:33:13 GMT
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel 
like I'm having a very hard time getting straight answers out of you 
all, here.

I want the query "mixedCase" to match both/either "mixed Case" and 
"mixedCase" in the index.

What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's 
something I want it to do and thought it was doing and found out it 
wasn't. So we can isolate/simplify to there -- if I can figure out what 
WDF configuration (if any?) can do that first, then I can always move on 
to figuring out how/if that impacts the other things I want WDF to do.

So is there a WDF configuration that can do that? Or is the problem that 
it's confusing, and none of you are sure whether there is one or what it 
would be?

Jonathan

On 12/30/14 12:02 PM, Jack Krupansky wrote:
> I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
>
> You're not "wrong" about anything here... you just need to accept that WDF
> is not magic and can't handle every use case that anybody can imagine.
>
> And you do need to be careful about interactions between the query parser
> and the analyzers, especially in these kinds of cases where a single term
> might generate multiple terms.
>
> Some of these features really are only suitable for advanced, "expert"
> users.
>
> Note that one of the features that Solr is missing is support for the
> Google-like feature of splitting concatenated words (regardless of case.)
> That's worthy of a Jira.
>
>
> -- Jack Krupansky
>
> On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind <rochkind@jhu.edu>
> wrote:
>
>> I guess I don't understand what the four use cases are, or the three out
>> of four use cases, or whatever. What the intended uses of the WDF are.
>>
>> Can you explain what the intended use of setting:
>>
>> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"
>>
>> Is that supposed to do something useful (at either query or index time),
>> or is that a nonsensical configuration that nobody should ever use?
>>
>> I understand how analysis can be different at index vs query time. I think
>> what I don't fully understand is what the possibilities and intended use
>> case of the WDF are, with various configurations.
>>
>> I thought one of the intended use cases, with appropriate configuration,
>> was to do what I'm talking about: allow "mixedCase" query to match both "mixed
>> Case" and "mixedCase" in the index. I think you're saying I'm wrong, and
>> this is not something WDF can do? Can you confirm I understand you right?
>>
>> Thanks!
>>
>> Jonathan
>>
>>
>> On 12/30/14 11:30 AM, Jack Krupansky wrote:
>>
>>> Right, that's what I meant by WDF not being "magic" - you can configure it
>>> to match any three out of four use cases as you choose, but there is no
>>> choice that matches all of the use cases.
>>>
>>> To be clear, this is not a "bug" in WDF, but simply a limitation.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <rochkind@jhu.edu>
>>> wrote:
>>>
>>>> Thanks Erick!
>>>>
>>>> Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
>>>> query for "mixedCase" will no longer also match "mixed Case".
>>>>
>>>> I think I want WDF to... kind of do all of the above.
>>>>
>>>> Specifically, I had thought that it would allow a query for "mixedCase" to
>>>> match both/either "mixed Case" or "mixedCase" in the index. (with case
>>>> insensitivity on top of that via another filter).
>>>>
>>>> That would support things like names like "duBois" which are sometimes
>>>> spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
>>>> match both in the index.
>>>>
>>>> I had somehow thought that was what WDF was intended for. But it's
>>>> actually not the usual functioning, and may not be realistic?
>>>>
>>>> I'm a bit confused about what splitOnCaseChange combined with
>>>> catenateWords is meant to do at all.  It _is_ generating both the split and
>>>> single-word tokens at query time -- but not in a way that actually allows
>>>> it to match both the split and single-word tokens?  What is supposed to be
>>>> the purpose/use case for splitOnCaseChange with catenateWords? If any?
>>>>
>>>> Jonathan
>>>>
>>>>
>>>> On 12/29/14 7:20 PM, Erick Erickson wrote:
>>>>
>>>>> Jonathan:
>>>>>
>>>>> Well, it works if you set splitOnCaseChange="0" in just the query part
>>>>> of the analysis chain. I probably misled you a bit months ago, WDFF
>>>>> is intended for this case iff you expect the case change to generate
>>>>> _tokens_ that are individually meaningful. And unfortunately
>>>>> "significant" in one case will be not-significant in others.
>>>>>
>>>>> So what kinds of things do you want WDFF to handle? Case changes?
>>>>> Letter/non-letter transitions? All of the above?
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochkind@jhu.edu>
>>>>> wrote:
>>>>>
>>>>>> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>>>>>
>>>>>>
>>>>>>> WDF is powerful, but it is not magic. In general, the indexed data is
>>>>>>> expected to be clean while the query might be sloppy. You need to
>>>>>>> separate the index and query analyzers and they need to respect that
>>>>>>> distinction
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I do not understand what separate query/index analysis you are
>>>>>> suggesting to accomplish what I wanted.
>>>>>>
>>>>>> I understand the WDF, like all software, is not magic, of course. But I
>>>>>> thought this was an intended use case of the WDF, with those settings:
>>>>>>
>>>>>> A "mixedCase" query would match "mixedCase" in the index; and the same
>>>>>> query "mixedCase" would also match two separate words "mixed Case" in
>>>>>> index. (Case insensitively since I apply an ICUFoldingFilter on top of
>>>>>> that).
>>>>>>
>>>>>> Was I wrong, is this not an intended thing for the WDF to do? Or do I
>>>>>> just have the wrong configuration options for it to do it? Or is it a
>>>>>> bug?
>>>>>>
>>>>>> When I started this thread a few months ago, I think Erick Erickson
>>>>>> agreed this was an intended use case for the WDF, but maybe I explained
>>>>>> it poorly. Erick if you're around and want to at least confirm whether
>>>>>> WDF is supposed to do this in your understanding, that would be great!
>>>>>>
>>>>>> Jonathan
>>>>>>
>>>>>>
>>>>>
>>>
>
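
For reference, the one concrete setup named in this thread, Erick's
suggestion of splitOnCaseChange="0" on the query side only, might look
roughly like the field type below. This is an untested sketch: the field
type name "text_wdf_sketch" and the whitespace tokenizer are assumptions,
not anything stated in the thread, and solr.ICUFoldingFilterFactory stands
in for the ICUFoldingFilter mentioned above (it needs the analysis-extras
contrib on the classpath).

```xml
<!-- Hypothetical field type; the name and the tokenizer are assumptions. -->
<fieldType name="text_wdf_sketch" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: split on case change and also emit the catenated word,
       the settings quoted in the thread. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            splitOnCaseChange="1"/>
    <!-- Case folding applied after WDF, so WDF still sees the case change. -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query side: same chain, but with splitOnCaseChange turned off so a
       mixed-case query term stays a single token. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="1"
            splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

As noted earlier in the thread, with this setup a "mixedCase" query should
match "mixedCase" in the index but no longer "mixed Case", so it covers only
one of the two cases asked about. The Analysis screen in the Solr admin UI
is a quick way to check what each side actually emits.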
