jackrabbit-users mailing list archives

From "H. Wilson" <wils...@randdss.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Tue, 31 Aug 2010 20:47:07 GMT

On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>
>> Given the following parameters in the repository:
>>
>>    .North.South.East.WestLand
>>    .North.South.East.West_Land
>>    .North.South.East.West Land    //yes that's a space
>>
>> The following exact name, case sensitive queries worked as expected for each
>> of the three parameters:
>>
>>    filter.orJCRExpression ("jcr:like(@" + srchField
>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.
> jcr:like does not depend on any analyser but on the stored field, so
> this is not strange that it still works.
I expected this too; I just try to be as thorough as possible when
posting anywhere. I am disappointed enough that I haven't figured this
out on my own.
>> The following exact name query, case insensitive, worked for only the
>> parameter with a fullName with a whitespace character:
>>
>>    filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>
>> The following exact name queries, case insensitive, stopped working for the
>> fullnames WITHOUT a whitespace character:
>>
>>    filter.addContains ( srchField,
>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>
>> Again, the only change I made was to the analyzer, I didn't remove my
>> "workaround" yet, and I just want to confirm I properly changed the analyzer
>> to figure out how the tokens were working. Oh I should note, the output from
>> the Analyzer only showed one Token per field, which I believe is what we
>> were looking for. Which leaves me as perplexed as before.
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>>    ...
>>
>>    public TokenStream tokenStream ( String field, final Reader reader  ) {
>>             System.out.println ("TOKEN STREAM for field: " + field);
>>             TokenStream keywordTokenStream = super.tokenStream (field,
>> reader);
>>
>>         //changed for testing
>>             TokenStream lowerCaseStream =  new LowerCaseFilter (
>> keywordTokenStream ) ;
>>             final Token reusableToken = new Token();
>>             try {
>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>                 while ( mytoken != null  ) {
>>                     System.out.println ("[" + mytoken.term() + "]");
>>                     mytoken = lowerCaseStream.next (mytoken);
>>                 }
>>                 //lowerCaseStream.reset();  //uncommenting this did not change results.
>>             }
>>             catch  (IOException ioe) {
>>                 System.err.println ("ERROR: " + ioe.toString());
>>             }
>>
> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
> on the keywordTokenStream before using it again.
>
> Regards Ard
>
>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>         }
>>
>>    ...
I was really excited when I saw your email this morning. However,
resetting keywordTokenStream as the last line in the "try" resulted in
no change. I also tried uncommenting the lowerCaseStream.reset line in
an act of desperation, with no difference. I must be missing something
completely obvious at this point... look at a problem too long and the
obvious fails to jump out at you...
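
One possible explanation, which this thread does not confirm: the debug loop drains the Reader that was passed in, and resetting the token stream does not rewind that Reader, so the returned filter sees no input. A minimal sketch of a way around that, using only java.io and no Lucene classes, is to buffer the field value into a String first and hand out a fresh StringReader for each consumer. The class name ReaderBuffering and the helper drain are hypothetical, made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, not part of Jackrabbit or Lucene.
public class ReaderBuffering {

    /** Reads everything left in the reader into a String. */
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stands in for the Reader handed to tokenStream(field, reader).
        Reader original = new StringReader(".North.South.East.West Land");

        // Buffer the value once, then give each consumer its own Reader.
        String value = drain(original);
        Reader forDebug = new StringReader(value);
        Reader forReturn = new StringReader(value);

        // Debug pass: consumes forDebug only.
        System.out.println("[" + drain(forDebug).toLowerCase() + "]");

        // The original Reader is exhausted after the first drain.
        System.out.println("original is empty: " + drain(original).isEmpty());

        // The second Reader is untouched and still carries the full value.
        System.out.println("[" + drain(forReturn) + "]");
    }
}
```

In the analyzer, the same idea would mean passing one fresh Reader to the debug token stream and another to the token stream that is actually returned, assuming the tokenizer reads directly from the Reader it is given.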

H. Wilson
>> Thanks.
>>
>> H. Wilson
>>
>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wilsonh@randdss.com>    wrote:
>>>>   Ard,
>>>>
>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>> think
>>>> I was too worn out from my week and too excited to have code that
>>>> "worked"
>>>> to notice the obvious... this must be a workaround. However, I will need
>>>> a
>>>> little guidance on how to inspect the tokens. I have Luke, but never
>>>> really
>>>> understood how to use it properly. Could you give me a clear list of
>>>> steps,
>>>> or point me to a resource I missed, on how I would go about inspecting
>>>> tokens during insert/search? Thanks.
>>> I'd just print them to your console with Token#term() or use a
>>> debugger . If you do that during indexing and searching, I think you
>>> must see some difference in the token that explains *why* Lucene
>>> doesn't find a hit for your usecase with spaces.
>>>
>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>> as the field value prefixing: It is unfortunate and not completely
>>> necessary any more but has some historical reasons from Lucene back in
>>> the days when it could not handle very many unique fieldnames
>>>
>>> Regards Ard
>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wilsonh@randdss.com>
>>>>>   wrote:
>>>>>> OK, well I got the spaces part figured out, and will post it for
>>>>>> anyone who needs it. Putting quotes around the spaces unfortunately
>>>>>> did not work.
>>>>>> During testing, I determined that if you performed the following
>>>>>> query for the exact fullName property:
>>>>>>
>>>>>>     filter.addContains ( @fullName,
>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>
>>>>>> It would return nothing. But tweak it a little and add a wildcard,
>>>>>> and it would return results:
>>>>>>
>>>>>>    filter.addContains ( @fullName,
>>>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>> This does not make sense...see below
>>>>>
>>>>>> But since I did not want to throw in wild cards where they might not
>>>>>> be wanted, if a search string contained spaces, did not contain wild
>>>>>> cards and the user was not concerned with case sensitivity, I used
>>>>>> the fn:lower-case. So I ended up with the following excerpt (our
>>>>>> clients wanted options for case sensitive and case insensitive
>>>>>> searching).
>>>>>>
>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>>         String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>
>>>>>>    .....
>>>>>>
>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>
>>>>>>        //jcr:like for case sensitive
>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>
>>>>>>    }
>>>>>>    else {
>>>>>>
>>>>>>        //only use fn:lower-case if there are spaces, with NO wild cards
>>>>>>
>>>>>>        if ( searchTerm.contains (" ") && !searchTerm.contains ("*") &&
>>>>>>                !searchTerm.contains ("?") ) {
>>>>>>
>>>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>        else {
>>>>>>
>>>>>>            //jcr:contains for case insensitive
>>>>>>            filter.addContains ( srchField,
>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>    }
>>>>> This seems to me a workaround around the real problem, because, it
>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>> indexing (just store something) and during searching: just search in
>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>> something with Text.escapeIllegalXpathSearchChars though it seems that
>>>>> it should leave spaces untouched
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>
>>>>>>    ....
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Hope that helps anyone who needs it.
>>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>>> as posted below and sticking to my previous examples, with the
>>>>>>>> addition of one with whitespace. With the following three in our
>>>>>>>> repository:
>>>>>>>>
>>>>>>>>    .North.South.East.WestLand
>>>>>>>>    .North.South.East.West_Land
>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>
>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>>> the first two return properly, but the last one yields no result.
>>>>>>>>
>>>>>>>>    filter.addContains(@fullName,
>>>>>>>>        '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am
>>>>>>> not sure. Perhaps you can put quotes around it, not sure if that
>>>>>>> works though
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>> creating one token, plus combined with escaping the Illegal
>>>>>>>> Characters (i.e. spaces), shouldn't this search work? Thanks again.
>>>>>>>>
>>>>>>>> H. Wilson
