jackrabbit-users mailing list archives

From "H. Wilson" <wils...@randdss.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Mon, 30 Aug 2010 16:46:13 GMT
  Ard,

I don't have exact results for you yet because I am seeing unexpected 
behavior and I would like to make sure I am using the TokenStream/Token 
classes correctly. To verify the Tokens, I modified JUST my custom 
analyzer to look like below. I did NOT remove my "workaround" code, as I 
wanted to make sure that adding this code to the analyzer would allow it 
to compile, run and still behave as it did before. However, when I added 
_just_ this code, my results changed so strangely that it led me to 
believe I was using TokenStream incorrectly. Can you confirm? Previously 
all my test queries worked; after adding only the code below to the 
analyzer, I saw the following:

Given the following parameters in the repository:

    .North.South.East.WestLand
    .North.South.East.West_Land
    .North.South.East.West Land    //yes that's a space

The following exact-name, case-sensitive query worked as expected for 
each of the three parameters:

    filter.orJCRExpression ("jcr:like(@" + srchField + ",'" + Text.escapeIllegalXpathSearchChars (searchTerm) + "')");  //case sensitive

The following exact-name query, case-insensitive, worked only for the 
parameter whose fullName contains a whitespace character:

    filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");

The following exact-name query, case-insensitive, stopped working for 
the fullName values WITHOUT a whitespace character:

    filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));
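For reference, here is what those two exact-match expressions expand to 
for the whitespace value. This is just a plain-string sketch: I inline 
the literal search term and skip the escaping call, on the assumption 
(as you noted earlier) that escapeIllegalXpathSearchChars leaves spaces 
untouched:

```java
public class XPathExpressionSketch {
    public static void main(String[] args) {
        String srchField = "fullName";
        String searchTerm = ".North.South.East.West Land";

        // Case-sensitive exact match (the jcr:like variant).
        String likeExpr = "jcr:like(@" + srchField + ",'" + searchTerm + "')";

        // Case-insensitive exact match (the fn:lower-case variant).
        String lowerExpr = "fn:lower-case(@" + srchField + ") = '"
                + searchTerm.toLowerCase() + "'";

        System.out.println(likeExpr);
        // jcr:like(@fullName,'.North.South.East.West Land')
        System.out.println(lowerExpr);
        // fn:lower-case(@fullName) = '.north.south.east.west land'
    }
}
```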

Again, the only change I made was to the analyzer; I haven't removed my 
"workaround" yet, and I just want to confirm I changed the analyzer 
correctly so I can see how the tokens are produced. I should also note 
that the output from the analyzer showed only one Token per field, which 
I believe is what we were looking for. That leaves me as perplexed as 
before.

LowerCaseKeywordAnalyzer.java:

    ...

    public TokenStream tokenStream ( String field, final Reader reader ) {
        System.out.println ("TOKEN STREAM for field: " + field);
        TokenStream keywordTokenStream = super.tokenStream (field, reader);

        //changed for testing
        TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream );
        final Token reusableToken = new Token();
        try {
            Token mytoken = lowerCaseStream.next (reusableToken);
            while ( mytoken != null ) {
                System.out.println ("[" + mytoken.term() + "]");
                mytoken = lowerCaseStream.next (mytoken);
            }
            //lowerCaseStream.reset();  //uncommenting this did not change results.
        }
        catch (IOException ioe) {
            System.err.println ("ERROR: " + ioe.toString());
        }

        return new LowerCaseFilter ( keywordTokenStream );
    }

    ...
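One thing I'm starting to suspect (please correct me if this is wrong): 
the debug loop above drains the stream that is backed by the single 
Reader, and the LowerCaseFilter I return afterwards wraps that same, 
already-consumed stream. Here is a stdlib-only sketch of that suspicion, 
with a plain Reader standing in for the field content (no Lucene classes 
involved):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class DrainedReaderDemo {
    // Read everything available from the Reader, as the debug loop
    // effectively does when it walks the token stream to print terms.
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader(".North.South.East.West Land");

        String firstPass = drain(reader);   // what the debug loop sees
        String secondPass = drain(reader);  // what anything reading afterwards sees

        System.out.println("first:  [" + firstPass + "]");
        // first:  [.North.South.East.West Land]
        System.out.println("second: [" + secondPass + "]");
        // second: []
    }
}
```

If that is what is happening, the stream I return would produce no 
tokens for any field I printed, which might explain why my results 
changed so strangely after adding only the debug code.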

Thanks.

H. Wilson

On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wilsonh@randdss.com>  wrote:
>>   Ard,
>>
>> You are absolutely right... and this didn't make sense to me either. I think
>> I was too worn out from my week and too excited to have code that "worked"
>> to notice the obvious... this must be a workaround. However, I will need a
>> little guidance on how to inspect the tokens. I have Luke, but never really
>> understood how to use it properly. Could you give me a clear list of steps,
>> or point me to a resource I missed, on how I would go about inspecting
>> tokens during insert/search? Thanks.
> I'd just print them to your console with Token#term() or use a
> debugger . If you do that during indexing and searching, I think you
> must see some difference in the token that explains *why* Lucene
> doesn't find a hit for your usecase with spaces.
>
> Luke is hard to use for the multi-index jackrabbit indexing, as well
> as the field value prefixing: It is unfortunate and not completely
> necessary any more but has some historical reasons from Lucene back in
> the days when it could not handle very many unique fieldnames
>
> Regards Ard
>
>> H. Wilson
>>
>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>> Hello,
>>>
>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wilsonh@randdss.com>    wrote:
>>>>   OK, well I got the spaces part figured out, and will post it for anyone
>>>> who
>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>   During testing, I determined that if you performed the following query
>>>> for
>>>> the exact fullName property:
>>>>
>>>>     filter.addContains ( @fullName,
>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>
>>>> It would return nothing. But tweak it a little and add a wildcard, and it
>>>> would return results:
>>>>
>>>>    filter.addContains ( @fullName,
>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>> This does not make sense...see below
>>>
>>>> But since I did not want to throw in wild cards where they might not be
>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>> and
>>>> the user was not concerned with case sensitivity, I used the
>>>> fn:lower-case.
>>>> So I ended up with the following excerpt (our clients wanted options for
>>>> case sensitive and case insensitive searching) .
>>>>
>>>> public OurParameter[] getOurParameters (boolean
>>>> performCaseSensitiveSearch,
>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>> fullName
>>>>
>>>>    .....
>>>>
>>>>    if ( performCaseSensitiveSearch) {
>>>>
>>>>        //jcr:like for case sensitive
>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>
>>>>    }
>>>>    else {
>>>>
>>>>        //only use fn:lower-case if there is spaces, with NO wild cards
>>>>
>>>>        if ( searchTerm.contains (" ") && !searchTerm.contains ("*") &&
>>>>             !searchTerm.contains ("?") ) {
>>>>
>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>
>>>>        }
>>>>
>>>>        else {
>>>>
>>>>            //jcr:contains for case insensitive
>>>>            filter.addContains ( srchField,
>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>
>>>>        }
>>>>
>>>>    }
>>> This seems to me a workaround around the real problem, because, it
>>> just doesn't make sense to me. Can you inspect the tokens that are
>>> created by your analyser. Make sure you inspect the tokens during
>>> indexing (just store something) and during searching: just search in
>>> the property. I am quite sure you'll see the issue then. Perhaps
>>> something with Text.escapeIllegalXpathSearchChars though it seems that
>>> it should leave spaces untouched
>>>
>>> Regards Ard
>>>
>>>
>>>>    ....
>>>>
>>>> }
>>>>
>>>>
>>>> Hope that helps anyone who needs it.
>>>>
>>>> H. Wilson
>>>>
>>>>>> OK so it looks like I have one other issue. Using the configuration as
>>>>>> posted below and sticking to my previous examples, with the addition of
>>>>>> one with whitespace. With the following three in our repository:
>>>>>>
>>>>>>    .North.South.East.WestLand
>>>>>>    .North.South.East.West_Land
>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>
>>>>>> ...using a jcr:contains, with exact name search with NO wild cards: the
>>>>>> first two return properly, but the last one yields no result.
>>>>>>
>>>>>>    filter.addContains(@fullName,
>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>> Land") +"'));
>>>>> I think the space in a contains is seen as an AND by the
>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am not
>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>> though
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>> creating
>>>>>> one token, plus combined with escaping the Illegal Characters (i.e.
>>>>>> spaces),
>>>>>> shouldn't this search work? Thanks again.
>>>>>>
>>>>>> H. Wilson
