Mailing-List: contact users-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@jackrabbit.apache.org
Message-ID: <4C7D6A4B.8090302@randdss.com>
Date: Tue, 31 Aug 2010 16:47:07 -0400
From: "H. Wilson" <wilsonh@randdss.com>
Reply-To: wilsonh@randdss.com
Organization: R & D Software Systems
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US;
 rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: users@jackrabbit.apache.org
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
References: 
 <3B889EE9152F28429040ABDC2C606F2E04A201E6EB@MAIL01.CSUMain.csu.edu.au>
	<AANLkTi=dV2WG7xmXLcrhHO0qRD=DpXFLcNb3LCcmUjPh@mail.gmail.com>
	<4C7671D9.3060205@randdss.com>
	<AANLkTinAqhjNcC5P2_wB6pSd_X358x91sz59D_J4L9zx@mail.gmail.com>
	<4C7694CD.2040501@randdss.com>
	<AANLkTikTMcy0jgxx5B7pdMJD4V24HsVQ3wAzOO=OTGg+@mail.gmail.com>
	<4C76CD12.2000109@randdss.com>
	<AANLkTinJQKZFQbzVtisGVYctBrhe0QTU_O5r72nqhYJ5@mail.gmail.com>
	<4C780C98.9050208@randdss.com>
	<AANLkTimAbmQKgD2zc2TGESHrUt4yL1gWq9_5ZQbOqy9H@mail.gmail.com>
	<4C7BB26E.4000400@randdss.com>
	<AANLkTin78xq50EQomhOfmn5rw8oqOeue1GJh_5B1hiKR@mail.gmail.com>
	<4C7BE055.1020209@randdss.com>
 <AANLkTimB8x-htqHtWeuMYCRP9rAECvU0MvnFoR2ouPgn@mail.gmail.com>
In-Reply-To: <AANLkTimB8x-htqHtWeuMYCRP9rAECvU0MvnFoR2ouPgn@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>
>> Given the following parameters in the repository:
>>
>>    .North.South.East.WestLand
>>    .North.South.East.West_Land
>>    .North.South.East.West Land    //yes that's a space
>>
>> The following exact name, case sensitive queries worked as expected for each
>> of the three parameters:
>>
>>    filter.orJCRExpression ("jcr:like(@" + srchField + ",'" + Text.escapeIllegalXpathSearchChars (searchTerm) + "')");  //case sensitive
> jcr:like does not depend on any analyser but on the stored field, so
> it is not strange that it still works.
I expected this too; I just try to be as thorough as possible when
posting anywhere. I am disappointed enough that I haven't figured this
out on my own.
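
For anyone following along, the three predicate forms compared in this message look roughly like the strings below once the Java concatenation is resolved. This is only a minimal sketch: the class and helper names are hypothetical, doubling a single quote inside the literal is an assumption, and Text.escapeIllegalXpathSearchChars is deliberately left out so the example runs without Jackrabbit on the classpath.

```java
import java.util.Locale;

// Hypothetical helper, not part of Jackrabbit or of the filter class quoted in this thread.
class XPathPredicateSketch {

    // Assumption: a single quote inside the string literal is escaped by doubling it.
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Case-sensitive match; per Ard's note, this is evaluated against the stored value.
    static String like(String prop, String pattern) {
        return "jcr:like(@" + prop + ", " + literal(pattern) + ")";
    }

    // Case-insensitive exact match: lower-case both the property and the search term.
    static String lowerCaseEquals(String prop, String value) {
        return "fn:lower-case(@" + prop + ") = " + literal(value.toLowerCase(Locale.ROOT));
    }

    // Full-text match; this is the form that goes through the configured analyzer.
    static String contains(String prop, String text) {
        return "jcr:contains(@" + prop + ", " + literal(text) + ")";
    }

    public static void main(String[] args) {
        String term = ".North.South.East.West Land";
        System.out.println(like("fullName", term));
        System.out.println(lowerCaseEquals("fullName", term));
        System.out.println(contains("fullName", term));
    }
}
```

Printing the three strings side by side makes it easier to see which searches depend on the analyzer configuration and which do not.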
>> The following exact name query, case insensitive, worked for only the
>> parameter with a fullName with a whitespace character:
>>
>>    filter.addJCRExpression ("fn:lower-case(@" + srchField + ") = '" + Text.escapeIllegalXpathSearchChars (searchTerm.toLowerCase()) + "'");
>>
>> The following exact name queries, case insensitive, stopped working for the
>> fullnames WITHOUT a whitespace character:
>>
>>    filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars (searchTerm));
>>
>> Again, the only change I made was to the analyzer. I didn't remove my
>> "workaround" yet; I just want to confirm I properly changed the analyzer
>> to figure out how the tokens were working. I should note that the output
>> from the Analyzer only showed one Token per field, which I believe is what
>> we were looking for. That leaves me as perplexed as before.
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>>    ...
>>
>>    public TokenStream tokenStream ( String field, final Reader reader  ) {
>>             System.out.println ("TOKEN STREAM for field: " + field);
>>             TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>
>>         //changed for testing
>>             TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream );
>>             final Token reusableToken = new Token();
>>             try {
>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>                 while ( mytoken != null  ) {
>>                     System.out.println ("[" + mytoken.term() + "]");
>>                     mytoken = lowerCaseStream.next (mytoken);
>>                 }
>>                 //lowerCaseStream.reset();  //uncommenting this did not change results.
>>             }
>>             catch  (IOException ioe) {
>>                 System.err.println ("ERROR: " + ioe.toString());
>>             }
>>
> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
> on the keywordTokenStream before using it again.
>
> Regards Ard
>
>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>         }
>>
>>    ...
I was really excited when I saw your email this morning. However,
resetting keywordTokenStream as the last line in the "try" block resulted
in no change. I also tried uncommenting the lowerCaseStream.reset() line
in an act of desperation, with no difference. I must be missing something
completely obvious at this point... look at a problem too long and the
obvious fails to jump out at you.
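
A possible explanation, not verified against the Lucene source: the debug loop reads the underlying Reader to the end, and resetting a token stream wrapped around that Reader does not necessarily rewind the Reader itself. The sketch below uses only java.io to show the effect and one way around it, which is to buffer the text once, print it, and hand a fresh Reader to the real analysis chain. The class and method names are hypothetical and no Lucene classes are involved.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical demo class; it only illustrates that a Reader is consumed once.
class OneShotReaderDemo {

    // Read everything the Reader has left, much as a keyword-style tokenizer
    // reads its whole input to build a single token.
    static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Debug-friendly variant: buffer the text, print the would-be lower-cased
    // token, then return a fresh Reader over the same text for real use.
    static Reader peekAndRewrap(String field, Reader reader) {
        String text = drain(reader);
        System.out.println(field + ": [" + text.toLowerCase(Locale.ROOT) + "]");
        return new StringReader(text);
    }

    public static void main(String[] args) {
        Reader once = new StringReader(".North.South.East.West Land");
        System.out.println("first pass:  [" + drain(once) + "]");
        // The Reader is already at end of input, so nothing is left to read here.
        System.out.println("second pass: [" + drain(once) + "]");

        Reader fresh = peekAndRewrap("fullName",
                new StringReader(".North.South.East.West Land"));
        System.out.println("after peek:  [" + drain(fresh) + "]");
    }
}
```

If this is what is happening in the analyzer, the debug printing would need to work on a buffered copy of the field text, with the returned token stream built from a new Reader over that copy rather than from the stream the loop already consumed.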

H. Wilson
>> Thanks.
>>
>> H. Wilson
>>
>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wilsonh@randdss.com>    wrote:
>>>>   Ard,
>>>>
>>>> You are absolutely right, and this didn't make sense to me either. I
>>>> think I was too worn out from my week and too excited to have code that
>>>> "worked" to notice the obvious: this must be a workaround. However, I
>>>> will need a little guidance on how to inspect the tokens. I have Luke,
>>>> but never really understood how to use it properly. Could you give me a
>>>> clear list of steps, or point me to a resource I missed, on how I would
>>>> go about inspecting tokens during insert/search? Thanks.
>>> I'd just print them to your console with Token#term() or use a
>>> debugger. If you do that during indexing and searching, I think you
>>> will see some difference in the tokens that explains *why* Lucene
>>> doesn't find a hit for your use case with spaces.
>>>
>>> Luke is hard to use with Jackrabbit's multi-index setup and its field
>>> value prefixing. The prefixing is unfortunate and no longer strictly
>>> necessary; it exists for historical reasons, from the days when Lucene
>>> could not handle very many unique field names.
>>>
>>> Regards Ard
>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wilsonh@randdss.com> wrote:
>>>>>> OK, well I got the spaces part figured out, and will post it for anyone
>>>>>> who needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>> During testing, I determined that if you performed the following query
>>>>>> for the exact fullName property:
>>>>>>
>>>>>>    filter.addContains ("fullName", Text.escapeIllegalXpathSearchChars (".North.South.East.West Land"));
>>>>>>
>>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>>> it would return results:
>>>>>>
>>>>>>    filter.addContains ("fullName", Text.escapeIllegalXpathSearchChars (".North.South.East.West Lan*"));
>>>>> This does not make sense...see below
>>>>>
>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>> wanted, if a search string contained spaces, did not contain wild cards,
>>>>>> and the user was not concerned with case sensitivity, I used
>>>>>> fn:lower-case. So I ended up with the following excerpt (our clients
>>>>>> wanted options for case sensitive and case insensitive searching).
>>>>>>
>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>>         String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>
>>>>>>    .....
>>>>>>
>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>
>>>>>>        //jcr:like for case sensitive
>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField + ", '" + Text.escapeIllegalXpathSearchChars (searchTerm) + "')");
>>>>>>
>>>>>>    }
>>>>>>    else {
>>>>>>
>>>>>>        //only use fn:lower-case if there are spaces, with NO wild cards
>>>>>>
>>>>>>        if ( searchTerm.contains (" ") && !searchTerm.contains ("*") && !searchTerm.contains ("?") ) {
>>>>>>
>>>>>>            filter.addJCRExpression ("fn:lower-case(@" + srchField + ") = '" + Text.escapeIllegalXpathSearchChars (searchTerm.toLowerCase()) + "'");
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>        else {
>>>>>>
>>>>>>            //jcr:contains for case insensitive
>>>>>>            filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars (searchTerm));
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>    }
>>>>> This looks to me like a workaround for the real problem, because it
>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>> created by your analyser? Make sure you inspect the tokens during
>>>>> indexing (just store something) and during searching (just search in
>>>>> the property). I am quite sure you'll see the issue then. Perhaps it is
>>>>> something with Text.escapeIllegalXpathSearchChars, though it seems that
>>>>> it should leave spaces untouched.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>
>>>>>>    ....
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Hope that helps anyone who needs it.
>>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>>>> OK so it looks like I have one other issue. I am using the
>>>>>>>> configuration as posted below and sticking to my previous examples,
>>>>>>>> with the addition of one with whitespace. With the following three in
>>>>>>>> our repository:
>>>>>>>>
>>>>>>>>    .North.South.East.WestLand
>>>>>>>>    .North.South.East.West_Land
>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>
>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards, the
>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>
>>>>>>>>    filter.addContains ("fullName",
>>>>>>>>        org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars (".North.South.East.West Land"));
>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>> Jackrabbit/Lucene QueryParser. I should test this, however, as I am not
>>>>>>> sure. Perhaps you can put quotes around it; I am not sure whether that
>>>>>>> works, though.
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>> creating one token. Combined with escaping the illegal characters
>>>>>>>> (i.e. spaces), shouldn't this search work? Thanks again.
>>>>>>>>
>>>>>>>> H. Wilson