Mailing-List: users@jackrabbit.apache.org
Message-ID: <4C7D6A4B.8090302@randdss.com>
Date: Tue, 31 Aug 2010 16:47:07 -0400
From: "H. Wilson"
Reply-To: wilsonh@randdss.com
Organization: R & D Software Systems
To: users@jackrabbit.apache.org
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>
>> Given the following parameters in the repository:
>>
>> .North.South.East.WestLand
>> .North.South.East.West_Land
>> .North.South.East.West Land //yes that's a space
>>
>> The following exact name, case sensitive queries worked as expected for each
>> of the three parameters:
>>
>> filter.orJCRExpression ("jcr:like(@" + srchField
>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')"); //case sens.
> jcr:like does not depend on any analyser but on the stored field, so
> this is not strange that it still works.

I expected this too, I just try to be as thorough as possible when posting anywhere. I am disappointed enough I haven't figured this out on my own.

>> The following exact name query, case insensitive, worked for only the
>> parameter with a fullName with a whitespace character:
>>
>> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>
>> The following exact name queries, case insensitive, stopped working for the
>> fullnames WITHOUT a whitespace character:
>>
>> filter.addContains ( srchField,
>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>
>> Again, the only change I made was to the analyzer, I didn't remove my
>> "workaround" yet, and I just want to confirm I properly changed the analyzer
>> to figure out how the tokens were working. Oh I should note, the output from
>> the Analyzer only showed one Token per field, which I believe is what we
>> were looking for. Which leaves me as perplexed as before.
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>> ...
>>
>> public TokenStream tokenStream ( String field, final Reader reader ) {
>>     System.out.println ("TOKEN STREAM for field: " + field);
>>     TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>
>>     //changed for testing
>>     TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream ) ;
>>     final Token reusableToken = new Token();
>>     try {
>>         Token mytoken = lowerCaseStream.next (reusableToken);
>>         while ( mytoken != null ) {
>>             System.out.println ("[" + mytoken.term() + "]");
>>             mytoken = lowerCaseStream.next (mytoken);
>>         }
>>         //lowerCaseStream.reset(); //uncommenting this did not change results.
>>     }
>>     catch (IOException ioe) {
>>         System.err.println ("ERROR: " + ioe.toString());
>>     }
>>
> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
> on the keywordTokenStream before using it again.
>
> Regards Ard
>
>>     return (new LowerCaseFilter ( keywordTokenStream ) );
>> }
>>
>> ...

I was real excited when I saw your email this morning. However, resetting keywordTokenStream as the last line in the "try" resulted in no change.
I also tried uncommenting the lowerCaseStream.reset line in an act of desperation with no difference. I must be missing something completely obvious at this point... look at a problem too long and the obvious fails to jump out at you...

H. Wilson

>> Thanks.
>>
>> H. Wilson
>>
>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson wrote:
>>>> Ard,
>>>>
>>>> You are absolutely right.. and this didn't make sense to me either. I think
>>>> I was too worn out from my week and too excited to have code that "worked"
>>>> to notice the obvious... this must be a workaround. However, I will need a
>>>> little guidance on how to inspect the tokens. I have Luke, but never really
>>>> understood how to use it properly. Could you give me a clear list of steps,
>>>> or point me to a resource I missed, on how I would go about inspecting
>>>> tokens during insert/search? Thanks.
>>> I'd just print them to your console with Token#term() or use a
>>> debugger. If you do that during indexing and searching, I think you
>>> must see some difference in the token that explains *why* Lucene
>>> doesn't find a hit for your usecase with spaces.
>>>
>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>> as the field value prefixing: It is unfortunate and not completely
>>> necessary any more but has some historical reasons from Lucene back in
>>> the days when it could not handle very many unique fieldnames
>>>
>>> Regards Ard
>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson wrote:
>>>>>> OK, well I got the spaces part figured out, and will post it for anyone who
>>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>> During testing, I determined that if you performed the following query for
>>>>>> the exact fullName property:
>>>>>>
>>>>>> filter.addContains ( @fullName,
>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>
>>>>>> It would return nothing. But tweak it a little and add a wildcard, and it
>>>>>> would return results:
>>>>>>
>>>>>> filter.addContains ( @fullName,
>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>> This does not make sense...see below
>>>>>
>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>> wanted, if a search string contained spaces, did not contain wild cards and
>>>>>> the user was not concerned with case sensitivity, I used the fn:lower-case.
>>>>>> So I ended up with the following excerpt (our clients wanted options for
>>>>>> case sensitive and case insensitive searching).
>>>>>>
>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>> String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>
>>>>>> .....
>>>>>>
>>>>>> if ( performCaseSensitiveSearch) {
>>>>>>
>>>>>>     //jcr:like for case sensitive
>>>>>>     filter.orJCRExpression ("jcr:like(@" + srchField +", '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>
>>>>>> }
>>>>>> else {
>>>>>>
>>>>>>     //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>
>>>>>>     if ( searchTerm.contains (" ") && !searchTerm.contains ("*") &&
>>>>>>          !searchTerm.contains ("?") ) {
>>>>>>
>>>>>>         filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>>         '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>
>>>>>>     }
>>>>>>
>>>>>>     else {
>>>>>>
>>>>>>         //jcr:contains for case insensitive
>>>>>>         filter.addContains ( srchField,
>>>>>>         Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>
>>>>>>     }
>>>>>>
>>>>>> }
>>>>> This seems to me a workaround around the real problem, because, it
>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>> indexing (just store something) and during searching: just search in
>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>> something with Text.escapeIllegalXpathSearchChars though it seems that
>>>>> it should leave spaces untouched
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>
>>>>>> ....
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Hope that helps anyone who needs it.
>>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>>>> OK so it looks like I have one other issue. Using the configuration as
>>>>>>>> posted below and sticking to my previous examples, with the addition of one
>>>>>>>> with whitespace.
>>>>>>>> With the following three in our repository:
>>>>>>>>
>>>>>>>> .North.South.East.WestLand
>>>>>>>> .North.South.East.West_Land
>>>>>>>> .North.South.East.West Land //yes that's a space
>>>>>>>>
>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards: the
>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>
>>>>>>>> filter.addContains(@fullName,
>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am not
>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>>>> though
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be creating
>>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e. spaces),
>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>
>>>>>>>> H. Wilson
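
Note on the "It's a stream!!" exchange earlier in this thread: the sketch below is a minimal illustration, using only java.io rather than the Lucene API, of why a debugging pass can leave nothing for the stream that is returned afterwards. It assumes (this is not confirmed anywhere in the thread) that the keyword token stream reads directly from the Reader handed to tokenStream(), so that calling reset() on the stream would not rewind that Reader. The class name ReaderConsumptionDemo and the helper drain() are hypothetical and stand in for the analyzer chain only loosely.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical demo class; it does not use or reproduce any Lucene type.
class ReaderConsumptionDemo {

    /** Reads everything still available from the reader and lower-cases it. */
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String fullName = ".North.South.East.West Land";

        // One shared Reader: the debug pass consumes it completely,
        // so the second consumer finds it already at end of input.
        Reader shared = new StringReader(fullName);
        System.out.println("debug pass: [" + drain(shared) + "]");
        System.out.println("index pass: [" + drain(shared) + "]");

        // One fresh Reader per consumer: both passes see the full value.
        System.out.println("debug pass: [" + drain(new StringReader(fullName)) + "]");
        System.out.println("index pass: [" + drain(new StringReader(fullName)) + "]");
    }
}
```

If the assumption holds, one option is to keep the debug printing out of tokenStream() entirely, or to copy the field value into a String first and give each consumer its own StringReader, as the second half of main() does.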