jackrabbit-users mailing list archives

From "H. Wilson" <wils...@randdss.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Fri, 03 Sep 2010 17:38:19 GMT
  Interesting twist... New test cases (no code modifications) show the 
following queries with trailing question marks DO work:

    ?North.South.East.West-Lan?
    *North.South.East.West-Lan?
    .North.*.West-Lan?

While the following one still does not:

    .North.South.East.West-Lan?
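
One way to separate the wildcard semantics from the query parser question raised further down the thread is to test the patterns against the stored value outside Jackrabbit. The sketch below is plain Java, not Lucene or Jackrabbit code, and WildcardSketch and its methods are hypothetical names. It assumes the field is indexed as a single keyword token, that '*' stands for any run of characters and that '?' stands for exactly one character. Under those assumptions all four patterns above match the stored value, which would point at query parsing rather than wildcard matching as the place to look.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for this thread; not part of Jackrabbit or Lucene. */
final class WildcardSketch {

    /** Translates a '*' / '?' wildcard pattern into a regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush the literal text collected so far, quoted so that
                // characters such as '.' and '-' carry no regex meaning.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term (assumed to be one keyword token) matches the pattern. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        String term = ".North.South.East.West-Land";
        String[] patterns = {
            "?North.South.East.West-Lan?",
            "*North.South.East.West-Lan?",
            ".North.*.West-Lan?",
            ".North.South.East.West-Lan?"
        };
        for (String p : patterns) {
            System.out.println(p + " -> " + matches(p, term));
        }
    }
}
```

If the real queries behave differently from this sketch, the difference has to come from somewhere between the query string and the term lookup, for example escaping or query parsing.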


*H. Wilson*
R&D Software Systems, Inc


On 09/03/2010 03:45 AM, Ard Schrijvers wrote:
> Hello Wilson,
>
> On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson<wilsonh@randdss.com>  wrote:
>
>> Some successful queries I ran in my unit tests (out of the 1200+ test
>> queries I have ...) (all of these were tried once as shown and once as
>> "string".toLowerCase() )
>>
>>    .North.South.East.West*
>>    .North.South.East.West-*
>>    .North.South.East.West-Land
>>    *West-Land
>>    .North*
>>
>>
>> Unsuccessful include:
>>
>>    .North.South.East.West-Lan?
>>    .North.South.East.West Land
> I didn't look at code, but I think the analyzer part is just fine. I
> suspect the jackrabbit queryparser to mangle dashes and spaces. I am
> however not sure how you could avoid this. I'd have to look into it.
> Though, you might want to check the JackrabbitQueryParser what it
> makes of your ' .North.South.East.West-Lan?' or
> '.North.South.East.West Land'
>
> Regards Ard
>
>>
>> Good Luck!
>>
>> *H. Wilson*
>>
>>
>> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>>> filter, you actually got the record.
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>>>
>>> -----Original Message-----
>>> From: Dunstall, Christopher [mailto:cdunstall@csu.edu.au]
>>> Sent: Thursday, 2 September 2010 2:19 PM
>>> To: users@jackrabbit.apache.org
>>> Subject: RE: Problems with hyphen in JSR-170 XPath query using
>>> jcr:contains
>>>
>>> I've got the customised Analyzer and Tokenizer working, but it seems I'm
>>> back at square one, maybe even further back because now it looks like it's
>>> being case sensitive.
>>>
>>> My Analyzer:
>>>
>>> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>>>    private static final Logger LOGGER =
>>>        LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>>>
>>>    public TokenStream tokenStream(String field, final Reader reader) {
>>>      LOGGER.info("Custom Analyzer [" + field + "], ["
>>>          + ((reader != null) ? reader.toString() : "") + "]");
>>>
>>>      TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>>>      return keywordTokenStream;
>>>      //return (new LowerCaseFilter(keywordTokenStream));
>>>    }
>>> }
>>>
>>> My HyphenKeywordTokenizer class is practically a direct copy of
>>> KeywordTokenizer, where it emits the entire input as a single token.  As you
>>> can see above, I'm not using the lower case filter, just to see what
>>> happens.
>>>
>>> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
>>> 'Bob' 'Arlington-Smythe'.
>>>
>>> A search for 'Sophie-Anne' produces the user's record, however, a search
>>> for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now,
>>> even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
>>> now? From what H. Wilson has found, it doesn't look like it will solve the
>>> problem.
>>>
>>> The query being used is:
>>> //*[@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
>>> 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
>>> descending]
>>>
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>>>
>>> -----Original Message-----
>>> From: H. Wilson [mailto:wilsonh@randdss.com]
>>> Sent: Wednesday, 1 September 2010 6:47 AM
>>> To: users@jackrabbit.apache.org
>>> Subject: Re: Problems with hyphen in JSR-170 XPath query using
>>> jcr:contains
>>>
>>>
>>> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>>>> Given the following parameters in the repository:
>>>>>
>>>>>     .North.South.East.WestLand
>>>>>     .North.South.East.West_Land
>>>>>     .North.South.East.West Land    //yes that's a space
>>>>>
>>>>> The following exact name, case sensitive queries worked as expected for
>>>>> each of the three parameters:
>>>>>
>>>>>     filter.orJCRExpression ("jcr:like(@" + srchField + ",'"
>>>>>         + Text.escapeIllegalXpathSearchChars (searchTerm) + "')");  //case sens.
>>>> jcr:like does not depend on any analyser but on the stored field, so
>>>> this is not strange that it still works.
>>> I expected this too, I just try to be as thorough as possible when
>>> posting anywhere. I am disappointed enough I haven't figured this out on
>>> my own.
>>>>> The following exact name query, case insensitive, worked for only the
>>>>> parameter with a fullName with a whitespace character:
>>>>>
>>>>>     filter.addJCRExpression ("fn:lower-case(@" + srchField + ") = '"
>>>>>         + Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase()) + "'");
>>>>>
>>>>> The following exact name queries, case insensitive, stopped working for
>>>>> the fullnames WITHOUT a whitespace character:
>>>>>
>>>>>     filter.addContains ( srchField,
>>>>>         Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>
>>>>> Again, the only change I made was to the analyzer, I didn't remove my
>>>>> "workaround" yet, and I just want to confirm I properly changed the
>>>>> analyzer to figure out how the tokens were working. Oh I should note,
>>>>> the output from the Analyzer only showed one Token per field, which I
>>>>> believe is what we were looking for. Which leaves me as perplexed as
>>>>> before.
>>>>>
>>>>> LowerCaseKeywordAnalyzer.java:
>>>>>
>>>>>     ...
>>>>>
>>>>>     public TokenStream tokenStream ( String field, final Reader reader ) {
>>>>>              System.out.println ("TOKEN STREAM for field: " + field);
>>>>>              TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>>>>
>>>>>          //changed for testing
>>>>>              TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream );
>>>>>              final Token reusableToken = new Token();
>>>>>              try {
>>>>>                  Token mytoken = lowerCaseStream.next (reusableToken);
>>>>>                  while ( mytoken != null  ) {
>>>>>                      System.out.println ("[" + mytoken.term() + "]");
>>>>>                      mytoken = lowerCaseStream.next (mytoken);
>>>>>                  }
>>>>>                  //lowerCaseStream.reset();  //uncommenting this did not change results.
>>>>>              }
>>>>>              catch  (IOException ioe) {
>>>>>                  System.err.println ("ERROR: " + ioe.toString());
>>>>>              }
>>>>>
>>>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>>>> on the keywordTokenStream before using it again.
>>>>
>>>> Regards Ard
>>>>
>>>>>              return (new LowerCaseFilter ( keywordTokenStream ) );
>>>>>          }
>>>>>
>>>>>     ...
>>> I was real excited when I saw your email this morning. However,
>>> resetting keywordTokenStream as the last line in the "try" resulted in
>>> no change. I also tried uncommenting the lowerCaseStream.reset line in
>>> an act of desperation with no difference. I must be missing something
>>> completely obvious at this point... look at a problem too long and the
>>> obvious fails to jump out at you...
>>>
>>> H. Wilson
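
Ard's point above is that the debug loop consumes the stream, so whatever is built from the same source afterwards has nothing left to read. One way to inspect the input without disturbing the stream that is returned is to copy the Reader's content into a String first, print it, and build the real stream from a fresh StringReader. The sketch below shows only that buffering step in plain Java. ReaderPeekSketch and its methods are hypothetical names, and wiring the fresh reader into the analyzer (for example passing it to super.tokenStream) is an assumption that is left out here. Whether an exhausted Reader also explains why reset() made no difference is likewise an assumption, not something the thread confirms.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

/** Hypothetical helper for this thread; not part of Jackrabbit or Lucene. */
final class ReaderPeekSketch {

    /** Reads the reader to the end and returns everything it produced. */
    static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the reader has no more characters (consumes one character otherwise). */
    static boolean isExhausted(Reader reader) {
        try {
            return reader.read() == -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader original = new StringReader(".North.South.East.West-Land");

        // Debug step: look at the content, lowercased as the filter would do.
        String text = drain(original);
        System.out.println("[" + text.toLowerCase(Locale.ROOT) + "]");

        // The original reader is now used up.
        System.out.println("original exhausted: " + isExhausted(original));

        // A fresh reader over the buffered text still delivers the full input.
        Reader fresh = new StringReader(text);
        System.out.println("fresh content: " + drain(fresh));
    }
}
```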
>>>>> Thanks.
>>>>>
>>>>> H. Wilson
>>>>>
>>>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<wilsonh@randdss.com>
>>>>>> wrote:
>>>>>>>    Ard,
>>>>>>>
>>>>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>>>>> think I was too worn out from my week and too excited to have code that
>>>>>>> "worked" to notice the obvious... this must be a workaround. However, I
>>>>>>> will need a little guidance on how to inspect the tokens. I have Luke,
>>>>>>> but never really understood how to use it properly. Could you give me a
>>>>>>> clear list of steps, or point me to a resource I missed, on how I would
>>>>>>> go about inspecting tokens during insert/search? Thanks.
>>>>>> I'd just print them to your console with Token#term() or use a
>>>>>> debugger. If you do that during indexing and searching, I think you
>>>>>> must see some difference in the token that explains *why* Lucene
>>>>>> doesn't find a hit for your usecase with spaces.
>>>>>>
>>>>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>>>>> as the field value prefixing: It is unfortunate and not completely
>>>>>> necessary any more but has some historical reasons from Lucene back in
>>>>>> the days when it could not handle very many unique fieldnames
>>>>>>
>>>>>> Regards Ard
>>>>>>
>>>>>>> H. Wilson
>>>>>>>
>>>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<wilsonh@randdss.com>
>>>>>>>>    wrote:
>>>>>>>>>    OK, well I got the spaces part figured out, and will post it for
>>>>>>>>> anyone who needs it. Putting quotes around the spaces unfortunately
>>>>>>>>> did not work. During testing, I determined that if you performed the
>>>>>>>>> following query for the exact fullName property:
>>>>>>>>>
>>>>>>>>>      filter.addContains ( @fullName,
>>>>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>>>>
>>>>>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>>>>>> it would return results:
>>>>>>>>>
>>>>>>>>>     filter.addContains ( @fullName,
>>>>>>>>>     '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>>>>> This does not make sense...see below
>>>>>>>>
>>>>>>>>> But since I did not want to throw in wild cards where they might not
>>>>>>>>> be wanted, if a search string contained spaces, did not contain wild
>>>>>>>>> cards and the user was not concerned with case sensitivity, I used the
>>>>>>>>> fn:lower-case. So I ended up with the following excerpt (our clients
>>>>>>>>> wanted options for case sensitive and case insensitive searching).
>>>>>>>>>
>>>>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>>>>>         String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>>>>
>>>>>>>>>     .....
>>>>>>>>>
>>>>>>>>>     if ( performCaseSensitiveSearch) {
>>>>>>>>>
>>>>>>>>>         //jcr:like for case sensitive
>>>>>>>>>         filter.orJCRExpression ("jcr:like(@" + srchField + ", '"
>>>>>>>>>             + Text.escapeIllegalXpathSearchChars (searchTerm) + "')");
>>>>>>>>>
>>>>>>>>>     }
>>>>>>>>>     else {
>>>>>>>>>
>>>>>>>>>         //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>>>>
>>>>>>>>>         if ( searchTerm.contains (" ") && !searchTerm.contains ("*")
>>>>>>>>>                 && !searchTerm.contains ("?") ) {
>>>>>>>>>
>>>>>>>>>             filter.addJCRExpression ("fn:lower-case(@" + srchField + ") = '"
>>>>>>>>>                 + Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase()) + "'");
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         else {
>>>>>>>>>
>>>>>>>>>             //jcr:contains for case insensitive
>>>>>>>>>             filter.addContains ( srchField,
>>>>>>>>>                 Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>>>
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>     }
>>>>>>>> This seems to me a workaround around the real problem, because, it
>>>>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>>>>> indexing (just store something) and during searching: just search in
>>>>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>>>>> something with Text.escapeIllegalXpathSearchChars though it seems
>>>>>>>> that it should leave spaces untouched
>>>>>>>>
>>>>>>>> Regards Ard
>>>>>>>>
>>>>>>>>
>>>>>>>>>     ....
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hope that helps anyone who needs it.
>>>>>>>>>
>>>>>>>>> H. Wilson
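
The branching in the excerpt above can be tried in isolation by reducing it to a function that returns the XPath predicate as a string. The sketch below is a self-contained approximation, not the poster's code: SearchPredicateSketch, buildPredicate and escapeStandIn are hypothetical names, escapeStandIn only doubles single quotes and is a stand-in for Text.escapeIllegalXpathSearchChars, and it assumes that the poster's filter.addContains helper ends up producing a jcr:contains predicate on the given property.

```java
import java.util.Locale;

/** Hypothetical helper for this thread; not part of Jackrabbit. */
final class SearchPredicateSketch {

    /** Stand-in for Text.escapeIllegalXpathSearchChars: only doubles single quotes. */
    static String escapeStandIn(String s) {
        return s.replace("'", "''");
    }

    /** Picks the predicate the same way the quoted excerpt does. */
    static String buildPredicate(boolean caseSensitive, String searchTerm, String field) {
        if (caseSensitive) {
            // jcr:like compares against the stored value, so no analyzer is involved.
            return "jcr:like(@" + field + ", '" + escapeStandIn(searchTerm) + "')";
        }
        boolean hasSpace = searchTerm.contains(" ");
        boolean hasWildcard = searchTerm.contains("*") || searchTerm.contains("?");
        if (hasSpace && !hasWildcard) {
            // Workaround branch: exact, case-insensitive comparison for terms with spaces.
            return "fn:lower-case(@" + field + ") = '"
                    + escapeStandIn(searchTerm.toLowerCase(Locale.ROOT)) + "'";
        }
        // Everything else goes through the full-text index.
        return "jcr:contains(@" + field + ", '" + escapeStandIn(searchTerm) + "')";
    }

    public static void main(String[] args) {
        System.out.println(buildPredicate(true, ".North.South.East.West Land", "fullName"));
        System.out.println(buildPredicate(false, ".North.South.East.West Land", "fullName"));
        System.out.println(buildPredicate(false, ".North.South.East.West Lan*", "fullName"));
    }
}
```

Printing the predicate for each test term makes it easy to see which branch a failing query took before it reaches the repository.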
>>>>>>>>>
>>>>>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>>>>>> as posted below and sticking to my previous examples, with the
>>>>>>>>>>> addition of one with whitespace. With the following three in our
>>>>>>>>>>> repository:
>>>>>>>>>>>
>>>>>>>>>>>     .North.South.East.WestLand
>>>>>>>>>>>     .North.South.East.West_Land
>>>>>>>>>>>     .North.South.East.West Land    //yes that's a space
>>>>>>>>>>>
>>>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>>>>>> the first two return properly, but the last one yields no result.
>>>>>>>>>>>
>>>>>>>>>>>     filter.addContains(@fullName,
>>>>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am
>>>>>>>>>> not sure. Perhaps you can put quotes around it, not sure if that
>>>>>>>>>> works though
>>>>>>>>>>
>>>>>>>>>> Regards Ard
>>>>>>>>>>
>>>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>>>>> creating one token, plus combined with escaping the Illegal
>>>>>>>>>>> Characters (i.e. spaces), shouldn't this search work? Thanks again.
>>>>>>>>>>>
>>>>>>>>>>> H. Wilson
