jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Fri, 27 Aug 2010 07:05:48 GMT
On Thu, Aug 26, 2010 at 10:22 PM, H. Wilson <wilsonh@randdss.com> wrote:
>
> On 08/26/2010 12:57 PM, Ard Schrijvers wrote:
>>
>> Hello Wilson et al,
>> In that case, sorry for my late help. I am not always in a position to
>> take time to help. Also, query expansion with wildcard searching is
>> imo not Lucene's best part. Anyway, for those interested, I could try
>> to dig up some mails I sent internally in the past. It is something
>> that is hard to grasp without some Lucene background, though.
>
> No need to apologize. I was tempted to bump it after a month, but I wasn't
> sure if that violated forum etiquette. I hope the OP today is getting as
> much out of this as I am!
>>
>> Yes, this is how I meant it, with the analyser part.
>> What I meant is that you would need this *only* if you also want the
>> original 'free text indexing' of the property. Thus, if you want
>> to index a property both with the original Jackrabbit indexing and
>> with a keyword-like one, you need the property twice. Normally,
>> though, you don't need this.
>> You're welcome.
>>
>> Thank you for reporting back that it works.
>>
>> Regards Ard
>
> OK, so it looks like I have one other issue. I am using the configuration
> as posted below and sticking to my previous examples, with the addition of
> one with whitespace. With the following three in our repository:
>
>   .North.South.East.WestLand
>   .North.South.East.West_Land
>   .North.South.East.West Land    //yes that's a space
>
> ...using a jcr:contains, with exact name search with NO wild cards: the
> first two return properly, but the last one yields no result.
>
>   filter.addContains(@fullName, '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));

I think the space in a contains is seen as an AND by the
Jackrabbit/Lucene QueryParser. I should test this, however, as I am not
sure. Perhaps you can put quotes around it, though I am not sure that
works.
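
To make that guess concrete, here is a minimal, self-contained Java sketch. It is a toy model, not the Jackrabbit/Lucene QueryParser: the class ContainsSpaceSketch and its methods are hypothetical, and the rule that unquoted whitespace separates AND-ed terms while double quotes keep a phrase together is exactly the untested assumption above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model only: this is NOT Jackrabbit's or Lucene's query parser. */
final class ContainsSpaceSketch {

    /** Index side: a lower-casing keyword analyzer keeps the whole value as one token. */
    static String indexToken(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /**
     * Query side (assumed behaviour): whitespace separates terms unless the
     * text is wrapped in double quotes, in which case it stays one term.
     */
    static List<String> queryTerms(String searchText) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : searchText.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle phrase mode, drop the quote itself
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, terms);         // unquoted whitespace ends the current term
            } else {
                current.append(Character.toLowerCase(c));
            }
        }
        flush(current, terms);
        return terms;
    }

    private static void flush(StringBuilder current, List<String> terms) {
        if (current.length() > 0) {
            terms.add(current.toString());
            current.setLength(0);
        }
    }

    /** Terms are combined with AND: every term must equal the single indexed token. */
    static boolean matches(String storedValue, String searchText) {
        String token = indexToken(storedValue);
        List<String> terms = queryTerms(searchText);
        if (terms.isEmpty()) {
            return false;
        }
        for (String term : terms) {
            if (!term.equals(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String stored = ".North.South.East.West Land";
        System.out.println(matches(stored, ".North.South.East.West Land"));      // false
        System.out.println(matches(stored, "\".North.South.East.West Land\""));  // true
        System.out.println(matches(".North.South.East.West_Land",
                                   ".North.South.East.West_Land"));              // true
    }
}
```

In this model the unquoted search is split into two terms, neither of which equals the single keyword token, while the quoted search matches. Whether the real parser behaves the same way would still have to be tested against the repository.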

Regards Ard

>
> According to the Lucene documentation, KeywordAnalyzer should be creating
> one token. Combined with escaping the illegal characters (i.e. spaces),
> shouldn't this search work? Thanks again.
>
> H. Wilson
>>>
>>> H. Wilson
>>>
>>> repository.xml (modified both SearchIndex tags to include an
>>> indexingConfiguration):
>>>
>>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>>
>>> ....
>>> <param name="indexingConfiguration" value="${rep.home}/indexing_configuration.xml"/>
>>>
>>> </SearchIndex>
>>>
>>> indexing_configuration.xml:
>>>
>>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>>     <analyzers>
>>>         <analyzer class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>>             <property>fullName</property>
>>>         </analyzer>
>>>     </analyzers>
>>> </configuration>
>>>
>>> LowerCaseKeywordAnalyzer.java:
>>>
>>> package org.mycompany.lucene.analysis;
>>>
>>> import java.io.Reader;
>>> import org.apache.lucene.analysis.KeywordAnalyzer;
>>> import org.apache.lucene.analysis.LowerCaseFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>>
>>> public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {
>>>
>>>     // Emit the whole field value as a single token, then lower-case it.
>>>     public TokenStream tokenStream(String field, final Reader reader) {
>>>         TokenStream keywordTokenStream = super.tokenStream(field, reader);
>>>         return new LowerCaseFilter(keywordTokenStream);
>>>     }
>>> }
>>>
>>> Our search class has a method which then does the following:
>>>
>>> public OurParameter[] getOurParameters (String searchTerm, String srchField) {
>>>     // srchField in this case was fullName
>>>
>>>     TransientRepository repository = new TransientRepository (OUR_REPO_CONFIG, OUR_REPO_LOCATION);
>>>     Session session = repository.login ();
>>>     List<Class> classes = new ArrayList<Class>();
>>>     classes.add (OurParameter.class);
>>>     Mapper mapper = new AnnotationMapperImpl (classes);
>>>     ObjectContentManager ocm = new ObjectContentManagerImpl (session, mapper);
>>>     queryManager = ocm.getQueryManager();
>>>     FilterImpl filter = (FilterImpl) queryManager.createFilter (OurParameter.class);
>>>     filter.addContains (srchField,
>>>         org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(searchTerm).replaceAll ("'", "''"));
>>>     // (that last was replace all single ticks with two ticks, I honestly
>>>     // can't remember why though)
>>>     Query query = queryManager.createQuery (filter);
>>>     Collection<OurParameter> resultsCollection = (Collection<OurParameter>) ocm.getObjects(query);
>>>
>>>     // convert to an array, do some other stuff, and return...
>>> }
>>>
>>>
>>> On 08/26/2010 10:42 AM, Ard Schrijvers wrote:
>>>
>>> On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<wilsonh@randdss.com>  wrote:
>>>
>>>  Ard,
>>>
>>> I have this same problem, however my scenario involves underscores rather
>>> than hyphens. Although since Chris seems to be seeing the same exact
>>>
>>> It is because hyphens, just like underscores, are characters the
>>> standard Lucene analyzer splits on. This, combined with the query
>>> expansion that happens for wildcard searches in Lucene, causes your issues:
>>>
>>> behavior as I was, I imagine we are both stuck on the same issue. After
>>> scouring the forums for the solution, and not seeing your mentioned
>>> solution, I actually posted my problem as detailed as possible here (
>>> http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
>>> jcr:like was not an option for me, in this case, as our client wanted the
>>> option for case-insensitive searches. Is there any chance you could please
>>> narrow down whereabouts the post was which already covered this? Thanks for
>>>
>>> I can't seem to find my post again. But, I'll give you a quite simple
>>> solution:
>>>
>>> If you want to have the normal indexing of the property for normal
>>> searching, but also want to have the yyy* option, you need to
>>> duplicate the property also in another property. If your property,
>>> like
>>>
>>> .North.South.East.WestLand
>>>
>>> is only needed for the one you describe with wildcard searching, you
>>> only need it once. Now, suppose, your property is called myProp.
>>>
>>> To your configuration.xml add:
>>>
>>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>>   <analyzers>
>>>         <analyzer class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>>             <property>myProp</property>
>>>         </analyzer>
>>>   </analyzers>
>>> </configuration>
>>>
>>> Your LowerCaseKeywordAnalyzer is very simple: it extends
>>>
>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
>>> and in the method
>>>
>>>  TokenStream tokenStream(String fieldName,Reader reader)
>>>
>>> after calling the super, you invoke Lucene's LowerCaseFilter.
>>>
>>> That is all (after you do a re-index of your repository). Since now a
>>> -, or _ or ~ or whatever is not seen as a token to split on, but you
>>> still use lowercase filter, you can do exactly what you want.
>>>
>>> Do the words need to be split on spaces, however? No problem, just add
>>> a WhitespaceTokenizer from Lucene. It is actually pretty simple.
>>>
>>> Hope this helps,
>>>
>>> Regards Ard
>>>
>>> your time.
>>>
>>> *H. Wilson*
>>>
>>>
>>> On 08/26/2010 04:59 AM, Ard Schrijvers wrote:
>>>
>>> Hello,
>>>
>>> You can search the archives (mail from me) for wildcard searching
>>> issues related to the below. There was someone having similar issues. I
>>> explained the wildcard difficulties. Take a look at jcr:like for your
>>> use cases.
>>>
>>> Regards Ard
>>>
>>> On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
>>> <cdunstall@csu.edu.au>    wrote:
>>>
>>> Hi all,
>>>
>>> I'm having some trouble with an XPath query, where I'm searching for
>>> users with hyphens in their name.
>>>
>>> I'm using:
>>> jcr:contains(*/*/*,'query')
>>>
>>> And it returns some odd results.
>>>
>>> I have two users, Sophie-Allen and Sophie-Anne. When I search for
>>> 'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
>>> (with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get
>>> zero results returned.  Oddly, if I search for either 'sophie-allen' or
>>> 'sophie-anne' I get the respective user details back fine. Shouldn't I
>>> get both users back when escaping the hyphen? Have I missed something in
>>> the spec?
>>>
>>> One other odd thing is the addition of an asterisk (*).  Searching for
>>> 'soph' and 'soph*' return the same result (both users), but if I search
>>> for 'sophie-allen*', I get zero results, unlike when searching for just
>>> 'sophie-allen'. Searching for 'sophie-a*' has the same result as without
>>> the asterisk, i.e. nothing.
>>>
>>> The JSR-170 spec doesn't say anything (that I can find) but is the
>>> asterisk a wildcard in the jcr:contains function or does it serve some
>>> other purpose?
>>>
>>> Your assistance is greatly appreciated,
>>>
>>> Regards,
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW, Australia
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>>>
>
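
As an illustration of the analyzer explanation quoted above (hyphens and underscores being split on, versus a keyword analyzer keeping one token), the following Java sketch models prefix matching against both kinds of token. It is a toy model under stated assumptions: the class HyphenPrefixSketch and its methods are hypothetical, the split on '-', '_' and whitespace follows the description in the thread rather than Lucene's actual tokenizer grammar, and query-side analysis is ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy model only: it imitates the behaviour described in the thread, not Lucene itself. */
final class HyphenPrefixSketch {

    /** Assumed splitting analysis: lower-case, then split on '-', '_' and whitespace. */
    static List<String> splittingTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[-_\\s]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Keyword-style analysis: lower-case, whole value kept as one token. */
    static List<String> keywordTokens(String value) {
        return Collections.singletonList(value.toLowerCase(Locale.ROOT));
    }

    /** A prefix search such as sophie-a* matches if a single token starts with the prefix. */
    static boolean prefixMatches(List<String> tokens, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "Sophie-Allen";
        System.out.println(splittingTokens(name));                            // [sophie, allen]
        System.out.println(prefixMatches(splittingTokens(name), "soph"));     // true
        System.out.println(prefixMatches(splittingTokens(name), "sophie-a")); // false
        System.out.println(prefixMatches(keywordTokens(name), "sophie-a"));   // true
    }
}
```

In this model no single split token starts with 'sophie-a', which mirrors the zero results reported at the start of the thread, while the single lower-cased keyword token does match the prefix.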
