jackrabbit-users mailing list archives

From "H. Wilson" <wils...@randdss.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Thu, 26 Aug 2010 16:22:37 GMT
  Finally! I have been hacking away at this here and there for months, 
trying all sorts of analyzers, or no analyzer at all, and modifying my 
queries, all to no avail! Since I always like precise examples when I am 
searching forums, I will post my (nearly) exact solution, both for others 
and so that Ard might verify that this was indeed what he meant.

Ard, I was hoping you could elaborate a little on why we would duplicate 
the property? (I didn't actually do that to get this working perfectly.) 
You lost me a little there. Was it for efficiency? Thanks for everything!

H. Wilson

repository.xml (modified both SearchIndex tags to include an 
indexingConfiguration):

    <SearchIndex
    class="org.apache.jackrabbit.core.query.lucene.SearchIndex">

        ....
        <param name="indexingConfiguration"
        value="${rep.home}/indexing_configuration.xml"/>

    </SearchIndex>


indexing_configuration.xml:

    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
    <analyzers>
    <analyzer
    class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
    <property>fullName</property>
    </analyzer>
    </analyzers>
    </configuration>


LowerCaseKeywordAnalyzer.java:

    package org.mycompany.lucene.analysis;

    import java.io.Reader;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Indexes the whole property value as a single, lowercased token.
    public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {

        public TokenStream tokenStream(String field, final Reader reader) {
            // KeywordAnalyzer emits the entire value as one token...
            TokenStream keywordTokenStream = super.tokenStream(field, reader);
            // ...which is then lowercased for case-insensitive matching.
            return new LowerCaseFilter(keywordTokenStream);
        }
    }
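
For anyone wondering why this analyzer makes a difference, here is a minimal plain-Java sketch of the idea. It does not use Lucene at all: the class TokenizationSketch and its methods are made up for illustration, and splittingTokens is only a rough stand-in for an analyzer that splits on hyphens and underscores (the real StandardAnalyzer has more rules than this).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only: hypothetical helpers, not Lucene or Jackrabbit API. */
final class TokenizationSketch {

    /** Rough stand-in for a splitting analyzer: lowercase, then split on
     *  every run of characters that are not letters or digits. */
    static List<String> splittingTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Stand-in for LowerCaseKeywordAnalyzer: the whole value becomes
     *  one lowercased token, hyphens and all. */
    static List<String> keywordTokens(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value.toLowerCase(Locale.ROOT));
        return tokens;
    }

    /** A prefix (yyy*) search can only match a single indexed token
     *  that starts with the prefix. */
    static boolean prefixMatches(List<String> tokens, String prefix) {
        String lowered = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(lowered)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "Sophie-Allen";
        System.out.println(splittingTokens(name));
        System.out.println(keywordTokens(name));
        System.out.println(prefixMatches(splittingTokens(name), "sophie-a"));
        System.out.println(prefixMatches(keywordTokens(name), "sophie-a"));
    }
}
```

With the splitting stand-in, no single token starts with "sophie-a", so the prefix search finds nothing. With the keyword stand-in the value stays in one piece and the same prefix matches.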


Our search class has a method which then does the following:

    public OurParameter[] getOurParameters(String searchTerm, String srchField) {
        // srchField in this case was fullName

        TransientRepository repository =
            new TransientRepository(OUR_REPO_CONFIG, OUR_REPO_LOCATION);
        Session session = repository.login();
        List<Class> classes = new ArrayList<Class>();
        classes.add(OurParameter.class);
        Mapper mapper = new AnnotationMapperImpl(classes);
        ObjectContentManager ocm = new ObjectContentManagerImpl(session, mapper);
        queryManager = ocm.getQueryManager();
        FilterImpl filter =
            (FilterImpl) queryManager.createFilter(OurParameter.class);
        filter.addContains(srchField,
            org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(searchTerm)
                .replaceAll("'", "''"));
        // (that last call replaces all single ticks with two ticks; I
        // honestly can't remember why though)
        Query query = queryManager.createQuery(filter);
        Collection<OurParameter> resultsCollection =
            (Collection<OurParameter>) ocm.getObjects(query);

        //convert to an array, do some other stuff, and return...

    }
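
On the tick doubling in the method above: as far as I can tell, it is there because the search term ends up inside a single-quoted string literal in the generated XPath, and a single quote inside such a literal is escaped by doubling it. A minimal plain-Java sketch of just that step follows. XPathLiteralSketch and its methods are hypothetical names for illustration, and the sketch deliberately leaves out the escapeIllegalXpathSearchChars call.

```java
/** Illustration only: hypothetical helpers, not Jackrabbit API. */
final class XPathLiteralSketch {

    /** Wraps a term in single quotes, doubling any single quote inside it
     *  so the literal is not terminated early. */
    static String quoteForXPath(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** Builds a jcr:contains predicate on one property. */
    static String containsPredicate(String property, String term) {
        return "jcr:contains(@" + property + ", " + quoteForXPath(term) + ")";
    }

    public static void main(String[] args) {
        System.out.println(containsPredicate("fullName", "O'Brien"));
    }
}
```

Without the doubling, a term such as O'Brien would close the literal after the O and leave the rest of the query malformed.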



On 08/26/2010 10:42 AM, Ard Schrijvers wrote:
> On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<wilsonh@randdss.com>  wrote:
>>   Ard,
>>
>> I have this same problem, however my scenario involves underscores rather
>> than hyphens. Although since Chris seems to be seeing the same exact
> It is because hyphens, just like underscores, are characters the Standard
> Lucene Analyzer splits on. This, combined with the query expansion that
> happens for wildcard searches in Lucene, causes your issues:
>
>> behavior as I was, I imagine we are both stuck on the same issue. After
>> scouring the forums for the solution, and not seeing your mentioned
>> solution, I actually posted my problem as detailed as possible here (
>> http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
>> jcr:like was not an option for me, in this case, as our client wanted the
>> option for case-insensitive searches. Is there any chance you could please
>> narrow down where-about the post was which already covered this? Thanks for
> I can't seem to find my post again. But, I'll give you a quite simple solution:
>
> If you want to have the normal indexing of the property for normal
> searching, but also want to have the yyy* option, you need to
> duplicate the value in a second property. If your property,
> like
>
> .North.South.East.WestLand
>
> is only needed for the one you describe with wildcard searching, you
> only need it once. Now, suppose, your property is called myProp.
>
> To your configuration.xml add:
>
> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>    <analyzers>
>          <analyzer
> class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>              <property>myProp</property>
>          </analyzer>
>    </analyzers>
> </configuration>
>
> Your LowerCaseKeywordAnalyzer is very simple: it extends
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
> and in the method
>
>   TokenStream tokenStream(String fieldName,Reader reader)
>
> after calling the super, you invoke Lucene's LowerCaseFilter.
>
> That is all (after you do a re-index of your repository). Since now a
> -, or _ or ~ or whatever is not seen as a token to split on, but you
> still use lowercase filter, you can do exactly what you want.
>
> Do the words need to be split on spaces however? No problem, just add
> a WhitespaceTokenizer from Lucene. It is actually pretty simple.
>
> Hope this helps,
>
> Regards Ard
>
>> your time.
>>
>> *H. Wilson*
>>
>>
>> On 08/26/2010 04:59 AM, Ard Schrijvers wrote:
>>> Hello,
>>>
>>> You can search the archives (mail from me) for wildcard searching
>>> things related below. There was someone having similar issues. I
>>> explained the wildcard difficulties. Take a look at jcr:like for your
>>> usecases
>>>
>>> Regards Ard
>>>
>>> On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
>>> <cdunstall@csu.edu.au>    wrote:
>>>> Hi all,
>>>>
>>>> I'm having some trouble with an XPath query, where I'm searching for
>>>> users with hyphens in their name.
>>>>
>>>> I'm using:
>>>> jcr:contains(*/*/*,'query')
>>>>
>>>> And it returns some odd results.
>>>>
>>>> I have two users, Sophie-Allen and Sophie-Anne. When I search for
>>>> 'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
>>>> (with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get zero
>>>> results returned.  Oddly, if I search for either 'sophie-allen' or
>>>> 'sophie-anne' I get the respective user details back fine. Shouldn't I get
>>>> both users back when escaping the hyphen? Have I missed something in the
>>>> spec?
>>>>
>>>> One other odd thing is the addition of an asterisk (*).  Searching for
>>>> 'soph' and 'soph*' return the same result (both users), but if I search for
>>>> 'sophie-allen*', I get zero results, unlike when searching for just
>>>> 'sophie-allen'. Searching for 'sophie-a*' has the same result as without
>>>> the asterisk, i.e. nothing.
>>>>
>>>> The JSR-170 spec doesn't say anything (that I can find) but is the
>>>> asterisk a wildcard in the jcr:contains function or does it serve some other
>>>> purpose?
>>>>
>>>> Your assistance is greatly appreciated,
>>>>
>>>> Regards,
>>>>
>>>> Chris Dunstall | Service Support - Applications
>>>> Technology Integration/OLE Virtual Team
>>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>>> NSW, Australia
>>>>
>>>> Ph: 02 63384818 | Fax: 02 63384181
>>>>
