jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Thu, 26 Aug 2010 14:42:14 GMT
On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson <wilsonh@randdss.com> wrote:
>  Ard,
>
> I have this same problem, however my scenario involves underscores rather
> than hyphens. Although since Chris seems to be seeing the same exact

That is because hyphens, just like underscores, are characters that
Lucene's StandardAnalyzer splits on. Combined with the query expansion
that Lucene performs for wildcard searches, this causes your issues.
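
As a rough illustration of the effect, here is a simplified stand-in
written in plain Java. It is not Lucene's actual code: the class and
method names are made up, and it assumes that the wildcard term is
compared against the indexed terms without being split itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizationSketch {

    // Rough stand-in for a standard-style analyzer (an assumption, not
    // Lucene's real rules): lowercase the value and split it on every
    // character that is not a letter or a digit, e.g. '-', '_' or '.'.
    static List<String> splitTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Keyword-style analysis: the whole value becomes one lowercase token.
    static List<String> keywordToken(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value.toLowerCase(Locale.ROOT));
        return tokens;
    }

    // A prefix query such as sophie-a* only matches if some indexed term
    // starts with the prefix exactly as typed.
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "Sophie-Allen";
        System.out.println(splitTokens(value));
        System.out.println(prefixMatches(splitTokens(value), "sophie-a"));
        System.out.println(keywordToken(value));
        System.out.println(prefixMatches(keywordToken(value), "sophie-a"));
    }
}
```

With the splitting variant no single indexed term contains the hyphen,
so a prefix that includes it has nothing to match against.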

> behavior as I was, I imagine we are both stuck on the same issue. After
> scouring the forums for the solution, and not seeing your mentioned
> solution, I actually posted my problem as detailed as possible here (
> http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
> jcr:like was not an option for me, in this case, as our client wanted the
> option for case-insensitive searches. Is there any chance you could please
> narrow down where-about the post was which already covered this? Thanks for

I can't seem to find my post again. But, I'll give you a quite simple solution:

If you want to keep the normal indexing of the property for normal
searching, but also want the yyy* option, you need to duplicate the
value into a second property. If your property,
like

.North.South.East.WestLand

is only needed for the wildcard searching you describe, you only need
it once. Now suppose your property is called myProp.

To your configuration.xml add:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
        <analyzer class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
            <property>myProp</property>
        </analyzer>
  </analyzers>
</configuration>

Your LowerCaseKeywordAnalyzer is very simple: it extends
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
and in the method

 TokenStream tokenStream(String fieldName,Reader reader)

you call super and wrap the returned stream in Lucene's LowerCaseFilter.
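
A minimal sketch of such an analyzer, assuming the Lucene 2.4 API linked
above is on the classpath and reusing the package and class name from
the configuration snippet. It has not been tested against a particular
Jackrabbit setup.

```java
package org.mycompany.lucene.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Emits the whole property value as a single token (no splitting on
 * '-', '_', '.' and so on) and lowercases that token.
 */
public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordAnalyzer returns the complete input as one token;
        // LowerCaseFilter then lowercases it for case-insensitive matching.
        return new LowerCaseFilter(super.tokenStream(fieldName, reader));
    }

    // Assumption: the caller goes through tokenStream(). Code that calls
    // reusableTokenStream() directly may need the same wrapping there.
}
```

Since the indexed term is lowercase, the search term has to be lowercase
as well for a wildcard match.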

That is all (after you re-index your repository). Since -, _, ~ and so
on are no longer treated as characters to split on, while the lowercase
filter still applies, you can do exactly what you want.

Do the words need to be split on spaces, however? No problem, just add
Lucene's WhitespaceTokenizer. It is actually pretty simple.

Hope this helps,

Regards Ard

> your time.
>
> *H. Wilson*
>
>
> On 08/26/2010 04:59 AM, Ard Schrijvers wrote:
>>
>> Hello,
>>
>> You can search the archives (mail from me) for wildcard searching
>> things related below. There was someone having similar issues. I
>> explained the wildcard difficulties. Take a look at jcr:like for your
>> usecases
>>
>> Regards Ard
>>
>> On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
>> <cdunstall@csu.edu.au>  wrote:
>>>
>>> Hi all,
>>>
>>> I'm having some trouble with an XPath query, where I'm searching for
>>> users with hyphens in their name.
>>>
>>> I'm using:
>>> jcr:contains(*/*/*,'query')
>>>
>>> And it returns some odd results.
>>>
>>> I have two users, Sophie-Allen and Sophie-Anne. When I search for
>>> 'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
>>> (with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get zero
>>> results returned.  Oddly, if I search for either 'sophie-allen' or
>>> 'sophie-anne' I get the respective user details back fine. Shouldn't I get
>>> both users back when escaping the hyphen? Have I missed something in the
>>> spec?
>>>
>>> One other odd thing is the addition of an asterisk (*).  Searching for
>>> 'soph' and 'soph*' return the same result (both users), but if I search for
>>> 'sophie-allen*', I get zero results, unlike when searching for just
>>> 'sophie-allen'. Searching for 'sophie-a*' has the same result as without the
>>> asterisk, i.e. nothing.
>>>
>>> The JSR-170 spec doesn't say anything (that I can find) but is the
>>> asterisk a wildcard in the jcr:contains function or does it serve some other
>>> purpose?
>>>
>>> Your assistance is greatly appreciated,
>>>
>>> Regards,
>>>
>>> Chris Dunstall | Service Support - Applications
>>> Technology Integration/OLE Virtual Team
>>> Division of Information Technology | Charles Sturt University | Bathurst,
>>> NSW, Australia
>>>
>>> Ph: 02 63384818 | Fax: 02 63384181
>>>
>
