jackrabbit-users mailing list archives

From "H. Wilson" <wils...@randdss.com>
Subject Re: Problems with hyphen in JSR-170 XPath query using jcr:contains
Date Thu, 26 Aug 2010 20:22:42 GMT

On 08/26/2010 12:57 PM, Ard Schrijvers wrote:
> Hello Wilson et al,
> In that case, sorry for my late help. I am not always in a position to
> take time to help. Also, query expansion with wildcard searching is
> imo not Lucene's best part. Anyway, for those interested, I could try
> to dig up some mails I sent internally in the past. It is something
> that is hard to grasp without having some Lucene background, though.
No need to apologize. I was tempted to bump it after a month, but I 
wasn't sure if that violated forum etiquette. I hope the OP today is 
getting as much out of this as I am!
> Yes, this is how I meant it, with the analyser part.
> I meant that you would need this *only* if you also want the
> original 'free text indexing' of the property. Thus, if you would like
> to index some property with the original Jackrabbit indexing but
> also want a keyword-like one, you need the property twice... but
> normally, you don't need this.
> You're welcome.
>
> Thank you for reporting back that it works.
>
> Regards Ard
OK, so it looks like I have one other issue. I am using the configuration 
posted below and sticking to my previous examples, with the addition of 
one containing whitespace. We have the following three in our repository:

    .North.South.East.WestLand
    .North.South.East.West_Land
    .North.South.East.West Land    //yes that's a space

...using jcr:contains with an exact name search and NO wildcards, the 
first two are returned properly, but the last one yields no result.

    filter.addContains(@fullName, '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(
        ".North.South.East.West Land") +"'));

According to the Lucene documentation, KeywordAnalyzer should create a 
single token. Combined with escaping the illegal characters 
(i.e. spaces), shouldn't this search work? Thanks again.
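
To make the question concrete, here is a minimal sketch in plain Java (no Lucene or Jackrabbit calls) of one way this search could fail. It assumes, without confirmation from this thread, that the query text is split on whitespace before each piece is analyzed, while indexing with a keyword-style analyzer keeps the whole value as one lower-cased token. The class and method names are hypothetical and only imitate that behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical simulation only; this is not the Lucene or Jackrabbit API.
class KeywordTokenSketch {

    // Imitates a lower-casing keyword analyzer: the whole value is one token.
    static List<String> indexTokens(String value) {
        return Collections.singletonList(value.toLowerCase(Locale.ROOT));
    }

    // ASSUMPTION: the query parser splits the search text on whitespace
    // first, then analyzes each piece separately.
    static List<String> queryTerms(String searchText) {
        List<String> terms = new ArrayList<>();
        for (String piece : searchText.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.addAll(indexTokens(piece));
            }
        }
        return terms;
    }

    // A stored value matches only if every query term equals an indexed token.
    static boolean matches(String indexedValue, String searchText) {
        return indexTokens(indexedValue).containsAll(queryTerms(searchText));
    }

    public static void main(String[] args) {
        String[] names = {
            ".North.South.East.WestLand",
            ".North.South.East.West_Land",
            ".North.South.East.West Land"
        };
        for (String name : names) {
            // Search each stored name with its own exact text.
            System.out.println(name + " -> " + matches(name, name));
        }
    }
}
```

Under these assumptions the first two names match themselves, while the name with the space is searched as two terms (".north.south.east.west" and "land"), neither of which equals the single indexed token. Whether Jackrabbit actually behaves this way for jcr:contains would need to be checked against the real query parser.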

H. Wilson
>> H. Wilson
>>
>> repository.xml (modified both SearchIndex tags to include an
>> indexingConfiguration):
>>
>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>
>> ....
>> <param name="indexingConfiguration"
>> value="${rep.home}/indexing_configuration.xml"/>
>>
>> </SearchIndex>
>>
>> indexing_configuration.xml:
>>
>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>      <analyzers>
>>          <analyzer
>> class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>              <property>fullName</property>
>>          </analyzer>
>>      </analyzers>
>> </configuration>
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>> package org.mycompany.lucene.analysis;
>>      import java.io.Reader;
>>      import org.apache.lucene.analysis.KeywordAnalyzer;
>>      import org.apache.lucene.analysis.LowerCaseFilter;
>>      import org.apache.lucene.analysis.TokenStream;
>>
>> public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {
>>
>>      public TokenStream tokenStream ( String field, final Reader reader  ) {
>>          TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>          return ( new LowerCaseFilter ( keywordTokenStream ) );
>>      }
>> }
>>
>> Our search class has a method which then does the following:
>>
>> public OurParameter[] getOurParameters (String searchTerm, String srchField
>> ) { //srchField in this case was fullName
>>
>> TransientRepository repository = new TransientRepository ( OUR_REPO_CONFIG,
>> OUR_REPO_LOCATION);
>> Session session = repository.login ();
>> List<Class>  classes = new ArrayList<Class>();
>> classes.add (OurParameter.class);
>> Mapper mapper = new AnnotationMapperImpl (classes);
>> ObjectContentManager ocm = new ObjectContentManagerImpl (session, mapper);
>> queryManager = ocm.getQueryManager();
>> FilterImpl filter = (FilterImpl)queryManager.createFilter
>> (OurParameter.class);
>> filter.addContains ( srchField,
>> org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(searchTerm).replaceAll
>> ("'","''"));
>> // (that last call replaces every single tick with two ticks; I honestly
>> // can't remember why, though)
>> Query query = queryManager.createQuery (filter);
>> Collection<OurParameter>  resultsCollection =
>> (Collection<OurParameter>)ocm.getObjects(query);
>>
>> //convert to an array, do some other stuff, and return...
>>
>> }
>>
>>
>> On 08/26/2010 10:42 AM, Ard Schrijvers wrote:
>>
>> On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<wilsonh@randdss.com>  wrote:
>>
>>   Ard,
>>
>> I have this same problem, however my scenario involves underscores rather
>> than hyphens. Although since Chris seems to be seeing the same exact
>>
>> It is because hyphens, just like underscores, are characters the standard
>> Lucene analyzer splits on. This, combined with the query expansion that
>> happens for wildcard searches in Lucene, causes your issues:
>>
>> behavior as I was, I imagine we are both stuck on the same issue. After
>> scouring the forums for the solution, and not seeing your mentioned
>> solution, I actually posted my problem as detailed as possible here (
>> http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
>> jcr:like was not an option for me, in this case, as our client wanted the
>> option for case-insensitive searches. Is there any chance you could please
>> narrow down where-about the post was which already covered this? Thanks for
>>
>> I can't seem to find my post again. But, I'll give you a quite simple
>> solution:
>>
>> If you want to keep the normal indexing of the property for normal
>> searching, but also want to have the yyy* option, you need to
>> duplicate the property in another property. If your property,
>> like
>>
>> .North.South.East.WestLand
>>
>> is only needed for the wildcard searching you describe, you
>> only need it once. Now suppose your property is called myProp.
>>
>> To your indexing configuration file add:
>>
>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>    <analyzers>
>>          <analyzer
>> class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>              <property>myProp</property>
>>          </analyzer>
>>    </analyzers>
>> </configuration>
>>
>> Your LowerCaseKeywordAnalyzer is very simple: it extends
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
>> and in the method
>>
>>   TokenStream tokenStream(String fieldName,Reader reader)
>>
>> after calling the super, you invoke Lucene's LowerCaseFilter.
>>
>> That is all (after you re-index your repository). Since a -, _, ~
>> or whatever is now no longer seen as a character to split on, but you
>> still use the lowercase filter, you can do exactly what you want.
>>
>> Do the words need to be split on spaces, however? No problem, just add
>> a WhitespaceTokenizer from Lucene. It is actually pretty simple.
>>
>> Hope this helps,
>>
>> Regards Ard
>>
>> your time.
>>
>> *H. Wilson*
>>
>>
>> On 08/26/2010 04:59 AM, Ard Schrijvers wrote:
>>
>> Hello,
>>
>> You can search the archives (mails from me) for wildcard-searching
>> issues related to the one below. There was someone having similar issues. I
>> explained the wildcard difficulties. Take a look at jcr:like for your
>> use cases.
>>
>> Regards Ard
>>
>> On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
>> <cdunstall@csu.edu.au>    wrote:
>>
>> Hi all,
>>
>> I'm having some trouble with an XPath query, where I'm searching for
>> users with hyphens in their name.
>>
>> I'm using:
>> jcr:contains(*/*/*,'query')
>>
>> And it returns some odd results.
>>
>> I have two users, Sophie-Allen and Sophie-Anne. When I search for
>> 'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
>> (with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get zero
>> results returned.  Oddly, if I search for either 'sophie-allen' or
>> 'sophie-anne' I get the respective user details back fine. Shouldn't I get
>> both users back when escaping the hyphen? Have I missed something in the
>> spec?
>>
>> One other odd thing is the addition of an asterisk (*).  Searching for
>> 'soph' and 'soph*' return the same result (both users), but if I search for
>> 'sophie-allen*', I get zero results, unlike when searching for just
>> 'sophie-allen'. Searching for 'sophie-a*' has the same result as without the
>> asterisk, i.e. nothing.
>>
>> The JSR-170 spec doesn't say anything (that I can find) but is the
>> asterisk a wildcard in the jcr:contains function or does it serve some other
>> purpose?
>>
>> Your assistance is greatly appreciated,
>>
>> Regards,
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW, Australia
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
