lucene-java-user mailing list archives

From Mike Klaas <mike.kl...@gmail.com>
Subject Re: thoughts/suggestions for analyzing/tokenizing class names
Date Mon, 17 Dec 2007 18:52:48 GMT
Either index each class name as a series of prefix tokens:

org
org.apache
org.apache.lucene
org.apache.lucene.document
org.apache.lucene.document.Document
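
A minimal sketch of how that token series could be produced, in plain Java with no Lucene dependency. The class and method names (ClassNamePrefixes, prefixTokens) are hypothetical, and the sketch splits on '.' only; the '#' in a method signature is not handled. It assumes each resulting token would then be indexed as-is (not analyzed further) in the same field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class ClassNamePrefixes {

    // Returns every dot-delimited prefix of a fully qualified name,
    // shortest first, ending with the full name itself.
    static List<String> prefixTokens(String qualifiedName) {
        List<String> tokens = new ArrayList<>();
        int dot = qualifiedName.indexOf('.');
        while (dot >= 0) {
            tokens.add(qualifiedName.substring(0, dot));
            dot = qualifiedName.indexOf('.', dot + 1);
        }
        tokens.add(qualifiedName);
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one prefix token per line.
        for (String token : prefixTokens("org.apache.lucene.document.Document")) {
            System.out.println(token);
        }
    }
}
```

With tokens like these, a query for "org.apache" would be a plain exact-term match, while "apache" or "lucene" alone would match nothing because no such token is ever emitted.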

or index them as a single token, and use prefix queries (this is what  
I do for reverse domain names):

classname:(org.apache org.apache.*)

Note that "classname:org.apache*" would probably be wrong, because a bare
prefix also matches names you likely don't want, such as

org.apache-fake.lucene.document
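
To make the difference concrete, here is a rough sketch in plain Java that emulates the term-level matching of the two query forms against a single indexed token. It does not call Lucene; the names PackagePrefixMatch, matches, matchesBarePrefix and toQuery are hypothetical, and toQuery assumes the query parser's default operator is OR.

```java
// Hypothetical helper that mimics the matching behaviour, not Lucene itself.
class PackagePrefixMatch {

    // Emulates classname:(pkg pkg.*): the exact term, or a prefix
    // that ends at a '.' boundary.
    static boolean matches(String indexedName, String pkg) {
        return indexedName.equals(pkg) || indexedName.startsWith(pkg + ".");
    }

    // Emulates classname:pkg* : a bare prefix with no boundary check.
    static boolean matchesBarePrefix(String indexedName, String pkg) {
        return indexedName.startsWith(pkg);
    }

    // Builds the two-clause query string for a given field and package.
    static String toQuery(String field, String pkg) {
        return field + ":(" + pkg + " " + pkg + ".*)";
    }

    public static void main(String[] args) {
        String good = "org.apache.lucene.document.Document";
        String fake = "org.apache-fake.lucene.document";
        System.out.println(toQuery("classname", "org.apache"));
        System.out.println(matches(good, "org.apache"));           // true
        System.out.println(matches(fake, "org.apache"));           // false
        System.out.println(matchesBarePrefix(fake, "org.apache")); // true
    }
}
```

The trailing '.' in the second clause is what keeps the prefix anchored to a package boundary; the first clause covers a token that is exactly the package name.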

regards,
-Mike

On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:

> Good point.
>
> I don't want the sub-package names on their own to match.
>
> Text (class name)
>  - "org.apache.lucene.document.Document"
> Queries that would match
>  - "org.apache", "org.apache.lucene.document"
> Queries that DO NOT match
>  - "apache", "lucene", "document"
>
> -Nathan
>
> -----Original Message-----
> From: Mike Klaas [mailto:mike.klaas@gmail.com]
> Sent: Monday, December 17, 2007 11:29 AM
> To: java-user@lucene.apache.org
> Subject: Re: thoughts/suggestions for analyzing/tokenizing class names
>
> On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:
>
>> I have a few fields that use package names and class names and I've been
>> looking for some suggestions for analyzing these fields.
>>
>> A few examples -
>>
>> Text (class name)
>> - "org.apache.lucene.document.Document"
>> Queries that would match
>> - "org.apache" , "org.apache.lucene.document"
>>
>> Text (class name + method signature)
>> -- "org.apache.lucene.document.Document#add(Fieldable)"
>> Queries that would match
>> -- "org.apache.lucene", "org.apache.lucene.document.Document#add"
>>
>> Any thoughts on how to approach tokenizing these types of texts?
>
> Perhaps it would help to include some examples of queries you _don't_
> want to match.  For all the examples above, simply tokenizing
> alphanumeric components would suffice.
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



