lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Klaas <mike.kl...@gmail.com>
Subject Re: thoughts/suggestions for analyzing/tokenizing class names
Date Tue, 18 Dec 2007 04:53:54 GMT

On 17-Dec-07, at 11:39 AM, Beyer,Nathan wrote:

> Would using Field.Index.UN_TOKENIZED be the same as tokenizing a field
> into one token?

Indeed.

-Mike

>
> -----Original Message-----
> From: Mike Klaas [mailto:mike.klaas@gmail.com]
> Sent: Monday, December 17, 2007 12:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: thoughts/suggestions for analyzing/tokenizing class names
>
> Either index them as a series of tokens:
>
> org
> org.apache
> org.apache.lucene
> org.apache.lucene.document
> org.apache.lucene.document.Document
>
> or index them as a single token, and use prefix queries (this is what
> I do for reverse domain names):
>
> classname:(org.apache org.apache.*)
>
> Note that "classname:org.apache*" would probably be wrong--you might
> not want to match
>
> org.apache-fake.lucene.document
>
> regards,
> -Mike
>
> On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:
>
>> Good point.
>>
>> I don't want the sub-package names on their own to match.
>>
>> Text (class name)
>>  - "org.apache.lucene.document.Document"
>> Queries that would match
>>  - "org.apache", "org.apache.lucene.document"
>> Queries that DO NOT match
>>  - "apache", "lucene", "document"
>>
>> -Nathan
>>
>> -----Original Message-----
>> From: Mike Klaas [mailto:mike.klaas@gmail.com]
>> Sent: Monday, December 17, 2007 11:29 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: thoughts/suggestions for analyzing/tokenizing class  
>> names
>>
>> On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:
>>
>>> I have a few fields that use package names and class names and I've
>>> been
>>> looking for some suggestions for analyzing these fields.
>>>
>>> A few examples -
>>>
>>> Text (class name)
>>> - "org.apache.lucene.document.Document"
>>> Queries that would match
>>> - "org.apache" , "org.apache.lucene.document"
>>>
>>> Text (class name + method signature)
>>> -- "org.apache.lucene.document.Document#add(Fieldable)"
>>> Queries that would match
>>> -- "org.apache.lucene", "org.apache.lucene.document.Document#add"
>>>
>>> Any thoughts on how to approach tokenizing these types of texts?
>>
>> Perhaps it would help to include some examples of queries you _don't_
>> want to match.  For all the examples above, simply tokenizing
>> alphanumeric components would suffice.
>>
>> -Mike
>>
>> --------------------------------------------------------------------- 
>> -
>> CONFIDENTIALITY NOTICE This message and any included attachments
>> are from Cerner Corporation and are intended only for the
>> addressee. The information contained in this message is
>> confidential and may constitute inside or non-public information
>> under international, federal, or state securities laws.
>> Unauthorized forwarding, printing, copying, distribution, or use of
>> such information is strictly prohibited and may be unlawful. If you
>> are not the addressee, please promptly delete this message and
>> notify the sender of the delivery error by e-mail or you may call
>> Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1)
>> (816)221-1024.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ----------------------------------------------------------------------
> CONFIDENTIALITY NOTICE This message and any included attachments  
> are from Cerner Corporation and are intended only for the  
> addressee. The information contained in this message is  
> confidential and may constitute inside or non-public information  
> under international, federal, or state securities laws.  
> Unauthorized forwarding, printing, copying, distribution, or use of  
> such information is strictly prohibited and may be unlawful. If you  
> are not the addressee, please promptly delete this message and  
> notify the sender of the delivery error by e-mail or you may call  
> Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1)  
> (816)221-1024.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message