lucene-solr-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?
Date Tue, 14 Sep 2010 21:48:29 GMT
Jonathan, you bring up an excellent point.

I think it's worth our time to actually benchmark this LowerCaseTokenizer
versus LetterTokenizer + LowerCaseFilter.

This tokenizer is quite old. I have no doubt that it's technically faster
than LetterTokenizer + LowerCaseFilter even today (it only has to go through
the char[] a single time), but I do doubt that this brings any real value
these days...
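
To make the "single time through the char[]" point concrete, here is a minimal plain-Java sketch. It is not Lucene's code: the class and method names are made up for illustration, and it assumes text limited to the Basic Multilingual Plane (no surrogate pairs). The first method splits at non-letters and lowercases in one pass, the second splits first and lowercases afterwards, and both should yield the same tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only (hypothetical names); this is not Lucene's implementation. */
final class SinglePassSketch {

    /** One pass: split at non-letters and lowercase each letter as it is read. */
    static List<String> lowerCaseTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Two passes: split at non-letters first, then lowercase every token. */
    static List<String> letterTokenizeThenLowerCase(String text) {
        // Pass 1: collect runs of letters, case untouched.
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        // Pass 2: walk every token again, character by character, to lowercase it.
        List<String> lowered = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder sb = new StringBuilder(token.length());
            for (int i = 0; i < token.length(); i++) {
                sb.append(Character.toLowerCase(token.charAt(i)));
            }
            lowered.add(sb.toString());
        }
        return lowered;
    }

    public static void main(String[] args) {
        String text = "Wi-Fi 802.11n Access_Point";
        System.out.println(lowerCaseTokenize(text));           // [wi, fi, n, access, point]
        System.out.println(letterTokenizeThenLowerCase(text)); // [wi, fi, n, access, point]
    }
}
```

The only difference is the second walk over the token characters, which is why a benchmark is needed to say whether the saving matters in practice.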


On Tue, Sep 14, 2010 at 5:23 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:

> Why would you want to do that, instead of just using another tokenizer and
> a LowerCaseFilter? It's more confusing, less DRY code to leave them separate
> -- the LowerCaseTokenizerFactory combines them anyway because someone decided
> it was such a common use case that it was worth it for the demonstrated
> performance advantage. (At least I hope that's what happened, otherwise
> there's no excuse for it!).
>
> Do you know that you get a worthwhile performance benefit for what you're
> doing? If not, why do it?
>
> Jonathan
>
>
> Scott Gonyea wrote:
>
>> I went for a different route:
>>
>> https://issues.apache.org/jira/browse/LUCENE-2644
>>
>> Scott
>>
>> On Tue, Sep 14, 2010 at 11:18 AM, Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> On Tue, Sep 14, 2010 at 1:54 PM, Scott Gonyea <scott@aitrus.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm tweaking my schema and the LowerCaseTokenizerFactory doesn't create
>>>> tokens, based solely on lower-casing characters.  Is there a way to tell it
>>>> NOT to drop non-characters?  It's amazingly frustrating that the
>>>> TokenizerFactory and the FilterFactory have two entirely different modes of
>>>> behavior.  If I wanted it to tokenize based on non-lower case
>>>> characters.... wouldn't I use, say, LetterTokenizerFactory and tack on the
>>>> LowerCaseFilterFactory?  Or any number of combinations that would otherwise
>>>> achieve that specific end-result?
>>>>
>>> I don't think you should use LowerCaseTokenizerFactory if you don't want to
>>> divide text on non-letters; it's intended to do just that.
>>>
>>> From the javadocs:
>>> LowerCaseTokenizer performs the function of LetterTokenizer and
>>> LowerCaseFilter together. It divides text at non-letters and converts them
>>> to lower case. While it is functionally equivalent to the combination of
>>> LetterTokenizer and LowerCaseFilter, there is a performance advantage to
>>> doing the two tasks at once, hence this (redundant) implementation.
>>>
>>>> So... Is there a way for me to tell it to NOT split based on
>>>> non-characters?
>>>>
>>> Use a different tokenizer that doesn't split on non-characters, followed by
>>> a LowerCaseFilter.
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>
>
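
For the earlier advice in the quoted thread (a tokenizer that does not split on non-letters, followed by a LowerCaseFilter), a rough plain-Java sketch of the resulting behaviour might look like this. It is an illustration, not Lucene or Solr code: the class and method names are hypothetical, and splitting on whitespace is just one assumed choice of "different tokenizer" (in a Solr schema that would correspond to something like WhitespaceTokenizerFactory followed by LowerCaseFilterFactory).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustration only (hypothetical names); this is not Lucene or Solr code. */
final class KeepNonLettersSketch {

    /** Split on whitespace only, then lowercase; digits and punctuation survive. */
    static List<String> whitespaceTokenizeThenLowerCase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // splitting an empty string yields one empty piece
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [wi-fi, 802.11n, access_point]
        System.out.println(whitespaceTokenizeThenLowerCase("Wi-Fi 802.11n Access_Point"));
    }
}
```

Compared with splitting at non-letters, the hyphen, the digits, and the underscore are all kept inside their tokens here.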


-- 
Robert Muir
rcmuir@gmail.com
