lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Any Tokenizator friendly to C++, C#, .NET, etc ?
Date Fri, 21 Aug 2009 12:43:10 GMT
On Fri, Aug 21, 2009 at 2:18 PM, Valery<khamenya@gmail.com> wrote:
>
>
> Simon Willnauer wrote:
>>
>> I already responded... again...
>>
> sorry, I was in the middle of answering and saw your post right after sending.
>
>
> Simon Willnauer wrote:
>>
>> Tokenizer splits the input stream into tokens (Token.java) and
>> TokenFilter subclasses operate on those. I expect from a Tokenizer
>> that it provides me a stream of tokens :) - how those tokens are
>> created is the responsibility of the Tokenizer.
>
> According to your requirements:
>
>  * one programmer will write a simplistic Tokenizer that converts the whole
> char input into 1 huge token.
>
>  * another programmer will write a simplistic Tokenizer that converts each
> single char of the input into a 1-char token.  It will end up in a huge
> number of 1-char tokens.
>
> Moreover, both claim the job is done in a brilliant way, because the
> Tokenizer is based on a 1-line statement in Java...
>
> Who did the work better?
>
> That said, I'd love to hear more specific requirements for a Tokenizer to
> avoid the above odd deliveries :)
The answer is again "it depends". If you need two tokenizers, one
creating tokens by dividing at non-letters and another one dividing at
whitespace, then a Tokenizer that outputs every single char is a good
superclass for those two.
See LetterTokenizer / WhitespaceTokenizer and their common superclass
CharTokenizer.
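
A minimal plain-Java sketch of that superclass idea, not Lucene's actual
code: the class names SimpleCharTokenizer, SimpleLetterTokenizer and
SimpleWhitespaceTokenizer and the tokenize(String) method are made up for
illustration. The base class walks the input char by char, and the
subclasses only decide which chars belong to a token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a char-based tokenizer superclass.
// It is NOT Lucene's CharTokenizer; it only illustrates the design.
abstract class SimpleCharTokenizer {

    // Subclasses decide whether a char is part of a token.
    protected abstract boolean isTokenChar(char c);

    // Collects maximal runs of token chars; every other char is a separator.
    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush the last token if the input did not end with a separator.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Divides at non-letters: only letters are token chars.
class SimpleLetterTokenizer extends SimpleCharTokenizer {
    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}

// Divides at whitespace: everything except whitespace is a token char.
class SimpleWhitespaceTokenizer extends SimpleCharTokenizer {
    @Override
    protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }
}
```

With this sketch, the whitespace variant would keep "C++" and "C#" as whole
tokens, while the letter variant would reduce both to "C".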

Asking who did a better job is not valid without specifying the
requirements. Anyway, does WhitespaceTokenizer solve your problem?!
As Robert said, have a look at the smartcn stuff; this is the other
extreme. It always depends.

simon
>
> regards
> Valery
>
> --
> View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25078755.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
