lucene-java-user mailing list archives

From Valery <>
Subject Re: Any Tokenizator friendly to C++, C#, .NET, etc ?
Date Fri, 21 Aug 2009 10:51:29 GMT

Hi John, 

(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)

John Byrne-3 wrote:
> I'm inclined to disagree with the idea that a token should not be split 
> again downstream. I think that is actually a much easier way to handle 
> it. I would have the tokenizer return the longest match, and then split 
> it in a token filter. In fact I have done this before and it has worked 
> fine for me.

Well, I could soften my position: if the token re-parsing is done by looking
at the current lexeme's text only, then it might perhaps be acceptable. In
contrast, if during re-parsing you have to look back at the upstream
character data "several filters backwards", then, IMHO, it is rather messy
and unacceptable. 
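To make the acceptable variant concrete, here is a minimal sketch in plain Java (the class name `TokenResplitter` and the splitting rule are my own illustration, not real Lucene API). It mimics what a downstream token filter could do: re-split a longest-match token using ONLY the current lexeme's text, with no look-back at upstream character data. Dotted names like "java.util.List" are split, while lexemes such as "C++" or "C#" stay whole because '+' and '#' belong to the lexeme itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch -- not the Lucene TokenFilter API.
// The split decision depends only on the current lexeme's characters.
public class TokenResplitter {

    // Split a longest-match lexeme into sub-tokens on '.'.
    // Lexemes without a '.' (e.g. "C++", "C#") come back unchanged.
    public static List<String> split(String lexeme) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < lexeme.length(); i++) {
            if (lexeme.charAt(i) == '.') {
                if (i > start) out.add(lexeme.substring(start, i));
                start = i + 1;
            }
        }
        if (start < lexeme.length()) out.add(lexeme.substring(start));
        return out;
    }
}
```

In a real filter the same logic would sit inside incrementToken(), emitting the buffered sub-tokens one per call; the point here is only that nothing beyond the current lexeme is consulted.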

Regarding this part:

John Byrne-3 wrote:
> I think you will have to maintain some state within the token filter 
> [...]

I would wait for Simon's answer to the question "What do you expect from the
[...]".
Then I will give my 2 cents on this, and perhaps afterwards I can sum up all
opinions and draw a common conclusion.

