lucene-java-user mailing list archives

From Valery <khame...@gmail.com>
Subject Re: Any Tokenizator friendly to C++, C#, .NET, etc ?
Date Fri, 21 Aug 2009 10:51:29 GMT

Hi John, 

(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)


John Byrne-3 wrote:
> 
> I'm inclined to disagree with the idea that a token should not be split 
> again downstream. I think that is actually a much easier way to handle 
> it. I would have the tokenizer return the longest match, and then split 
> it in a token filter. In fact I have done this before and it has worked 
> fine for me.
> 

Well, I could soften my position: if the token re-parsing is done by looking
only at the current lexeme's value, then it might be acceptable. In
contrast, if during your re-parsing you have to look into the upstream
character data "several filters back", then, IMHO, it is rather messy
and unacceptable.
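
To make the distinction concrete, here is a minimal sketch of the "acceptable" variant: a first stage that returns the longest match, and a downstream stage that re-splits each token by looking only at that token's own text. It is plain Python, not the Lucene API; the function names, the PROTECTED set and the punctuation rules are all assumptions chosen for illustration.

```python
import re
from typing import Iterator, List

# Hypothetical list of terms that must survive intact (an assumption,
# not something taken from the thread or from Lucene).
PROTECTED = {"c++", "c#", ".net"}


def longest_match_tokenize(text: str) -> Iterator[str]:
    """Stage 1: emit the longest runs of non-whitespace characters."""
    for match in re.finditer(r"\S+", text):
        yield match.group()


def split_token(token: str) -> List[str]:
    """Stage 2: re-split one token using only the token's own value.

    No upstream character data is consulted here, which is the
    condition described in the paragraph above.
    """
    # Drop surrounding sentence punctuation; a leading "." is kept so
    # that ".NET" is still recognisable.
    core = token.lstrip("(").rstrip(".,;:!?)")
    if core.lower() in PROTECTED:
        return [core]
    # Otherwise fall back to plain alphanumeric pieces.
    return re.findall(r"[A-Za-z0-9]+", core)


def analyze(text: str) -> List[str]:
    """Run both stages and collect the resulting terms."""
    terms: List[str] = []
    for token in longest_match_tokenize(text):
        terms.extend(split_token(token))
    return terms


if __name__ == "__main__":
    print(analyze("I use C++, C# and .NET (not state-of-the-art)."))
```

Because split_token receives nothing but the token string, it can be tested and replaced on its own; a filter that needed the original character stream would not have that property.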


Regarding this part:

John Byrne-3 wrote:
> 
> I think you will have to maintain some state within the token filter 
> [...]
> 

I would wait for Simon's answer to the question "What do you expect from the
Tokenizer?"

Then I will give my two cents on this, and perhaps then I could sum up all
opinions and we could settle on a common conclusion.
:)

regards
Valery
