lucene-java-user mailing list archives

From Valery <khame...@gmail.com>
Subject Any Tokenizator friendly to C++, C#, .NET, etc ?
Date Thu, 20 Aug 2009 14:28:08 GMT

Hi all, 

I am trying to tune Lucene so that it preserves tokens such as C++, C#, and
.NET.

The task is well known to the Lucene community, but surprisingly I can't
find any good information on it by searching.

Of course, I tried to re-use Lucene's building blocks for tokenizers. Here
we go:

  1) StandardTokenizer -- this option would be just fantastic, but "C++,
C#, .NET" ends up as "c c net". Too bad.

  2) WhitespaceTokenizer gives me a lot of lexemes that should actually
have been chopped into smaller pieces. Example: "C/C++" comes out as a
single lexeme. If I go this way, I end up with "tokenization of tokens" --
that sounds a bit odd, doesn't it?

  3) CharTokenizer allows me to make '/' a token-delimiting char as well,
but then '/' is immediately lost, just like the whitespace chars. As a
result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search the
original char stream for the "/" char to rebuild the term "SAP R/3" as a
whole.

Do you see any other relevant building blocks that I have missed?

Also, people around here have suggested that this problem should be solved
with a synonym dictionary. However, this hint sheds no light on which
tokenization strategy is more appropriate *before* the synonym step.
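
To make the dependency on the tokenizer concrete, here is a minimal
plain-Java sketch of a dictionary pass that re-joins adjacent tokens into a
known term. It does not use any Lucene class; the names KnownTermMerger,
KNOWN_TERMS and MAX_PARTS are hypothetical, and the dictionary contents are
just the examples from this message. It only illustrates that such a pass
can rebuild "R/3" if the "/" survived tokenization, and cannot if it was
dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of a dictionary pass over an already tokenized stream. */
class KnownTermMerger {
    // Hypothetical dictionary of terms that should end up as single tokens.
    private static final Set<String> KNOWN_TERMS = Set.of("R/3", ".NET");
    // Longest run of consecutive tokens that we try to join.
    private static final int MAX_PARTS = 3;

    static List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String emitted = tokens.get(i);
            int consumed = 1;
            // Try the longest candidate first, then shorter ones.
            for (int n = Math.min(MAX_PARTS, tokens.size() - i); n > 1; n--) {
                // Assumption: the joined tokens were adjacent in the original
                // text; this sketch has no offsets to verify that.
                String joined = String.join("", tokens.subList(i, i + n));
                if (KNOWN_TERMS.contains(joined)) {
                    emitted = joined;
                    consumed = n;
                    break;
                }
            }
            out.add(emitted);
            i += consumed;
        }
        return out;
    }
}
```

With the tokens "SAP" "R" "/" "3" this yields "SAP" "R/3"; with "SAP" "R" "3"
(the "/" already lost) nothing can be merged. A real implementation would
also have to check token offsets so that "R / 3" is not merged by accident.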

So it looks like I have to take the CharTokenizer class as a starting point
and write my own Tokenizer from scratch. This Tokenizer should also react to
delimiting characters and emit a token. However, it should distinguish
between delimiters such as whitespace and ";,?" and delimiters such as
"./&".

That is, delimiters such as whitespace and ";,?" should be discarded at the
lexeme level, whereas token-emitting characters such as "./&" should be kept
at the lexeme level.
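
The intended behaviour can be sketched in plain Java as follows. This is
only an illustration of the two delimiter classes, not a Lucene Tokenizer
subclass; the class name DelimiterAwareTokenizer and the constants DISCARDED
and KEPT are hypothetical, and the delimiter sets are simply the examples
given above.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch; does not use or extend any Lucene class. */
class DelimiterAwareTokenizer {
    // Hypothetical delimiter sets, taken from the examples in this message.
    private static final String DISCARDED = ";,?"; // dropped, like whitespace
    private static final String KEPT = "./&";      // emitted as own tokens

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean discarded = Character.isWhitespace(c) || DISCARDED.indexOf(c) >= 0;
            boolean kept = KEPT.indexOf(c) >= 0;
            if (discarded || kept) {
                // Either kind of delimiter ends the token collected so far.
                flush(current, tokens);
                if (kept) {
                    // A kept delimiter becomes a single-character token.
                    tokens.add(String.valueOf(c));
                }
            } else {
                // Everything else, including '+' and '#', stays in the token.
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Emit the pending token, if any, and reset the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Prints [SAP, R, /, 3, C, /, C++, C#, ., NET]
        System.out.println(tokenize("SAP R/3; C/C++, C#, .NET?"));
    }
}
```

Because "/" and "." stay in the token stream, a later step can still
rebuild "R/3" or ".NET" without going back to the original char stream.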

Your comments, gurus?

regards,
Valery
