lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Any Tokenizator friendly to C++, C#, .NET, etc ?
Date Thu, 20 Aug 2009 16:46:03 GMT
Hi Valery,

 From our experience at Krugle, we wound up having to create our own  
tokenizers (actually kind of  specialized parser) for the different  
languages. It didn't seem like a good option to try to twist one of  
the existing tokenizers into something that would work well enough. We  
wound up using ANTLR for this.

-- Ken


On Aug 20, 2009, at 8:09am, Valery wrote:

>
> Hi Robert,
>
> thanks for the hint.
>
> Indeed, a natural way to go. Especially if one builds a Tokenizer of  
> the
> level of quality like StandardTokenizer's.
>
> OTOH, you mean that the out-of-the-box stuff is indeed not  
> customizable for
> this task?..
>
> regards
> Valery
>
>
>
> Robert Muir wrote:
>>
>> Valery,
>>
>> One thing you could try would be to create a JFlex-based tokenizer,
>> specifying a grammar with the rules you want.
>> You could use the source code & grammar of StandardTokenizer as a
>> starting point.
>>
>>
>> On Thu, Aug 20, 2009 at 10:28 AM, Valery<khamenya@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET
>>>
>>> The task is known for Lucene community, but surprisingly I can't  
>>> google
>>> out
>>> somewhat good info on it.
>>>
>>> Of course, I tried to re-use Lucene's  building blocks for  
>>> Tokenizer.
>>> Here
>>> we go:
>>>
>>>  1) StandardTokenizer -- oh, this option would be just fantastic,  
>>> but
>>> "C++,
>>> C#, .NET" ends up with "c c net". Too bad.
>>>
>>>  2) WhitespaceTokenizer gives me a lot of lexems that are actually  
>>> should
>>> have been chopped into smaller pieces. Example: "C/C++" comes out  
>>> like a
>>> single lexem. If I follow this way I end-up with "Tokenization of  
>>> tokens"
>>> --
>>> that sounds a bit odd, doesn't it?
>>>
>>>  3) CharTokenizer allows me to add the '/' to be also a token- 
>>> emitting
>>> char, but then '/' gets immediately lost like those whitespace  
>>> chars. In
>>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to  
>>> search
>>> the
>>> original char stream for the "/" char to re-build "SAP R/3" term  
>>> as a
>>> whole.
>>>
>>> Do you see any other relevant building blocks missed by me?
>>>
>>> Also, people around there have meant that such problem should be  
>>> solved
>>> by a
>>> synonym dictionary. However this hint sheds no light on which
>>> tokenization
>>> strategy should be more appropriate *before* the synonym step.
>>>
>>> So, it looks like I have to take the class CharTokenizer as for the
>>> starting
>>> point and write anew my own Tokenizer. This Tokenizer should also  
>>> react
>>> on
>>> delimiting characters and emit the token. However, it should  
>>> distinguish
>>> between delimiters like whitespaces along with ";,?" and the  
>>> delimiters
>>> like
>>> "./&".
>>>
>>> Indeed, the delimiters like whitespaces and ";,?" should be thrown  
>>> away
>>> from
>>> Lexem level,
>>> whereas the token emitting characters like "./&" should be kept in  
>>> Lexem
>>> level.
>>>
>>> Your comments, gurus?
>>>
>>> regards,
>>> Valery
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
>>> Sent from the Lucene - Java Users mailing list archive at  
>>> Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> -- 
>> Robert Muir
>> rcmuir@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> -- 
> View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message