lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Enhance StandardTokenizer to support words which will not be tokenized
Date Wed, 03 Jun 2009 16:10:44 GMT
You'd have to modify the JFlex grammar that StandardTokenizer is
generated from.  I'd suggest adding a generic "protected words"
approach, whereby you can pass in a list of words that must be
emitted as single tokens.

This would be a nice patch/improvement.
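
To make the "protected words" idea concrete, here is a minimal sketch in plain Java. It is not the JFlex grammar change suggested above and it does not use any Lucene API; the class name ProtectedWordTokenizer, the tokenize method, and the simplified splitting rule (lowercase, break on anything that is not a letter or digit) are all assumptions made for illustration. It only shows the core behaviour: check the caller-supplied list first, and fall back to normal splitting otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of a "protected words" list; not Lucene code. */
public class ProtectedWordTokenizer {

    /**
     * Splits text into lowercase tokens. Any entry of protectedWords
     * (assumed to be lowercase) that appears at a token start and is not
     * followed by a letter or digit is emitted as one token.
     */
    public static List<String> tokenize(String text, Set<String> protectedWords) {
        List<String> tokens = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < lower.length()) {
            char c = lower.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            // Protected words take priority over the default splitting rule.
            String match = longestProtectedAt(lower, i, protectedWords);
            if (match != null) {
                tokens.add(match);
                i += match.length();
            } else if (Character.isLetterOrDigit(c)) {
                // Default rule: a token is a run of letters and digits.
                int start = i;
                while (i < lower.length() && Character.isLetterOrDigit(lower.charAt(i))) {
                    i++;
                }
                tokens.add(lower.substring(start, i));
            } else {
                i++; // Unprotected punctuation is dropped.
            }
        }
        return tokens;
    }

    /** Returns the longest protected word starting at pos, or null if none. */
    private static String longestProtectedAt(String s, int pos, Set<String> words) {
        String best = null;
        for (String w : words) {
            if (w.isEmpty() || !s.startsWith(w, pos)) {
                continue;
            }
            int end = pos + w.length();
            // Require a boundary so that ".net" does not match ".network".
            boolean atBoundary = end == s.length()
                    || !Character.isLetterOrDigit(s.charAt(end));
            if (atBoundary && (best == null || w.length() > best.length())) {
                best = w;
            }
        }
        return best;
    }
}
```

With the protected set {"c++", "c#", ".net"}, the input "I use C++, C# and .NET daily." yields the tokens i, use, c++, c#, and, .net, daily. With an empty set, "C++ and .NET" yields c, and, net, which is the splitting the original question wants to avoid. A real patch would express the same priority rule in the tokenizer grammar rather than in a hand-written loop like this one.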


On Jun 3, 2009, at 4:07 AM, ami dudu wrote:

> Hi, I'm using a StandardTokenizer, which does a great job for me, but I
> need to enhance it somehow so that words like "c++", "c#", and ".net"
> are kept as is and not tokenized into "c" or "net".
> I know that there are other tokenizers, such as KeywordTokenizer and
> WhitespaceTokenizer, but they do not include the StandardTokenizer
> logic.
> Any ideas on the best way to add this enhancement?
> Thanks,
> Amid

Grant Ingersoll
