lucene-java-user mailing list archives

From Mark Miller <>
Subject Re: How to tokenize with comma in standard tokenizer
Date Mon, 17 Sep 2007 15:04:50 GMT
Take the comma out of: | <#P: ("_"|"-"|"/"|"."|",") > in the .jj file 
(around line 92), so that the rule reads: | <#P: ("_"|"-"|"/"|".") >. 
Keep in mind that this will affect being able to find tokens that were 
previously indexed with the comma there (obviously). I believe the javacc 
target in the build file will regenerate the tokenizer from the .jj file. 
To run it you will need to get javacc and put a properties file next to 
the build file that contains: javacc.home=c:/javacc (or wherever you 
put javacc).

Also, you could consider trying to pre-process the strings (replace the 
comma with a space or something).
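
A minimal sketch of that pre-processing idea, in plain Java with no Lucene 
dependency. The class name CommaPreprocessor and the method replaceCommas 
are hypothetical, made up for illustration. It assumes you can run the 
text through a helper like this before handing it to the analyzer, both 
at index time and at query time.

```java
public class CommaPreprocessor {

    // Hypothetical helper: swaps every comma for a space so that a
    // whitespace-aware tokenizer sees separate words instead of one
    // comma-joined token. Everything else in the string is left untouched.
    public static String replaceCommas(String text) {
        return text.replace(',', ' ');
    }

    public static void main(String[] args) {
        String input = "TheRing6,Proposal6,GuyandGirl6";
        // The result is what you would pass on to the analyzer.
        System.out.println(replaceCommas(input));
    }
}
```

This only changes the string. Any lowercasing or further splitting would 
still be up to whichever analyzer receives the result.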

- Mark

Bhavin Pandya wrote:
> Hi,
> The standard tokenizer works pretty well for me, but I found one problem with my usage.
> I want to tokenize "TheRing6,Proposal6,GuyandGirl6" as three separate tokens, while the
> standard analyzer treats it as one word because the token contains a digit.
> Expected three tokens:
> 1. thering6
> 2. proposal6
> 3. guyandgirl6
> I want to change this behaviour of the standard tokenizer for this purpose, but I don't
> know where to change it.
> Do I need to comment out some rule in the StandardTokenizer.jj file? I am confused by this.
> Any pointers?
> - Bhavin
