lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: How to tokenize with comma in standard tokenizer
Date Mon, 17 Sep 2007 15:04:50 GMT
Take the comma out of: | <#P: ("_"|"-"|"/"|"."|",") > in the 
StandardTokenizer.jj file (around line 92). Keep in mind that this will 
affect your ability to find tokens that were previously indexed with the 
comma in them (obviously). I believe the javacc target in the build file 
will regenerate the tokenizer. To run it, you need to get javacc and put 
a properties file called build.properties next to the build file, 
containing: javacc.home=c:/javacc (or wherever you put javacc).

Also, you could consider trying to pre-process the strings (replace the 
comma with a space or something).
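
A minimal sketch of that pre-processing idea, assuming you can touch the 
text before it reaches the analyzer. The class and method names 
(CommaPreprocessor, replaceCommas, roughTokens) are made up for 
illustration, and the whitespace split plus lowercasing only stands in 
for the analyzer; it is not what StandardTokenizer actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
public class CommaPreprocessor {

    // Replace every comma with a space so the analyzer sees separate words.
    public static String replaceCommas(String text) {
        return text.replace(',', ' ');
    }

    // Rough stand-in for analysis (an assumption, NOT StandardTokenizer):
    // split the pre-processed text on whitespace and lowercase each piece.
    public static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : replaceCommas(text).trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [thering6, proposal6, guyandgirl6]
        System.out.println(roughTokens("TheRing6,Proposal6,GuyandGirl6"));
    }
}
```

In practice you would pass the result of replaceCommas to your existing 
analyzer at both index time and query time, so the two sides stay 
consistent.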

- Mark

Bhavin Pandya wrote:
> Hi,
>
> The standard tokenizer works pretty well for me, but I found one problem with my usage.
>
> I want to tokenize "TheRing6,Proposal6,GuyandGirl6" as three separate tokens, while the
> standard analyzer treats it as one word because the token contains a digit.
>
> Expected three tokens:
> 1. thering6
> 2. proposal6
> 3. guyandgirl6
>
> I want to change this behaviour of the standard tokenizer for this purpose, but I don't
> know where to change it.
> Do I need to comment out some rule in the StandardTokenizer.jj file? I am confused by this
> file.
>
> Any pointers?
>
> - Bhavin
