lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: words with more than 1 hyphen ?
Date Thu, 08 Dec 2005 02:26:19 GMT

On Dec 7, 2005, at 9:08 PM, Beady Geraghty wrote:
> In general, do the rules in JavaCC work pretty well?

In general, all answers would be too general to be useful :)

JavaCC is great - I'm using it for a custom query parser myself.  But
it's not for the faint of heart.  It may be more than you need; it
all depends.  The main thing StandardTokenizer does is keep e-mail
addresses intact, along with a few other fiddly things.

If you provide us with some sample text and how you want that  
tokenized, I'm sure we could offer suggestions.

>   Since
> there may be more requests to include punctuation
> in the search terms, I will have to keep modifying this .jj file.
> I wonder if there are things that I should watch out for before
> it gets overly complicated and I get stuck somewhere down the
> road?

There are many pitfalls with JavaCC grammars.  It takes practice and  
unit tests to get this stuff right.  The same could be said of any  
style of tokenization.  Write plenty of tests to ensure you don't break
expected behavior as you tweak.
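
To make the testing point concrete, here is a rough sketch of the kind of
regression checks I mean.  It is not StandardTokenizer and not a JavaCC
grammar: the class HyphenTokenizerSketch, its regex, and the sample inputs
are all hypothetical stand-ins, assuming you want words with any number of
internal hyphens and e-mail addresses kept as single tokens.  The idea is
only that each expectation is pinned down before the rules get tweaked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in tokenizer; NOT Lucene's StandardTokenizer.
final class HyphenTokenizerSketch {

    // First alternative: a simplified e-mail address (assumed shape).
    // Second alternative: alphanumeric runs joined by single internal hyphens.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+"
        + "|[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Fails loudly if the tokenizer output drifts from the expectation.
    static void check(String input, String... expected) {
        List<String> actual = tokenize(input);
        if (!actual.equals(Arrays.asList(expected))) {
            throw new AssertionError(
                "input=\"" + input + "\" expected=" + Arrays.asList(expected)
                + " actual=" + actual);
        }
    }

    public static void main(String[] args) {
        // Words with more than one hyphen stay whole.
        check("state-of-the-art parser", "state-of-the-art", "parser");
        // E-mail addresses stay whole.
        check("mail erik@example.com now", "mail", "erik@example.com", "now");
        // Stray, doubled, and trailing hyphens are dropped, not attached.
        check("end - of -- line-", "end", "of", "line");
        System.out.println("all tokenizer checks passed");
    }
}
```

Every time a new punctuation request comes in, add a check for it first,
then change the rules until the old checks and the new one pass together.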

