lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: words with more than 1 hyphen ?
Date Wed, 07 Dec 2005 20:24:56 GMT
> 1. I modified the StandardTokenizer.jj file.
>    Essentially, I added the following to StandardTokenizer.jj
>  | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>

Is that the only change you made to the .jj file?   Where did you put  
that exactly?

Don't you need a * after the second <LETTER>?

> 4. I was able to index and retrieve words like
> merry-go-round (as oppose to merry go round).  So, I
> was quite happy.
> Now I want to get "merry-go-round" from the token
> stream.  And that doesn't seem to work.
> Note that retrieve words with 1 hyphen seems to work,
> but 2 hyphens seems to represent a problem.
> In getting the tokens from the stream, I get
> "Merry-go-r" and "ound"  instead of "Merry-go-round"
> "editor-in-c" and "hief"  instead of "editor-in-chief".
> This behaviour is so strange, and I don't know how
> the indexer and query processing knows about "merry-go-round",
> and yet the TokenStream doesn't.

I think the missing * above explains what you're seeing.

> "green-monster" would work.  But not words with more than
> one hyphen.

I'm surprised this one worked - maybe some other token in JavaCC is  
catching that?

JavaCC is perhaps overkill for what you want.  If you don't need any  
of the other fancy analysis tricks that StandardTokenizer has, you  
could just use WhiteSpaceAnalyzer, LowerCaseFilter, voila, your  
hyphenated tokens would come right out.

> (By the way, currently, I convert a hyphenated word into a phrase,
> but to me, that seems like special casing hyphenated words, and I
> just want to stay away from special casing.  People has been asking
> for all sorts of punctuation, such as _ or / etc.  I thought that  
> if I learn
> how to do modify the .jj files and produce the right tokens, I am  
> better
> off.

Unless you need the other features of StandardTokenizer, you may be  
best staying away from JavaCC altogether.  It is it's own complex  
world that might be more than what you need.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message