lucene-java-user mailing list archives

From Beady Geraghty
Subject Re: words with more than 1 hyphen ?
Date Thu, 08 Dec 2005 15:15:19 GMT
Thank you for your answer.

I would rather not ask such a "general" question, so that I could
learn more from the answers.
But I get random requests from people.  For example,
this request for hyphens originated from a colleague who is French,
and she believes that hyphens are important, though I don't
know whether existing users use hyphens or not.
It depends on who the users are.

Since someone suggested hyphens, the next request
will be underscores.  I can see more and more of these requests coming.
Also, people might like to search for "/usr/include/wchar.h" (hence
the slash), apostrophes, etc. There really isn't a set of requirements
up front. In fact, people want EVERYTHING if
they can have it, and full flexibility (even though they don't know
whether they will need it or not).
So it appears that doing something "general" is better.
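
To make the "general" idea concrete, here is a minimal sketch in plain Java
(not Lucene's API and not a JavaCC grammar) of a tokenizer whose set of
in-word punctuation is a parameter instead of being hard-coded. The class
name PunctuationAwareTokenizer and the extraChars parameter are hypothetical,
made up for illustration; the real StandardTokenizer rules are more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a character-based tokenizer with a
// configurable set of punctuation that counts as part of a word.
class PunctuationAwareTokenizer {
    private final String extraChars;

    // extraChars: punctuation to treat as word characters, e.g. "-_/.'"
    PunctuationAwareTokenizer(String extraChars) {
        this.extraChars = extraChars;
    }

    private boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || extraChars.indexOf(c) >= 0;
    }

    // Splits text on every character that is not a token character
    // and lowercases the resulting tokens.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        PunctuationAwareTokenizer t = new PunctuationAwareTokenizer("-_/.'");
        // prints [see, /usr/include/wchar.h, and, arc-en-ciel]
        System.out.println(t.tokenize("see /usr/include/wchar.h and arc-en-ciel"));
    }
}
```

One caveat with this approach: treating "." as a word character also glues a
sentence-ending period onto the last word, so each added character needs to
be checked against ordinary text, not only against the request that prompted it.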

I have been using StandardAnalyzer for the things you mentioned, like
email addresses and i.b.m.  Those are good things
for me to have.  Since people already rely on it, changing it now might
break their dependencies.

If you do have a list of pitfalls with JavaCC, could you point me to it?
That way, I can think about some of the potential issues and decide
whether I should just abandon using JavaCC.


On 12/7/05, Erik Hatcher wrote:
> On Dec 7, 2005, at 9:08 PM, Beady Geraghty wrote:
> > In general, do the rules in JavaCC work pretty well?
> In general, all answers would be too general to be useful :)
> JavaCC is great - I'm using it for a custom query parser myself.  But
> it's not for the faint of heart.  It may be more than you need, it
> all depends.  The main thing StandardTokenizer does is keep e-mail
> addresses intact, and a few other fiddly things.
> If you provide us with some sample text and how you want that
> tokenized, I'm sure we could offer suggestions.
> > Since there may be more requests to include punctuation
> > in the search terms, I will have to keep modifying this .jj file.
> > I wonder if there are things that I should watch out for before
> > it gets overly complicated and I get stuck somewhere down the
> > road?
> There are many pitfalls with JavaCC grammars.  It takes practice and
> unit tests to get this stuff right.  The same could be said of any
> style of tokenization.  Make lots of tests to ensure you don't break
> expected behavior as you tweak.
>        Erik
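
The advice about tests could look roughly like the following table-driven
check, written in plain Java without JUnit. Everything here is an assumption
for illustration: the class TokenizerRegressionCheck, the whitespaceTokenize
stand-in (a real setup would call the actual analyzer instead), and the
sample inputs, including the made-up address someone@example.com.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: pin down expected tokens per input so that
// a later grammar tweak that changes any of them is reported.
class TokenizerRegressionCheck {

    // Stand-in tokenizer: splits on runs of whitespace, nothing more.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Runs every case and returns one message per mismatch (empty = all pass).
    static List<String> check(Function<String, List<String>> tokenizer,
                              Map<String, List<String>> expected) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : expected.entrySet()) {
            List<String> actual = tokenizer.apply(e.getKey());
            if (!actual.equals(e.getValue())) {
                failures.add("\"" + e.getKey() + "\": expected "
                        + e.getValue() + " but got " + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cases = new LinkedHashMap<>();
        cases.put("a well-known fact", Arrays.asList("a", "well-known", "fact"));
        cases.put("mail someone@example.com", Arrays.asList("mail", "someone@example.com"));
        cases.put("the wchar_t type", Arrays.asList("the", "wchar_t", "type"));

        List<String> failures = check(TokenizerRegressionCheck::whitespaceTokenize, cases);
        for (String f : failures) {
            System.out.println("FAIL " + f);
        }
        System.out.println(failures.isEmpty()
                ? "all cases pass"
                : failures.size() + " failure(s)");
    }
}
```

The point of keeping the cases in one table is that each new punctuation
request becomes one more row, and the existing rows show right away whether
behavior that people already depend on has changed.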
