lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: words with more than 1 hyphen ?
Date Thu, 08 Dec 2005 16:15:08 GMT

On Dec 8, 2005, at 10:15 AM, Beady Geraghty wrote:
> Since someone suggested hyphen, the next requestion
> is underscore.  I can see more and more of these requests.
> Also, people might like to search  for "/usr/include/wchar.h"  (hence,
> the slash) and apostrophe etc. There really isn't a set of  
> requirements
> upfront. In fact people wants EVERYTHING if
> they could, and full flexibility (even though they don't know
> whether they will need it or not.)
> So it appears that doing something "general" is better.

Keep in mind that even if you tokenize "/usr/include" as "usr" and  
"include" that a query for "/usr/include" will still match if it is  
analyzed in a compatible manner.   It is important to realize that  
just because bits and pieces get eaten during analysis that it  
doesn't make things unfindable.  Sure, if someone literally wants to  
search for "/" only then it is important to keep upfront, but  
generally this is not the case.

> I have been using StandardAnalyzer for the things you mentioned, like
> email address, and or i.b.m.  Those are good things
> for me to have.  Since I've used it now, if I change it now, I  
> might break
> other
> people's dependencies.

WhitespaceTokenizer followed by a LowerCaseFilter also catches the  
cases you mention.

Again, lots of tests to assert what the users expect is a strong  
recommendation I have with tokenization.

> If you do have a list of pitfalls from javaCC, could you point me  
> to it,
> that way, I can think about some of the potential issues and decide
> whether I should just abandon using javaCC ?

It all really depends.  JavaCC is complex, that is my only  
reservation for recommending it.  If you can get by with simpler  
analysis techniques, then I say go for those instead.

But, JavaCC is powerful.  It simply isn't necessary for the bulk of  
analysis cases I've come across.  Works great for parsing query  
expressions though.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message