lucene-java-user mailing list archives

From Beady Geraghty <beadygerag...@gmail.com>
Subject Re: words with more than 1 hyphen ?
Date Thu, 08 Dec 2005 17:41:34 GMT
Thanks for the advice.

It is hard to say whether the usability folks want
to distinguish between "/usr/include" and "usr include".
Actually, I am sure that they would, but whether they would
accept "usr include" is the right question to ask :-)
I'll have to sort it out with them :-(

Thanks again.


On 12/8/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
>
> On Dec 8, 2005, at 10:15 AM, Beady Geraghty wrote:
> > Since someone suggested hyphen, the next request
> > is underscore.  I can see more and more of these requests.
> > Also, people might like to search for "/usr/include/wchar.h" (hence,
> > the slash) and apostrophe etc. There really isn't a set of
> > requirements up front. In fact people want EVERYTHING if
> > they could, and full flexibility (even though they don't know
> > whether they will need it or not.)
> > So it appears that doing something "general" is better.
>
> Keep in mind that even if you tokenize "/usr/include" as "usr" and
> "include", a query for "/usr/include" will still match if it is
> analyzed in a compatible manner.   It is important to realize that
> bits and pieces getting eaten during analysis does not make
> things unfindable.  Sure, if someone literally wants to
> search for "/" only then it is important to keep upfront, but
> generally this is not the case.
>
> > I have been using StandardAnalyzer for the things you mentioned, like
> > email address, and www.google.com or i.b.m.  Those are good things
> > for me to have.  Since I've used it now, if I change it now, I
> > might break other people's dependencies.
>
> WhitespaceTokenizer followed by a LowerCaseFilter also catches the
> cases you mention.
>
> Again, my strong recommendation with tokenization is to write lots
> of tests that assert what the users expect.
>
> > If you do have a list of pitfalls from JavaCC, could you point me
> > to it? That way, I can think about some of the potential issues and
> > decide whether I should just abandon using JavaCC.
>
> It all really depends.  JavaCC is complex; that is my only
> reservation about recommending it.  If you can get by with simpler
> analysis techniques, then I say go for those instead.
>
> But, JavaCC is powerful.  It simply isn't necessary for the bulk of
> analysis cases I've come across.  Works great for parsing query
> expressions though.
>
>        Erik
>
>
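
Erik's point about compatible analysis, and his suggestion to pin down
user expectations with tests, can be sketched roughly as follows. This is
plain Java, not the Lucene API: the class AnalysisSketch and its methods
are hypothetical stand-ins that only imitate a letter/digit tokenizer and
a whitespace tokenizer followed by lowercasing. The same analysis is
applied to the indexed text and to the query, then the query tokens are
matched as a phrase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer; this is NOT the Lucene API.
class AnalysisSketch {

    // Treats every character that is not a letter or digit (slash, hyphen,
    // dot, underscore, ...) as a separator, then lowercases each token.
    static List<String> letterDigitTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Imitates whitespace tokenization followed by lowercasing:
    // punctuation stays inside the token.
    static List<String> whitespaceLowerTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if the query tokens occur consecutively in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = letterDigitTokens("#include </usr/include/wchar.h>");
        // The slashes are dropped from both the document and the query,
        // so the two queries below are indistinguishable after analysis.
        System.out.println(containsPhrase(doc, letterDigitTokens("/usr/include")));
        System.out.println(containsPhrase(doc, letterDigitTokens("usr include")));
    }
}
```

In this sketch both queries match, which is exactly the "/usr/include"
versus "usr include" question to settle with the usability folks. The
whitespace variant keeps "/usr/include/wchar.h" as a single token, so
there a query for "/usr/include" alone would not match; small tests like
these make that trade-off explicit before an analyzer is chosen.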
