lucene-java-user mailing list archives

From Beady Geraghty <beadygerag...@gmail.com>
Subject Re: words with more than 1 hyphen ?
Date Thu, 08 Dec 2005 02:08:08 GMT
Hi Erik,

Thank you so much for pointing out the error :-)
It should have been

 | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"(<LETTER>)+)*>

I had left out the + after the third <LETTER>, along with the pair of
parentheses around it. I still wonder why my indexer and query parser
worked before while the token stream did not.  Anyhow, it now seems to
work for indexing, query parsing, and reading off the token stream.
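
For anyone following along, here is a rough way to see the difference
between the two rules outside JavaCC, using java.util.regex. It is only
a sketch under some assumptions: <LETTER> is approximated as ASCII
letters, regex matching stands in for JavaCC's longest-match behaviour,
and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWordDemo {
    // ASCII-only stand-in for the grammar's <LETTER> token (an assumption;
    // the real <LETTER> definition in the .jj file is broader).
    private static final String LETTER = "[A-Za-z]";

    // Original rule: only ONE letter is allowed after each additional hyphen.
    public static final Pattern BROKEN = Pattern.compile(
        LETTER + "+-" + LETTER + "+(-" + LETTER + ")*");

    // Corrected rule: one or more letters are allowed after each additional hyphen.
    public static final Pattern FIXED = Pattern.compile(
        LETTER + "+-" + LETTER + "+(-" + LETTER + "+)*");

    // Returns the prefix of text matched by the rule, or "" if it does not match.
    public static String matchedPrefix(Pattern rule, String text) {
        Matcher m = rule.matcher(text);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String[] words = {"green-monster", "Merry-go-round", "editor-in-chief"};
        for (String word : words) {
            System.out.println(word
                + " -> broken: " + matchedPrefix(BROKEN, word)
                + ", fixed: " + matchedPrefix(FIXED, word));
        }
    }
}
```

With the original rule the match stops one letter after the second
hyphen, so the remaining letters are left over for some other token
rule. That is consistent with the "Merry-go-r" / "ound" split I saw.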

In general, do the rules in JavaCC work pretty well?  There may be
more requests to include punctuation in search terms, so I will have
to keep modifying this .jj file.  Are there things I should watch out
for before the grammar gets overly complicated and I get stuck
somewhere down the road?

Thanks again.



On 12/7/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
> > 1. I modified the StandardTokenizer.jj file.
> >
> >    Essentially, I added the following to StandardTokenizer.jj
> >  | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>
>
> Is that the only change you made to the .jj file?   Where did you put
> that exactly?
>
> Don't you need a * after the second <LETTER>?
>
> > 4. I was able to index and retrieve words like
> > merry-go-round (as opposed to merry go round).  So, I
> > was quite happy.
> > Now I want to get "merry-go-round" from the token
> > stream.  And that doesn't seem to work.
> > Note that retrieving words with 1 hyphen seems to work,
> > but 2 hyphens seem to be a problem.
> >
> > In getting the tokens from the stream, I get
> > "Merry-go-r" and "ound"  instead of "Merry-go-round"
> > "editor-in-c" and "hief"  instead of "editor-in-chief".
> > This behaviour is so strange, and I don't know how
> > the indexer and query processing know about "merry-go-round",
> > and yet the TokenStream doesn't.
>
> I think the missing * above explains what you're seeing.
>
> > "green-monster" would work.  But not words with more than
> > one hyphen.
>
> I'm surprised this one worked - maybe some other token in JavaCC is
> catching that?
>
> JavaCC is perhaps overkill for what you want.  If you don't need any
> of the other fancy analysis tricks that StandardTokenizer has, you
> could just use WhitespaceAnalyzer, LowerCaseFilter, voila, your
> hyphenated tokens would come right out.
>
> > (By the way, currently, I convert a hyphenated word into a phrase,
> > but to me, that seems like special casing hyphenated words, and I
> > just want to stay away from special casing.  People have been asking
> > for all sorts of punctuation, such as _ or / etc.  I thought that
> > if I learn how to modify the .jj files and produce the right tokens,
> > I am better off.)
>
> Unless you need the other features of StandardTokenizer, you may be
> best staying away from JavaCC altogether.  It is its own complex
> world that might be more than what you need.
>
>        Erik
>
