lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller" <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 22:21:37 GMT
Sorry to hear you're having trouble. You indeed need the double quotes in
the source text. You will also need them in the query string. Make sure they
are in both places. My machine is hosed right now or I would do it for you
real quick. My guess is that I forgot to mention...no only do you need to
add the <QUOTED> definiton to the TOKEN section, but below that you will
find the grammer...you need to add <QUOTED> to the grammer. If you look how
<NUM> and <APOSTROPHE> are done you will prob see what you should do. If
not, my machine should be back up tomarrow...

- Mark

On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>
>
> Well, I tried that, and it doesn't seem to work still.  I would be happy
> to
> zip up the new files, so you can see what I'm using -- maybe you can get
> it
> to work.  The first time, I tried building the documents without quotes
> surrounding each phrase.  Then, I retried by enclosing every phrase within
> double quotes.  Neither seemed to work.  When constructing the query
> string
> for the search, I always added the double quotes (otherwise, it'd think it
> was multiple terms).  (I didn't even test the underscore and hyphenated
> terms.)  I thought Lucene was (sort of by default) set up to search quoted
> phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
> Phrase is a group of words surrounded by double quotes such as "hello
> dolly".  So, this should be easy, right?  I must be missing something
> stupid.
>
> Thanks,
>
> Philip
>
>
> Mark Miller-5 wrote:
> >
> > So this will recognize anything in quotes as a single token and '_' and
> > '-' will not break up words. There may be some repercussions for the NUM
> > token but nothing I'd worry about. maybe you want to use Unicode for '-'
> > and '_' as well...I wouldn't worry about it myself.
> >
> > - Mark
> >
> >
> > TOKEN : {                      // token patterns
> >
> >   // basic word: a sequence of digits & letters
> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >
> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >
> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >   // use a post-filter to remove possesives
> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >
> >   // acronyms: U.S.A., I.B.M., etc.
> >   // use a post-filter to remove dots
> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >
> >   // company names like AT&T and Excite@Home.
> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >
> >   // email addresses
> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> > (("."|"-") <ALPHANUM>)+ >
> >
> >   // hostname
> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >
> >   // floating point, serial, model numbers, ip addresses, etc.
> >   // every other segment must have at least one digit
> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM>
<P> <HAS_DIGIT>)+
> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT>
<P> <ALPHANUM>)+
> >         )
> >   >
> > | <#P: ("_"|"-"|"/"|"."|",") >
> > | <#HAS_DIGIT:                      // at least one digit
> >     (<LETTER>|<DIGIT>)*
> >     <DIGIT>
> >     (<LETTER>|<DIGIT>)*
> >   >
> >
> > | < #ALPHA: (<LETTER>)+>
> > | < #LETTER:                      // unicode letters
> >       [
> >        "\u0041"-"\u005a",
> >        "\u0061"-"\u007a",
> >        "\u00c0"-"\u00d6",
> >        "\u00d8"-"\u00f6",
> >        "\u00f8"-"\u00ff",
> >        "\u0100"-"\u1fff",
> >        "-", "_"
> >       ]
> >   >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
> Sent from the Lucene - Java Users forum at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message