lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 23:59:42 GMT
OK, I've gotta ask. Have you examined your index with Luke to see if what
you *think* is in the index actually *is*???

Erick

On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>
>
> Interesting...just ran a test where I put double quotes around everything
> (including single keywords) of source text and then ran searches for a
> known
> keyword with and without double quotes -- doesn't find either time.
>
>
> Mark Miller-5 wrote:
> >
> > Sorry to hear you're having trouble. You indeed need the double quotes
> in
> > the source text. You will also need them in the query string. Make sure
> > they
> > are in both places. My machine is hosed right now or I would do it for
> you
> > real quick. My guess is that I forgot to mention...no only do you need
> to
> > add the <QUOTED> definiton to the TOKEN section, but below that you will
> > find the grammer...you need to add <QUOTED> to the grammer. If you look
> > how
> > <NUM> and <APOSTROPHE> are done you will prob see what you should do.
If
> > not, my machine should be back up tomarrow...
> >
> > - Mark
> >
> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
> >>
> >>
> >> Well, I tried that, and it doesn't seem to work still.  I would be
> happy
> >> to
> >> zip up the new files, so you can see what I'm using -- maybe you can
> get
> >> it
> >> to work.  The first time, I tried building the documents without quotes
> >> surrounding each phrase.  Then, I retried by enclosing every phrase
> >> within
> >> double quotes.  Neither seemed to work.  When constructing the query
> >> string
> >> for the search, I always added the double quotes (otherwise, it'd think
> >> it
> >> was multiple terms).  (I didn't even test the underscore and hyphenated
> >> terms.)  I thought Lucene was (sort of by default) set up to search
> >> quoted
> >> phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
> >> Phrase is a group of words surrounded by double quotes such as "hello
> >> dolly".  So, this should be easy, right?  I must be missing something
> >> stupid.
> >>
> >> Thanks,
> >>
> >> Philip
> >>
> >>
> >> Mark Miller-5 wrote:
> >> >
> >> > So this will recognize anything in quotes as a single token and '_'
> and
> >> > '-' will not break up words. There may be some repercussions for the
> >> NUM
> >> > token but nothing I'd worry about. maybe you want to use Unicode for
> >> '-'
> >> > and '_' as well...I wouldn't worry about it myself.
> >> >
> >> > - Mark
> >> >
> >> >
> >> > TOKEN : {                      // token patterns
> >> >
> >> >   // basic word: a sequence of digits & letters
> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> >> >
> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
> >> >
> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
> >> >   // use a post-filter to remove possesives
> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> >> >
> >> >   // acronyms: U.S.A., I.B.M., etc.
> >> >   // use a post-filter to remove dots
> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> >> >
> >> >   // company names like AT&T and Excite@Home.
> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> >> >
> >> >   // email addresses
> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> >> > (("."|"-") <ALPHANUM>)+ >
> >> >
> >> >   // hostname
> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> >> >
> >> >   // floating point, serial, model numbers, ip addresses, etc.
> >> >   // every other segment must have at least one digit
> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM>
<P> <HAS_DIGIT>)+
> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT>
<P> <ALPHANUM>)+
> >> >         )
> >> >   >
> >> > | <#P: ("_"|"-"|"/"|"."|",") >
> >> > | <#HAS_DIGIT:                      // at least one digit
> >> >     (<LETTER>|<DIGIT>)*
> >> >     <DIGIT>
> >> >     (<LETTER>|<DIGIT>)*
> >> >   >
> >> >
> >> > | < #ALPHA: (<LETTER>)+>
> >> > | < #LETTER:                      // unicode letters
> >> >       [
> >> >        "\u0041"-"\u005a",
> >> >        "\u0061"-"\u007a",
> >> >        "\u00c0"-"\u00d6",
> >> >        "\u00d8"-"\u00f6",
> >> >        "\u00f8"-"\u00ff",
> >> >        "\u0100"-"\u1fff",
> >> >        "-", "_"
> >> >       ]
> >> >   >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
> >> Sent from the Lucene - Java Users forum at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107649
> Sent from the Lucene - Java Users forum at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message