lucene-java-user mailing list archives

From Philip Brown <...@us.ibm.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 22:53:19 GMT

I added <QUOTED> to the other section (the grammar), reran JavaCC, and imported
the new files...but I still get the same result -- no results.  (Quotes are in
the source text and query string.)  Anything else I might be missing?

Philip


Mark Miller-5 wrote:
> 
> Sorry to hear you're having trouble. You do need the double quotes in the
> source text, and you will also need them in the query string. Make sure
> they are in both places. My machine is hosed right now or I would do it for
> you real quick. My guess is that I forgot to mention: not only do you need
> to add the <QUOTED> definition to the TOKEN section, but below that you will
> find the grammar, and you need to add <QUOTED> to the grammar as well. If you
> look at how <NUM> and <APOSTROPHE> are done you will probably see what you
> should do. If not, my machine should be back up tomorrow...
> 
> - Mark
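
A sketch of the grammar change described above, assuming the next() production in your StandardTokenizer.jj looks roughly like the one shipped with Lucene at the time. The list of alternatives is reproduced from memory and may differ in your copy; the only addition is the <QUOTED> line, and the action block that builds the Token stays as it is.

```
org.apache.lucene.analysis.Token next() throws IOException :
{
  Token token = null;
}
{
  ( token = <ALPHANUM> |
    token = <QUOTED> |       // new: must also be defined in the TOKEN section
    token = <APOSTROPHE> |
    token = <ACRONYM> |
    token = <COMPANY> |
    token = <EMAIL> |
    token = <HOST> |
    token = <NUM> |
    token = <CJ> |           // assumption: present in your version of the file
    token = <EOF>
  )
  {
    // existing action, unchanged: returns null at EOF, otherwise wraps
    // token.image in an org.apache.lucene.analysis.Token
  }
}
```

After editing the .jj file, JavaCC has to be rerun so the generated tokenizer sources pick up the new alternative.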
> 
> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>>
>>
>> Well, I tried that, and it doesn't seem to work still.  I would be happy to
>> zip up the new files, so you can see what I'm using -- maybe you can get it
>> to work.  The first time, I tried building the documents without quotes
>> surrounding each phrase.  Then, I retried by enclosing every phrase within
>> double quotes.  Neither seemed to work.  When constructing the query string
>> for the search, I always added the double quotes (otherwise, it'd think it
>> was multiple terms).  (I didn't even test the underscore and hyphenated
>> terms.)  I thought Lucene was (sort of by default) set up to search quoted
>> phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
>> Phrase is a group of words surrounded by double quotes such as "hello
>> dolly".  So, this should be easy, right?  I must be missing something
>> stupid.
>>
>> Thanks,
>>
>> Philip
>>
>>
>> Mark Miller-5 wrote:
>> >
>> > So this will recognize anything in quotes as a single token and '_' and
>> > '-' will not break up words. There may be some repercussions for the NUM
>> > token but nothing I'd worry about. Maybe you want to use Unicode for '-'
>> > and '_' as well...I wouldn't worry about it myself.
>> >
>> > - Mark
>> >
>> >
>> > TOKEN : {                      // token patterns
>> >
>> >   // basic word: a sequence of digits & letters
>> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>> >
>> > | <QUOTED:     "\"" (~["\""])+ "\"">
>> >
>> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
>> >   // use a post-filter to remove possessives
>> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>> >
>> >   // acronyms: U.S.A., I.B.M., etc.
>> >   // use a post-filter to remove dots
>> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>> >
>> >   // company names like AT&T and Excite@Home.
>> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>> >
>> >   // email addresses
>> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>> > (("."|"-") <ALPHANUM>)+ >
>> >
>> >   // hostname
>> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>> >
>> >   // floating point, serial, model numbers, ip addresses, etc.
>> >   // every other segment must have at least one digit
>> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>> >        | <HAS_DIGIT> <P> <ALPHANUM>
>> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>> >         )
>> >   >
>> > | <#P: ("_"|"-"|"/"|"."|",") >
>> > | <#HAS_DIGIT:                      // at least one digit
>> >     (<LETTER>|<DIGIT>)*
>> >     <DIGIT>
>> >     (<LETTER>|<DIGIT>)*
>> >   >
>> >
>> > | < #ALPHA: (<LETTER>)+>
>> > | < #LETTER:                      // unicode letters
>> >       [
>> >        "\u0041"-"\u005a",
>> >        "\u0061"-"\u007a",
>> >        "\u00c0"-"\u00d6",
>> >        "\u00d8"-"\u00f6",
>> >        "\u00f8"-"\u00ff",
>> >        "\u0100"-"\u1fff",
>> >        "-", "_"
>> >       ]
>> >   >
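
To make the intended effect of the <QUOTED> pattern and the extra '-' and '_' letters concrete, here is a minimal Java sketch that approximates just those two rules with a regular expression. It is not the JavaCC-generated tokenizer: the class name QuotedTokenSketch is hypothetical, only ASCII letters and digits are covered, and the other token types (APOSTROPHE, EMAIL, NUM, and so on) are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; approximates two rules of the grammar above.
public class QuotedTokenSketch {
    // First alternative mirrors <QUOTED>: a double quote, one or more
    // non-quote characters, then a closing double quote.
    // Second alternative mirrors a word whose "letters" include '-' and '_'
    // (ASCII only here; the grammar covers a wider Unicode range).
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[A-Za-z0-9_-]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // The quotes stay inside the token text, as they do in token.image.
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [say, "hello dolly", to, foo_bar, and, x-ray]
        System.out.println(tokenize("say \"hello dolly\" to foo_bar and x-ray"));
    }
}
```

Note that the quote characters remain part of the token text in this sketch, which is consistent with the advice elsewhere in the thread that the quotes need to be present in both the source text and the query string.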
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>> Sent from the Lucene - Java Users forum at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107527
Sent from the Lucene - Java Users forum at Nabble.com.



