lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Philip Brown <...@us.ibm.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 22:02:18 GMT

Well, I tried that, and it doesn't seem to work still.  I would be happy to
zip up the new files, so you can see what I'm using -- maybe you can get it
to work.  The first time, I tried building the documents without quotes
surrounding each phrase.  Then, I retried by enclosing every phrase within
double quotes.  Neither seemed to work.  When constructing the query string
for the search, I always added the double quotes (otherwise, it'd think it
was multiple terms).  (I didn't even test the underscore and hyphenated
terms.)  I thought Lucene was (sort of by default) set up to search quoted
phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
Phrase is a group of words surrounded by double quotes such as "hello
dolly".  So, this should be easy, right?  I must be missing something
stupid.

Thanks,

Philip


Mark Miller-5 wrote:
> 
> So this will recognize anything in quotes as a single token and '_' and 
> '-' will not break up words. There may be some repercussions for the NUM 
> token but nothing I'd worry about. maybe you want to use Unicode for '-' 
> and '_' as well...I wouldn't worry about it myself.
> 
> - Mark
> 
> 
> TOKEN : {                      // token patterns
> 
>   // basic word: a sequence of digits & letters
>   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
> 
> | <QUOTED:     "\"" (~["\""])+ "\"">
> 
>   // internal apostrophes: O'Reilly, you're, O'Reilly's
>   // use a post-filter to remove possesives
> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
> 
>   // acronyms: U.S.A., I.B.M., etc.
>   // use a post-filter to remove dots
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> 
>   // company names like AT&T and Excite@Home.
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> 
>   // email addresses
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>

> (("."|"-") <ALPHANUM>)+ >
> 
>   // hostname
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> 
>   // floating point, serial, model numbers, ip addresses, etc.
>   // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P>
<HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P>
<ALPHANUM>)+
>         )
>   >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT:                      // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>   >
> 
> | < #ALPHA: (<LETTER>)+>
> | < #LETTER:                      // unicode letters
>       [
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "-", "_"
>       ]
>   >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message