From: Philip Brown
To: java-user@lucene.apache.org
Subject: Re: Phrase search using quotes -- special Tokenizer
Date: Fri, 1 Sep 2006 15:02:18 -0700 (PDT)
Message-ID: <6106920.post@talk.nabble.com>
In-Reply-To: <44F83F0A.7080005@gmail.com>

Well, I tried that, and it still doesn't seem to work. I would be happy to zip up the new files so you can see what I'm using -- maybe you can get it to work.

The first time, I tried building the documents without quotes surrounding each phrase. Then I retried, enclosing every phrase within double quotes. Neither seemed to work. When constructing the query string for the search, I always added the double quotes (otherwise, it would treat the phrase as multiple terms). (I didn't even test the underscore and hyphenated terms.)

I thought Lucene was (sort of by default) set up to search quoted phrases. From http://lucene.apache.org/java/docs/api/index.html --> A Phrase is a group of words surrounded by double quotes such as "hello dolly".

So, this should be easy, right? I must be missing something stupid.

Thanks,
Philip


Mark Miller-5 wrote:
>
> So this will recognize anything in quotes as a single token and '_' and
> '-' will not break up words. There may be some repercussions for the NUM
> token but nothing I'd worry about. Maybe you want to use Unicode for '-'
> and '_' as well...I wouldn't worry about it myself.
>
> - Mark
>
>
> TOKEN : { // token patterns
>
> // basic word: a sequence of digits & letters
> ||)+ >
>
> |
>
> // internal apostrophes: O'Reilly, you're, O'Reilly's
> // use a post-filter to remove possessives
> | ("'" )+ >
>
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> | "." ( ".")+ >
>
> // company names like AT&T and Excite@Home.
> | ("&"|"@") >
>
> // email addresses
> | (("."|"-"|"_") )* "@"
> (("."|"-") )+ >
>
> // hostname
> | ("." )+ >
>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> |
> |
> | ( )+
> | ( )+
> | ( )+
> | ( )+
> )
> >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT: // at least one digit
> (|)*
>
> (|)*
> >
>
> | < #ALPHA: ()+>
> | < #LETTER: // unicode letters
> [
> "\u0041"-"\u005a",
> "\u0061"-"\u007a",
> "\u00c0"-"\u00d6",
> "\u00d8"-"\u00f6",
> "\u00f8"-"\u00ff",
> "\u0100"-"\u1fff",
> "-", "_"
> ]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.
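
For anyone following the thread, the behaviour Mark describes (anything in double quotes becomes one token, and '_' and '-' do not break up words) can be roughly sketched in plain Java with a regular expression. This is only an illustration under stated assumptions: it does not use Lucene or the JavaCC grammar quoted in the message, it ignores the other token types (NUM, EMAIL, and so on), and the class name QuotedTokenSketch and the method tokenize are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Lucene code: approximates the described
// tokenization with java.util.regex only.
class QuotedTokenSketch {

    // Alternative 1: a double-quoted run, captured without the quotes.
    // Alternative 2: a word made of letters, digits, '_' and '-'.
    private static final Pattern TOKEN =
        Pattern.compile("\"([^\"]+)\"|([\\p{L}\\p{N}_-]+)");

    // Returns the tokens in order of appearance; a quoted phrase
    // comes back as a single token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"hello dolly\" foo_bar x-ray plain words"));
    }
}
```

Running main prints [hello dolly, foo_bar, x-ray, plain, words]: the quoted phrase stays whole, and the underscore and hyphenated terms are not split.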