From: Philip Brown <pmb@us.ibm.com>
To: java-user@lucene.apache.org
Subject: Re: Phrase search using quotes -- special Tokenizer
Date: Sat, 2 Sep 2006 11:31:18 -0700 (PDT)
Message-ID: <6115360.post@talk.nabble.com>
In-Reply-To: <44F98D2C.7030007@gmail.com>
References: <6093138.post@talk.nabble.com> <44F81F0D.7080605@gmail.com>
 <6098930.post@talk.nabble.com> <44F83F0A.7080005@gmail.com>
 <6106920.post@talk.nabble.com> <6107649.post@talk.nabble.com>
 <359a92830609011659o51839642g31502fef0fc86b28@mail.gmail.com>
 <6109067.post@talk.nabble.com>
 <359a92830609020643le432a02qeb19b6ec906e915f@mail.gmail.com>
 <44F98D2C.7030007@gmail.com>

I tend to agree with Mark.  I tried a query like so:

    TermQuery query = new TermQuery(new Term("keywordField", "phrase test"));
    IndexSearcher searcher = new IndexSearcher(activeIdx);
    Hits hits = searcher.search(query);

and it produced the expected results.  When building the index, I did NOT
enclose the keywords in quotes -- I just added them as UN_TOKENIZED.

Philip


Mark Miller-5 wrote:
>
> I think if he wants to use the QueryParser to parse his search strings,
> he has no choice but to modify it.  It will eat any pair of quotes going
> through it, no matter what analyzer is used.
>
> - Mark
>
>> Well, you're flying blind.  Is the behavior rooted in the indexing or
>> the querying?  Since you can't answer that, you're reduced to trying
>> random things and hoping that one of them works -- a little like
>> voodoo.  I've wasted faaaaarrrrrr too much time trying to solve what I
>> was *sure* was the problem, only to find it was somewhere else (the
>> last place I looked, of course)...
>>
>> Using Luke on a RAMDirectory: no, I don't know how to, but it should be
>> a simple thing to write the index to an FSDirectory at the same time
>> you create your RAMDirectory, and then use Luke on that.  This is
>> debugging, after all.
>>
>> I'd be really, really, really reluctant to modify the query parser
>> and/or the tokenizer, since whenever I've been tempted it's usually
>> because I don't understand the tools already provided.  Then I have to
>> maintain my custom code.  Which sucks.  Although it sure feels more
>> productive to hack a bunch of code and get something that works 90% of
>> the time, then spend weeks making the other 10% work, than to take two
>> days to find the 3 lines you *really* need.
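Philip's TermQuery result above comes down to exact term matching. A minimal plain-Java model of the behavior (no Lucene dependency; the class and method names here are invented for illustration) shows why an UN_TOKENIZED "keyword" field matches the whole phrase, while anything that splits the query on whitespace -- as QueryParser effectively does -- finds nothing:

```java
import java.util.*;

public class ExactTermLookup {
    // Toy model of a keyword field: each stored value is ONE exact term,
    // the way an UN_TOKENIZED Lucene field stores it.  This is a sketch,
    // not Lucene's real data structure.
    static final Map<String, List<Integer>> index = new HashMap<>();

    static void addKeyword(String term, int docId) {
        index.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
    }

    // Analogous to TermQuery: no analysis, no splitting, exact match only.
    static List<Integer> termQuery(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    // What a whitespace-splitting parser effectively does to the same input.
    static List<Integer> parsedQuery(String text) {
        List<Integer> hits = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            hits.addAll(termQuery(tok));
        }
        return hits;
    }

    public static void main(String[] args) {
        addKeyword("phrase test", 1);                   // stored whole, no quotes needed
        System.out.println(termQuery("phrase test"));   // [1] -- exact term matches
        System.out.println(parsedQuery("phrase test")); // [] -- split terms match nothing
    }
}
```

The point of the sketch: the phrase is a single opaque term in the index, so the only query that can hit it is one that presents the identical single term.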
>>
>> Have you thought of a PatternAnalyzer?  It takes a regular expression
>> as the tokenizer and (from the Javadoc):
>>
>> <<< Efficient Lucene analyzer/tokenizer that preferably operates on a
>> String rather than a Reader, that can flexibly separate text into terms
>> via a regular expression Pattern (with behaviour identical to
>> String.split(String)), and that combines the functionality of
>> LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and
>> StopFilter into a single efficient multi-purpose class. >>>
>>
>> One word of caution: the regular expression consists of expressions
>> that *break* tokens, not expressions that *form* words, which threw me
>> at first.  Just like the doc says, it behaves like String.split....
>> This is in 2.0, although I *believe* it's also in the contrib section
>> of 1.9 (or it's in the regular API, I forget).
>>
>> Best
>> Erick
>>
>> On 9/1/06, Philip Brown wrote:
>>>
>>>
>>> No, I've never used Luke.  Is there an easy way to examine my
>>> RAMDirectory index?  I can create the index with no quoted keywords,
>>> and when I search for a keyword, I get back the expected results (I
>>> just can't search for a phrase that has whitespace in it).  If I
>>> create the index with phrases in quotes, then when I search for
>>> anything in double quotes, I get back nothing.  If I create the index
>>> with everything in quotes, then when I search for anything by the
>>> keyword field, I get nothing, regardless of whether I use quotes in
>>> the query string or not.  (I can get results back by searching on
>>> other fields.)  What do you think?
>>>
>>> Philip
>>>
>>>
>>> Erick Erickson wrote:
>>> >
>>> > OK, I've gotta ask.  Have you examined your index with Luke to see
>>> > if what you *think* is in the index actually *is*???
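Erick's caution -- that the PatternAnalyzer regex matches what *breaks* tokens, with String.split(String) semantics -- can be seen with plain java.util.regex, no Lucene involved (the input string here is just an invented example):

```java
import java.util.Arrays;

public class SplitSemantics {
    public static void main(String[] args) {
        // The pattern describes the SEPARATORS between tokens, exactly as
        // in String.split(String) -- not the shape of the words themselves.
        String text = "wi-fi access_point room 101";

        // Splitting only on whitespace: '-' and '_' survive inside tokens.
        System.out.println(Arrays.toString(text.split("[\\s]+")));
        // [wi-fi, access_point, room, 101]

        // Adding '-' and '_' to the break pattern splits those words apart.
        System.out.println(Arrays.toString(text.split("[\\s\\-_]+")));
        // [wi, fi, access, point, room, 101]
    }
}
```

So to keep hyphenated and underscored terms whole, the break pattern must *exclude* those characters -- the inverse of how one usually thinks about describing a token.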
>>> >
>>> > Erick
>>> >
>>> > On 9/1/06, Philip Brown wrote:
>>> >>
>>> >>
>>> >> Interesting... just ran a test where I put double quotes around
>>> >> everything (including single keywords) in the source text and then
>>> >> ran searches for a known keyword with and without double quotes --
>>> >> it doesn't find it either time.
>>> >>
>>> >>
>>> >> Mark Miller-5 wrote:
>>> >> >
>>> >> > Sorry to hear you're having trouble.  You indeed need the double
>>> >> > quotes in the source text.  You will also need them in the query
>>> >> > string.  Make sure they are in both places.  My machine is hosed
>>> >> > right now or I would do it for you real quick.  My guess is that
>>> >> > I forgot to mention... not only do you need to add the new
>>> >> > token's definition to the TOKEN section, but below that you will
>>> >> > find the grammar... you need to add it to the grammar as well.
>>> >> > If you look at how the existing tokens are done you will prob
>>> >> > see what you should do.  If not, my machine should be back up
>>> >> > tomorrow...
>>> >> >
>>> >> > - Mark
>>> >> >
>>> >> > On 9/1/06, Philip Brown wrote:
>>> >> >>
>>> >> >>
>>> >> >> Well, I tried that, and it doesn't seem to work still.  I would
>>> >> >> be happy to zip up the new files so you can see what I'm using
>>> >> >> -- maybe you can get it to work.  The first time, I tried
>>> >> >> building the documents without quotes surrounding each phrase.
>>> >> >> Then, I retried by enclosing every phrase within double quotes.
>>> >> >> Neither seemed to work.  When constructing the query string for
>>> >> >> the search, I always added the double quotes (otherwise, it'd
>>> >> >> think it was multiple terms).  (I didn't even test the
>>> >> >> underscore and hyphenated terms.)  I thought Lucene was (sort
>>> >> >> of by default) set up to search quoted phrases.
>>> >> >> From http://lucene.apache.org/java/docs/api/index.html -->
>>> >> >> A Phrase is a group of words surrounded by double quotes such
>>> >> >> as "hello dolly".  So, this should be easy, right?  I must be
>>> >> >> missing something stupid.
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >> >> Philip
>>> >> >>
>>> >> >>
>>> >> >> Mark Miller-5 wrote:
>>> >> >> >
>>> >> >> > So this will recognize anything in quotes as a single token,
>>> >> >> > and '_' and '-' will not break up words.  There may be some
>>> >> >> > repercussions for the NUM token but nothing I'd worry about.
>>> >> >> > Maybe you want to use Unicode escapes for '-' and '_' as
>>> >> >> > well... I wouldn't worry about it myself.
>>> >> >> >
>>> >> >> > - Mark
>>> >> >> >
>>> >> >> >
>>> >> >> > TOKEN : {                        // token patterns
>>> >> >> >
>>> >> >> >   // anything surrounded by double quotes is a single token
>>> >> >> >   <QUOTED: "\"" (~["\""])+ "\"" >
>>> >> >> >
>>> >> >> >   // basic word: a sequence of digits & letters
>>> >> >> > | <ALPHANUM: (<LETTER>|<DIGIT>)+ >
>>> >> >> >
>>> >> >> >   // internal apostrophes: O'Reilly, you're, O'Reilly's
>>> >> >> >   // use a post-filter to remove possesives
>>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>>> >> >> >
>>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
>>> >> >> >   // use a post-filter to remove dots
>>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>>> >> >> >
>>> >> >> >   // company names like AT&T and Excite@Home.
>>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>>> >> >> >
>>> >> >> >   // email addresses
>>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@"
>>> >> >> >          <ALPHANUM> (("."|"-") <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // hostname
>>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>>> >> >> >
>>> >> >> >   // floating point, serial, model numbers, ip addresses, etc.
>>> >> >> >   // every other segment must have at least one digit
>>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
>>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>>> >> >> >         )
>>> >> >> >   >
>>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>>> >> >> > | <#HAS_DIGIT:                  // at least one digit
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >     <DIGIT>
>>> >> >> >     (<LETTER>|<DIGIT>)*
>>> >> >> >   >
>>> >> >> >
>>> >> >> > | < #ALPHA: (<LETTER>)+ >
>>> >> >> > | < #LETTER:                    // unicode letters
>>> >> >> >     [
>>> >> >> >      "\u0041"-"\u005a",
>>> >> >> >      "\u0061"-"\u007a",
>>> >> >> >      "\u00c0"-"\u00d6",
>>> >> >> >      "\u00d8"-"\u00f6",
>>> >> >> >      "\u00f8"-"\u00ff",
>>> >> >> >      "\u0100"-"\u1fff",
>>> >> >> >      "-", "_"
>>> >> >> >     ]
>>> >> >> >   >
>>> >> >> > }
>>> >> >> >
>>> >> >> >
>>> >> >> > ---------------------------------------------------------------------
>>> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >> >> >
>>> >> >>
>>> >> >> --
>>> >> >> View this message in context:
>>> >> >> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>>> >> >> Sent from the Lucene - Java Users forum at Nabble.com.
>>> >> >
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107649
>>> >> Sent from the Lucene - Java Users forum at Nabble.com.
>>> >>
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6109067
>>> Sent from the Lucene - Java Users forum at Nabble.com.
>>>
>>
>
>

--
View this message in context:
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6115360
Sent from the Lucene - Java Users forum at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
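The quoted-token idea in Mark's grammar above -- treat a double-quoted run as one token, split everything else on whitespace -- can be approximated outside JavaCC with plain java.util.regex. This is a sketch only (the class and method names are invented for illustration), not the JavaCC tokenizer itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTokenizer {
    // First alternative: a double-quoted run, captured without the quotes.
    // Second alternative: any run of non-whitespace characters.
    private static final Pattern TOKEN =
        Pattern.compile("\"([^\"]+)\"|\\S+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // group(1) is non-null only when the quoted alternative matched
            tokens.add(m.group(1) != null ? m.group(1) : m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("find \"phrase test\" here"));
        // [find, phrase test, here]
    }
}
```

As with the grammar, the quoted alternative is tried first, so a phrase containing whitespace survives as a single token -- which is exactly what the thread needs the index-time and query-time tokenization to agree on.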