From: "Mark Miller"
To: java-user@lucene.apache.org
Date: Thu, 14 Jun 2007 15:06:30 -0400
Subject: Re: Wildcard query with untokenized punctuation (again)

It all depends on what you are looking for. I'll try to give a hint as
to what is going on now:

When the QueryParser parses <<smith,ann>>, it hands that whole piece to
the analyzer. Your analyzer returns two tokens: smith and ann. When the
QueryParser sees that more than one token came back from a piece it fed
to the analyzer, it makes a PhraseQuery from the returned tokens.

Remember that the QueryParser feeds the analyzer in pieces, and then
creates queries based on the number of tokens produced from each piece
(if the piece goes to the analyzer at all). Since you will be
preprocessing the query, the QueryParser is going to be parsing
<<smith ann*>>, which causes it to feed the analyzer smith and then
ann*. Neither of these pieces produces more than one token (ann* doesn't
even go to the analyzer), so no PhraseQuery is produced. Instead you
will get a BooleanQuery with the term smith and the wildcard query ann*,
both with an occur of whatever your default operator is.
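
The preprocessing pass discussed in this thread could look roughly like
the sketch below. It is only an illustration under assumptions: the
class QueryPreprocessor and its methods are hypothetical names, not part
of Lucene, and pieces() is a deliberately crude whitespace split that
stands in for the way the QueryParser hands pieces onward. It ignores
quotes, field prefixes and operators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Lucene class): cleans up the raw query
// string before it is handed to the QueryParser.
final class QueryPreprocessor {

    // Turn every comma into whitespace and collapse runs of whitespace,
    // so "smith,ann*" becomes "smith ann*".
    static String commasToSpaces(String rawQuery) {
        return rawQuery.replace(',', ' ').trim().replaceAll("\\s+", " ");
    }

    // Crude stand-in for the parser's pieces: split on single spaces.
    // Assumes the input was already normalized by commasToSpaces().
    static List<String> pieces(String normalizedQuery) {
        List<String> result = new ArrayList<>();
        if (normalizedQuery.isEmpty()) {
            return result;
        }
        for (String piece : normalizedQuery.split(" ")) {
            result.add(piece);
        }
        return result;
    }

    public static void main(String[] args) {
        String cleaned = commasToSpaces("smith,ann*");
        System.out.println(cleaned);
        System.out.println(pieces(cleaned));
    }
}
```

The cleaned string is what would then be passed to the QueryParser. Note
that this naive version also rewrites commas inside quoted phrases,
which may or may not be what you want.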

One thing I am wondering is whether you really want the query to be a
PhraseQuery, or whether you're just accepting the behavior you are
getting from the QueryParser. Right now, PhraseQuerys do not support
wildcards (nor do MultiPhraseQuerys). I don't think the support would be
that difficult (use a wildcard term enumerator to correctly fill out a
MultiPhraseQuery), but it might take some thought to get the QueryParser
to act as you want (generate a PhraseQuery or MultiPhraseQuery when it
sees such a query). Are you sure you need a PhraseQuery and not a
BooleanQuery of SHOULD clauses?

- Mark

On 6/14/07, Renaud Waldura wrote:
>
> Thanks guys, I like it! I'm already applying some regexps before query
> parsing anyway, so it's just another pass.
>
> Now, I'm not sure how to do that without breaking another QP feature
> that I kind of like: the query <<smith,ann>> is parsed to
> PhraseQuery("smith ann"). And that seems right, from a user standpoint.
>
> In fact, considering this, I realize <<smith,ann*>> should be parsed to
> MultiPhraseQuery("smith", "ann*"), not <<+smith +ann*>> as I said
> earlier.
>
> Brrrr. Getting hairy. Any hope?
>
> --Renaud
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Thursday, June 14, 2007 6:43 AM
> To: java-user@lucene.apache.org
> Subject: Re: Wildcard query with untokenized punctuation (again)
>
> Gotta agree with Erick here... the best idea is just to preprocess the
> query before sending it to the QueryParser.
>
> My first thought is always to get out the sledgehammer...
>
> - Mark
>
> Erick Erickson wrote:
> > Well, perhaps the simplest thing would be to pre-process the query and
> > make the comma into a whitespace before sending anything to the query
> > parser. I don't know how generalizable that sort of solution is in
> > your problem space though....
> >
> > Best
> > Erick
> >
> > On 6/13/07, Renaud Waldura wrote:
> >>
> >> My very simple analyzer produces tokens made of digits and/or letters
> >> only.
> >> Anything else is discarded. E.g. the input "smith,anna" gets
> >> tokenized as 2 tokens, first "smith" then "anna".
> >>
> >> Say I have indexed documents that contained both "smith,anna" and
> >> "smith,annanicole". To find them, I enter the query <<smith,ann*>>.
> >> The stock Lucene 2.0 query parser produces a PrefixQuery for the
> >> single token "smith,ann". This token doesn't exist in my index, and I
> >> don't get a match.
> >>
> >> I have found some references to this:
> >>
> >> http://www.nabble.com/Wildcard-query-with-untokenized-punctuation-tf3378386.html
> >>
> >> but I don't understand how I can fix it. Comma-separated terms like
> >> this can appear in any field; I don't think I can create an
> >> untokenized field.
> >>
> >> Really what I would like in this case is for the comma to be
> >> considered whitespace, and the query to be parsed to <<+smith
> >> +ann*>>. Any way I can do that?
> >>
> >> --Renaud
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
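
Returning to the idea in the reply at the top of this message (using a
wildcard term enumeration to fill out a MultiPhraseQuery), the expansion
step might be sketched as follows. This is only a model under
assumptions: a sorted in-memory set stands in for the index's term
dictionary, and PrefixExpansionSketch and its methods are hypothetical
names. In real Lucene code the terms would come from a term enumeration
over the index, and each position's list would be handed to
MultiPhraseQuery.add(Term[]).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: expands a trailing-wildcard piece against a
// sorted set of terms that stands in for the index's term dictionary.
final class PrefixExpansionSketch {

    // All terms starting with the prefix. In a sorted set these form one
    // contiguous run beginning at the first term >= prefix.
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // left the run of matching terms
            }
            matches.add(term);
        }
        return matches;
    }

    // One list of alternative terms per phrase position. A piece ending
    // in '*' is expanded; any other piece is kept as a single term.
    static List<List<String>> phrasePositions(SortedSet<String> terms, String... pieces) {
        List<List<String>> positions = new ArrayList<>();
        for (String piece : pieces) {
            if (piece.endsWith("*")) {
                String prefix = piece.substring(0, piece.length() - 1);
                positions.add(expandPrefix(terms, prefix));
            } else {
                positions.add(Collections.singletonList(piece));
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        Collections.addAll(terms, "andrew", "anna", "annanicole", "smith");
        System.out.println(phrasePositions(terms, "smith", "ann*"));
    }
}
```

With the terms from the original example, position 0 holds only smith
and position 1 holds anna and annanicole, which is the shape a
MultiPhraseQuery expects.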