From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Tue, 5 Feb 2008 15:19:45 -0500
Subject: Re: Extracting terms from a query splitting a phrase.
Message-ID: <359a92830802051219n121f843dt7d8bb9b3aa6a1051@mail.gmail.com>
In-Reply-To: <2d72c9c50802051203v72b84bd6ic42a4b968947c6d3@mail.gmail.com>

I don't think WhitespaceAnalyzer is doing what you think it is. From the
Javadoc...

public class WhitespaceTokenizer extends CharTokenizer

A WhitespaceTokenizer is a tokenizer that divides text at whitespace.
Adjacent sequences of non-Whitespace characters form tokens.

------------------------------
CharTokenizer

An abstract base class for simple, character-oriented tokenizers.

So I'm pretty sure the double quotes never reach the tokenizer: QueryParser
consumes them as phrase syntax, and WhitespaceTokenizer then breaks the text
inside them on the space. That leaves you with a phrase query over two terms,
General and Act, and extractTerms reports each term separately.

What is it that you want to have happen? If you're searching for "General"
right next to "Act", you can use a SpanNearQuery with two SpanTermQuerys and
a slop of 0.

The other thing to be aware of with WhitespaceAnalyzer is that it doesn't
lower case anything, so whether you'll get any hits in your index depends
upon the analyzer you used at index time and whether the case matches
exactly.
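As a rough illustration of the points above, here is a minimal plain-Java sketch. It is not Lucene code: the class and method names are hypothetical, and it only mimics the ideas of splitting at whitespace without changing case and of two terms sitting directly next to each other (assumed here to mean "in that order", which is one way to read a slop of 0).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in, NOT Lucene's implementation; all names are hypothetical.
final class WhitespaceSplitSketch {

    // Splits text at whitespace. Each run of non-whitespace characters
    // becomes one token, left exactly as typed (no lower-casing).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True when `first` is immediately followed by `second`.
    // The comparison is exact, so case must match.
    static boolean adjacent(List<String> tokens, String first, String second) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(first) && tokens.get(i + 1).equals(second)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("the General Act of 2008");
        System.out.println(tokens);                              // [the, General, Act, of, 2008]
        System.out.println(adjacent(tokens, "General", "Act"));  // true
        System.out.println(adjacent(tokens, "general", "act"));  // false: case differs
    }
}
```

The last call shows why case matters: with no lower-casing on either side, "general" simply does not equal "General".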

Best
Erick

On Feb 5, 2008 3:03 PM, Spencer Tickner wrote:

> Hi List,
>
> Thanks in advance for the help. I'm trying to extract terms from a
> query. From the reading I've done a phrase such as "General Act" is
> considered a term.
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Terms .
> However when I'm doing testing to get the extractTerms of my query it
> splits this into General and Act. I'm wondering if I'm missing or not
> understanding something.
>
> My test Java code is:
>
>     private String FIELD_NAME = "rr_root";
>     private Query query;
>     private Hits hits = null;
>
>     public void testSearch() throws Exception
>     {
>         doSearching("\"General Act\"");
>         HashSet terms = new HashSet();
>         query.extractTerms(terms);
>         int i = 0;
>         for (Iterator iter = terms.iterator(); iter.hasNext();)
>         {
>             i++;
>             Term term = (Term)iter.next();
>             System.out.println(i + " " + "term-" + term.text()
>                 + " field-" + term.field());
>         }
>     }
>
>     public void doSearching(String queryString) throws Exception
>     {
>         QueryParser parser = new QueryParser(FIELD_NAME,
>             new WhitespaceAnalyzer());
>         query = parser.parse(queryString);
>         doSearching(query);
>     }
>
>     public void doSearching(Query unReWrittenQuery) throws Exception
>     {
>         // searcher coming from a cached class
>         searcher = aspect.getSearcher();
>         // reader coming from a cached class
>         query = unReWrittenQuery.rewrite(aspect.getReader());
>         System.out.println("Searching for: " + query.toString(FIELD_NAME));
>         hits = searcher.search(query);
>     }
>
> The current output is:
>
> Searching for: "General Act"
> 1 term-General field-rr_root
> 2 term-Act field-rr_root
>
> The output I expect is:
>
> Searching for: "General Act"
> 1 term-General Act field-rr_root
>
> Thanks for any help.
>
> Spencer