Message-ID: <359a92830802051219n121f843dt7d8bb9b3aa6a1051@mail.gmail.com>
Date: Tue, 5 Feb 2008 15:19:45 -0500
From: "Erick Erickson" <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Extracting terms from a query splitting a phrase.
In-Reply-To: <2d72c9c50802051203v72b84bd6ic42a4b968947c6d3@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_9796_4369939.1202242785698"
References: <2d72c9c50802051203v72b84bd6ic42a4b968947c6d3@mail.gmail.com>

------=_Part_9796_4369939.1202242785698
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

I don't think WhitespaceAnalyzer is doing what you think it is. From
the Javadoc...

public class WhitespaceTokenizer extends CharTokenizer

A WhitespaceTokenizer is a tokenizer that divides text at
whitespace. Adjacent sequences of non-Whitespace characters form tokens.

 ------------------------------

CharTokenizer
An abstract base class for simple, character-oriented tokenizers.

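So I'm pretty sure the double quotes never reach the analyzer:
QueryParser treats them as phrase syntax and builds a PhraseQuery.
WhitespaceTokenizer then breaks the quoted text on the space, so the
query holds two terms, and extractTerms() returns each one separately.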
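
A minimal sketch of the splitting step, in plain Java rather than
Lucene's own classes. The class and method names are made up for
illustration; it only mirrors the behaviour the Javadoc describes
(break at whitespace, keep every other run of characters as a token).

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {

    // Illustrative stand-in, NOT Lucene's WhitespaceTokenizer:
    // adjacent runs of non-whitespace characters become tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The phrase text yields two tokens, one per word.
        System.out.println(tokenize("General Act")); // prints [General, Act]
    }
}
```

Two tokens in means two terms out, which matches the output you're
seeing.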

What is it that you want to have happen? If you're searching for
"General" right next to "Act", you can use a SpanNearQuery with
two SpanTermQuerys and a slop of 0.
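
To make "right next to" concrete, here is a rough plain-Java sketch of
what a slop of 0 means when the terms must also appear in order. It
works on a simple token list, not on a Lucene index, and the class and
method names are hypothetical; it is not the SpanNearQuery API.

```java
import java.util.Arrays;
import java.util.List;

public class AdjacentTermsDemo {

    // True when `second` sits at the position immediately after `first`,
    // i.e. no intervening tokens (slop 0) and the given order is kept.
    static boolean adjacentInOrder(List<String> tokens, String first, String second) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(first) && tokens.get(i + 1).equals(second)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("the", "General", "Act", "applies");
        System.out.println(adjacentInOrder(doc, "General", "Act")); // prints true

        List<String> other = Arrays.asList("General", "Purposes", "Act");
        System.out.println(adjacentInOrder(other, "General", "Act")); // prints false
    }
}
```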

The other thing to be aware of with WhitespaceAnalyzer is that
it doesn't lower case anything, so whether you'll get any hits
in your index depends upon the analyzers you used to index with
and whether case matches exactly.
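
A small plain-Java illustration of the case problem, assuming (and it
is only an assumption) that the index was built with an analyzer that
lower-cases, while the query side splits on whitespace and leaves case
alone. None of the names below are Lucene's.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CaseMismatchDemo {

    // Hypothetical index-time analysis: split on whitespace, then lower-case.
    static Set<String> indexTerms(String text) {
        String[] parts = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<String>(Arrays.asList(parts));
    }

    // Query-time analysis: split on whitespace only, case untouched.
    static String[] queryTerms(String text) {
        return text.trim().split("\\s+");
    }

    // Term lookup is an exact string match, so case has to agree.
    static boolean allTermsIndexed(Set<String> index, String[] terms) {
        for (String term : terms) {
            if (!index.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("The General Act of 1996");
        System.out.println(allTermsIndexed(index, queryTerms("General Act"))); // prints false
        System.out.println(allTermsIndexed(index, queryTerms("general act"))); // prints true
    }
}
```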

Best
Erick

On Feb 5, 2008 3:03 PM, Spencer Tickner <spencertickner@gmail.com> wrote:

> Hi List,
>
> Thanks in advance for the help. I'm trying to extract terms from a
> query. From the reading I've done, a phrase such as "General Act" is
> considered a single term:
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Terms
> However, when I call extractTerms on my query in testing, it
> splits this into General and Act. I'm wondering if I'm missing or
> misunderstanding something.
>
> My test Java code is:
>
>        private String FIELD_NAME = "rr_root";
>        private Query query;
>        private Hits hits = null;
>
>        public void testSearch() throws Exception
>        {
>                doSearching("\"General Act\"");
>                HashSet terms = new HashSet();
>                query.extractTerms(terms);
>                int i = 0;
>                for (Iterator iter = terms.iterator(); iter.hasNext();)
>                {
>                        i++;
>                        Term term = (Term)iter.next();
>                        System.out.println(i + " " + "term-" + term.text()
>                                        + " field-" + term.field());
>                }
>         }
>
>        public void doSearching(String queryString) throws Exception
>        {
>                QueryParser parser = new QueryParser(FIELD_NAME,
>                                new WhitespaceAnalyzer());
>                query = parser.parse(queryString);
>                doSearching(query);
>        }
>        public void doSearching(Query unReWrittenQuery) throws Exception
>        {
>                // searcher coming from a cached class
>                searcher = aspect.getSearcher();
>                // reader coming from a cached class
>                query = unReWrittenQuery.rewrite(aspect.getReader());
>                System.out.println("Searching for: "
>                                + query.toString(FIELD_NAME));
>                hits = searcher.search(query);
>        }
>
> The current output is:
>
> Searching for: "General Act"
> 1 term-General field-rr_root
> 2 term-Act field-rr_root
>
> The output I expect is:
>
> Searching for: "General Act"
> 1 term-General Act field-rr_root
>
> Thanks for any help.
>
> Spencer

------=_Part_9796_4369939.1202242785698--