Message-ID: <48b038c60710010703u1370e6cu170d15fe1480f607@mail.gmail.com>
Date: Mon, 1 Oct 2007 10:03:15 -0400
From: "Patrick Turcotte" <patrek@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Indexing punctuation and symbols
In-Reply-To: <4700FC1C.6010707@propylon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <4700F72B.1010609@propylon.com>
	 <0C09F3B2-8C01-4C61-970E-B4673971B526@gmail.com>
	 <4700FC1C.6010707@propylon.com>

Hi,

I don't know the size of your dataset, but couldn't you index the text in
two fields with PerFieldAnalyzerWrapper, tokenizing one field with
StandardAnalyzer and the other with WhitespaceAnalyzer?

Then use a multiple-field query (there is a query parser for that; I just
don't remember the name right now).
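
Here is a rough sketch of the two-field idea. It is plain Java, not real
Lucene code: the field names "body" and "body_ws" and all the helper
methods are made up for illustration, and the two tokenizers are only crude
stand-ins for what StandardAnalyzer and WhitespaceAnalyzer do. In Lucene
itself the parser I was thinking of is probably MultiFieldQueryParser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Lucene API.
class TwoFieldSketch {

    // Crude stand-in for a punctuation-stripping analyzer: keep runs of
    // letters and digits, lower-cased, and drop everything else.
    static List<String> stripPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Crude stand-in for whitespace tokenizing: split on whitespace only,
    // so punctuation stays attached to the neighbouring word.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Index" one document into two hypothetical fields, one per tokenizer.
    static Map<String, List<String>> index(String text) {
        Map<String, List<String>> fields = new HashMap<>();
        fields.put("body", stripPunctuation(text));
        fields.put("body_ws", splitOnWhitespace(text));
        return fields;
    }

    // Multi-field lookup: the document matches if any field holds the term.
    // The term is compared as typed; a real query parser would analyze it
    // per field first.
    static boolean matches(Map<String, List<String>> doc, String term) {
        for (List<String> tokens : doc.values()) {
            if (tokens.contains(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = index("foo, bar");
        System.out.println("body    = " + doc.get("body"));
        System.out.println("body_ws = " + doc.get("body_ws"));
        System.out.println("foo  -> " + matches(doc, "foo"));
        System.out.println("foo, -> " + matches(doc, "foo,"));
        System.out.println(",    -> " + matches(doc, ","));
    }
}
```

With this setup both "foo" and "foo," find the document, each through a
different field. A bare "," still finds nothing, because neither tokenizer
emits the comma as a token of its own, so this only covers part of what you
asked for.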

Patrick

On 10/1/07, John Byrne <john.byrne@propylon.com> wrote:
> Whitespace analyzer does preserve those symbols, but not as tokens. It
> simply leaves them attached to the original term.
>
> As an example of what I'm talking about, consider a document that
> contains (without the quotes) "foo, ".
>
> Now, using WhitespaceAnalyzer, I could only get that document by
> searching for "foo,". Using StandardAnalyzer or any analyzer that
> removes punctuation, I could only find it by searching for "foo".
>
> I want an analyzer that will allow me to find it if I build a phrase
> query with the term "foo" followed immediately by ",". After all, the
> comma may be relevant to the search, but is definitely not part of the
> word.
>
> Extending StandardAnalyzer is what I had in mind, but I don't know where
> to start. I also wonder why no one seems to have done it before; it
> makes me suspect that there's some reason I haven't seen yet that makes
> it impossible or impractical.
>
>
>
> Karl Wettin wrote:
> >
> > On 1 Oct 2007, at 15:33, John Byrne wrote:
> >
> >> Has anyone written an analyzer that preserves punctuation and
> >> symbols ("£", "$", "%" etc.) as tokens?
> >
> > WhitespaceAnalyzer?
> >
> > You could also extend the lexical rules of StandardAnalyzer.
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>