From: "Tom Conlon"
Reply-To: java-user@lucene.apache.org
Subject: RE: C++ as token in StandardAnalyzer?
Date: Wed, 5 Mar 2008 11:41:34 -0000
Message-ID: <562BCF4FF1D14343BC81378BA0050AD90AB53A@SBS2003.2LSHQ.local>

Hi Donna - See previous post below that may help.
Tom

////////////////////////////////////////////////////////

Hi,

In case this is of help to others:

Crux of problem:
I wanted numbers and characters such as # and + to be considered.

Solution:
implement a LowercaseWhitespaceAnalyzer and a LowercaseWhitespaceTokenizer.

i.e.
IndexWriter writer = new IndexWriter(INDEX_DIR, new LowercaseWhitespaceAnalyzer(), true);

Tom

=======================================================================

Diagnostics:

StandardAnalyzer
----------------
Enter Querystring: (C++ AND C#)
Searching for: +c +c

Enter Querystring: (C\+\+ AND C\#)
Searching for: +c +c

Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" "sharepoint 2007") asp.net

SimpleAnalyzer
--------------
Enter Querystring: C++
Searching for: c

Enter Querystring: C#
Searching for: c

Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss or sharepoint) and "asp net"

WhitespaceAnalyzer
------------------
Enter Querystring: (C++ AND C#)
Searching for: +C++ +C#

Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" or "sharepoint 2007") and asp.net

KeywordAnalyzer
---------------
Enter Querystring: (C++ AND C#)
Searching for: +C++ +C#

Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss 2007 or sharepoint 2007) and asp.net

StopAnalyzer
------------
Enter Querystring: (C\++ AND C\#)
Searching for: +c +c

Enter Querystring: ("MOSS 2007" or "SHAREPOINT 2007") and "ASP.NET"
Searching for: (moss sharepoint) "asp net"

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

-----Original Message-----
From: Donna L Gresh
[mailto:gresh@us.ibm.com]
Sent: 04 March 2008 19:22
To: java-user@lucene.apache.org
Subject: C++ as token in StandardAnalyzer?

I saw some discussion in the archives some time ago about the fact that
C++ is tokenized as C in the StandardAnalyzer; this seems to still be
the case. I was wondering if there is a simple way for me to get the
behavior I want for C++ (that it is tokenized as C++) in particular, and
perhaps for other more idiosyncratic terms I may have in my own
application--

Thanks
Donna
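
For readers following the thread, the rule behind the suggested LowercaseWhitespaceAnalyzer / LowercaseWhitespaceTokenizer pair can be sketched in plain Java: break only on whitespace and lowercase every character, so that "+", "#", "." and digits stay inside the token. This is a rough standalone illustration, not the code from the post and not the Lucene API; the class name LowercaseWhitespaceSketch and its tokenize method are hypothetical. In Lucene the same rule would presumably sit inside a Tokenizer subclass that the analyzer returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch (no Lucene dependency): shows the
// "split on whitespace, lowercase everything else" rule on plain text.
final class LowercaseWhitespaceSketch {

    private LowercaseWhitespaceSketch() {}

    /** Splits text on whitespace only and lowercases each character kept. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace is the only token boundary.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Symbols such as '+', '#' and '.' are kept as-is;
                // letters are lowercased.
                current.append(Character.toLowerCase(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ AND C#"));        // [c++, and, c#]
        System.out.println(tokenize("MOSS 2007 ASP.NET")); // [moss, 2007, asp.net]
    }
}
```

Note that the sketch treats its input as plain text, so "AND" is just another token here; in a real search the query parser would interpret operators before the analyzer sees the terms. The same analyzer needs to be used at index time and at query time, otherwise "c++" in the query will not match what was indexed.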