lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tom Conlon" <t...@2ls.com>
Subject RE: C++ as token in StandardAnalyzer?
Date Wed, 05 Mar 2008 11:41:34 GMT
Hi Donna - See previous post below that may help. Tom
////////////////////////////////////////////////////////
Hi,

In case this is of help to others:

Crux of problem: 
I wanted numbers and characters such as # and + to be considered.

Solution:
implement a LowercaseWhitespaceAnalyzer and a
LowercaseWhitespaceTokenizer.

i.e.
IndexWriter writer = new IndexWriter(INDEX_DIR, new
LowercaseWhitespaceAnalyzer(), true);

Tom
=======================================================================
Diagnostics:

StandardAnalyzer
----------------
Enter Querystring: (C++ AND C#)      Searching for: +c +c
Enter Querystring: (C\+\+ AND C\#)   Searching for: +c +c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" "sharepoint 2007") asp.net

SimpleAnalyser
--------------
Enter Querystring: C++ Searching for: c
Enter Querystring: C#  Searching for: c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss or sharepoint) and "asp net"

WhitespaceAnalyzer
------------------
Enter Querystring: (C++ AND C#)  Searching for: +C++ +C# Enter
Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" or "sharepoint 2007") and asp.net

KeywordAnalyzer
---------------
Enter Querystring: (C++ AND C#) Searching for: +C++ +C# Enter
Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss 2007 or sharepoint 2007) and asp.net

StopAnalyzer
------------
Enter Querystring: (C\++ AND C\#)  Searching for: +c +c Enter
Querystring: ("MOSS 2007" or "SHAREPOINT 2007") and "ASP.NET"
Searching for: (moss sharepoint) "asp net"
 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
 

-----Original Message-----
From: Donna L Gresh [mailto:gresh@us.ibm.com] 
Sent: 04 March 2008 19:22
To: java-user@lucene.apache.org
Subject: C++ as token in StandardAnalyzer?

I saw some discussion in the archives some time ago about the fact that 
C++ is tokenized as C in the StandardAnalyzer; this seems to still be 
C++ the
case; I was wondering if there is a simple way for me to get the
behavior I want for C++ (that it is tokenized as C++) in particular, and
perhaps for other more ideosyncratic terms I may have in my own
application-- Thanks Donna



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message