lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ganesh" <emailg...@yahoo.co.in>
Subject Re: Lucene Analyzer that can handle C++ vs C#
Date Tue, 15 Dec 2009 11:33:19 GMT
How about KeywordAnalyzer? It will treat C++ and C# as single term. 

Regards
Ganesh

----- Original Message ----- 
From: "Chris Lu" <chris.lu@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Saturday, December 12, 2009 5:27 AM
Subject: Re: Lucene Analyzer that can handle C++ vs C#


> What we did in DBSight is to provide a reserved list of words for every 
> Lucene Analyzer.
> This way you can handle any special characters like C++ and C#.
> 
> Any common analyzers usually are not suitable for these special words.
> 
> --
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million
Euro funding!
> 
> 
> On 12/11/2009 9:09 AM, maxSchlein wrote:
>> Can someone please point me in the right direction.
>>
>> We are creating an application that needs to beable to search on C++ and get
>> back doc's that have C++ in it.  The StandardAnalyzer does not seem to index
>> the "+", so a search for "C++" will bring back docs that contain, C++, C,
>> C#, etc.....  The WhiteSpaceAnalyzer will index the "+", but if we have the
>> term "C++." that is, if C++ is at the end of a sentence, it will index
>> "C++." so a search for "C++" will not return the doc.  I have heard of maybe
>> a CustomAnalyzer; however, it seems like there would actually need to be a
>> CustomFilter/CustomTokenizer, I looked at:
>>       - StandardAnalyzer.java
>>       - StandardFilter.java
>>       - StandardTokenizer.java
>>       - StandardTokenizerImpl.java
>>       - StandardTokenizerImpl.jflex
>>
>> I would guess that the StandardTokenizer is where the changes would need to
>> be made to allow the "+" character, but I am unclear as to how.
>>
>> Any and all help is greatly appreciated.
>>
>> Going thru all the documents, stripping out "+" for the word "plus" is not
>> really an option for us.
>>    
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
Send instant messages to your online friends http://in.messenger.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message