lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: C++ as token in StandardAnalyzer?
Date Tue, 04 Mar 2008 21:18:18 GMT
Almost by definition, you have to write your own analyzer for this. That may
be as simple as chaining another filter into one of the regular analyzers, or
as complex as defining your own grammar.

As far as I know, there's no "keep word" list, but that would be an
interesting addition: a variety of analyzer that takes not only a list of
stop words but also a list of "keep words", that is, words that should NOT
be massaged at all. I can imagine that this would get pretty tricky for,
say, StandardAnalyzer, but something like this in a chain of
WhitespaceTokenizer >> LowerCaseFilter >> KeepwordFilter (the last being a
filter you would write yourself) might be useful...
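
To make the keep-word idea concrete, here is a minimal sketch in plain Java.
It deliberately uses no Lucene classes, so it only illustrates the logic of
the chain (split on whitespace, lowercase, then consult a keep list) and is
not a drop-in analyzer. The class name KeepWordSketch, the method analyze,
and the "strip everything that is not a letter or digit" rule standing in
for the usual massaging are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: not a Lucene Analyzer or TokenFilter.
class KeepWordSketch {

    // keepWords is assumed to hold lowercase entries, e.g. "c++".
    static List<String> analyze(String text, Set<String> keepWords) {
        List<String> tokens = new ArrayList<>();
        // Step 1: whitespace tokenization.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty piece; skip it
            }
            // Step 2: lowercase the token.
            String lower = raw.toLowerCase(Locale.ROOT);
            // Step 3: keep words pass through untouched; everything else
            // loses its punctuation (a stand-in for normal analysis).
            if (keepWords.contains(lower)) {
                tokens.add(lower);
            } else {
                String stripped = lower.replaceAll("[^\\p{L}\\p{N}]+", "");
                if (!stripped.isEmpty()) {
                    tokens.add(stripped);
                }
            }
        }
        return tokens;
    }
}
```

With a keep list containing "c++", the token "C++" comes out as "c++";
without it, the same token is reduced to "c". One limitation of a sketch
this simple: a keep word with punctuation glued to it, such as "C++,", does
not match the list and gets massaged like any other token.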

All this is right off the top of my head without much thought, but...

Best
Erick

On Tue, Mar 4, 2008 at 2:22 PM, Donna L Gresh <gresh@us.ibm.com> wrote:

> I saw some discussion in the archives some time ago about the fact that
> C++ is tokenized as C in the StandardAnalyzer; this seems to still be the
> case; I was wondering if there is a simple way for me to get the behavior
> I want for C++ (that it is tokenized as C++) in particular, and perhaps
> for other more idiosyncratic terms I may have in my own application--
> Thanks
> Donna
>
>
>
