lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Tue, 18 Oct 2011 14:25:43 GMT
Hi Paul,

On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
> On 18/10/2011 06:19, Steven A Rowe wrote:
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
> > etc.,
> 
> Yes that is how I first did it

No, I don't think you did.  When I say "char filter" I'm referring to CharFilter <http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html>
- this is a different kind of thing from the token filter approach you described taking previously.

Lucene Analyzers may be composed of three different kinds of components: 

* CharFilter: character-level filter; precedes the tokenizer; allows for character stream
  modifications while enabling original character offsets to be maintained (to enable e.g.
  highlighting).  Input: character stream; output: character stream.  An analyzer may
  contain zero or more of these.

* Tokenizer: identifies character sequences that will serve as (the basis of) indexable
  tokens.  Input: character stream; output: token stream.  An analyzer must contain exactly
  one of these.

* TokenFilter: token-level filter; follows the Tokenizer; transforms, adds and/or removes
  tokens to/from the token stream.  Input: token stream; output: token stream.  An analyzer
  may contain zero or more of these.
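
To make the ordering of the three stages concrete, here is a toy model in plain Java. It deliberately does not use the Lucene classes: strings and lists stand in for character streams and token streams, and the class name, method names and the "&" substitution are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: not Lucene API. Strings stand in for character streams,
// lists of strings stand in for token streams.
final class AnalysisChainSketch {

    // Char-filter role: characters in, characters out (runs before tokenizing).
    static String charFilter(String text) {
        return text.replace("&", " and ");
    }

    // Tokenizer role: characters in, tokens out (here: split on whitespace).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Token-filter role: tokens in, tokens out (here: lowercase each token).
    static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // The chain: zero or more char filters, exactly one tokenizer,
    // zero or more token filters.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }
}
```

Calling AnalysisChainSketch.analyze("Fish & Chips") yields the tokens fish, and, chips.  The point is only that the character-level substitution happens before any token exists, which is what distinguishes it from a token filter.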

> > but only when the entire input consists exclusively of whitespace and
> > punctuation.
> 
> but I couldn't work out how to only do it when exclusively whitespace and
> punctuation, any ideas to solve that?

If you go with a CharFilter, you can give it access to the entire input at once, and use a
regular expression (or something like it) to assess the input and then behave accordingly.
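
A rough sketch of that idea, written against plain java.io rather than the CharFilter API itself: read the whole input, test it with a regular expression, and substitute placeholder words only when the input is nothing but whitespace and punctuation.  The class and method names are hypothetical, only the two placeholders mentioned earlier in the thread are handled, and the offset correction a real CharFilter would need is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene. The same logic could sit inside
// a CharFilter subclass, which would also have to handle offset correction.
final class PunctuationOnlyRewriter {

    // Whole input is whitespace and ASCII punctuation, with at least one
    // punctuation character.
    private static final Pattern ONLY_WS_AND_PUNCT =
            Pattern.compile("\\s*\\p{Punct}[\\s\\p{Punct}]*");

    // Buffers the entire input in memory so it can be assessed at once.
    static String readFully(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the input unchanged unless it is punctuation-only, in which
    // case '!' and '.' are replaced by placeholder words.
    static Reader rewrite(Reader in) {
        String text = readFully(in);
        if (!ONLY_WS_AND_PUNCT.matcher(text).matches()) {
            return new StringReader(text);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '!': out.append(" PUNCT-EXCLAMATION "); break;
                case '.': out.append(" PUNCT-PERIOD "); break;
                default:  out.append(c); // other characters pass through
            }
        }
        return new StringReader(out.toString());
    }
}
```

With this, an input like "Hello!" passes through untouched, while "!." comes out as placeholder words that a tokenizer can then pick up.  Buffering the whole input is fine for short fields but may not be what you want for large documents.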

Steve
