lucene-dev mailing list archives

From Wolfgang Hoschek <whosc...@lbl.gov>
Subject Re: contrib: keywordTokenStream
Date Wed, 04 May 2005 05:14:27 GMT
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:

> Wolfgang,
>
> I've now added this.

Thanks :-)

> I'm not seeing how this could be generally useful.  I'm curious how 
> you are using it and why it is better suited for what you're doing 
> than any other analyzer.
>
> "keyword tokenizer" is a bit overloaded terminology-wise, though - 
> look in the contrib/analyzers/src/java area to see what I mean.
>
>     Erik

The difference between this and the KeywordTokenizer from 
contrib/analyzers is that it

- Can operate on multiple keywords rather than just a single one, so 
it's slightly more general.
- Takes a collection (typically of String values) as input rather 
than a Reader. I can see the java.io.Reader scalability rationale used 
throughout the analysis APIs, but for many use cases (including my own) 
Strings are a lot handier (and more efficient to deal with) - the 
string values are small anyway.

So it's a convenient way to add terms (keywords if you like) that have 
already been parsed/massaged into string(s) by some existing external 
means (e.g. grouped regex scanning of legacy-formatted text files into 
various fields, etc.) into an index "as is", without any further 
transforming analysis. Most folks could write such a (non-essential) 
utility themselves, but it's handy in the same way that the 
Field.Keyword convenience infrastructure is...
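
To make the idea concrete, here is a minimal, self-contained sketch of 
such a stream. It is not the contributed code: Token and TokenStream 
below are hypothetical stand-ins for the Lucene analysis classes, and 
the names keywordTokenStream and terms, as well as the one-character 
gap between offsets, are assumptions made for illustration. Each 
element of the collection comes out as exactly one token, unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class KeywordStreamSketch {

    /** Hypothetical stand-in for an analysis token: term text plus character offsets. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical pull-style stream: next() returns null once exhausted. */
    interface TokenStream {
        Token next();
    }

    /** Emits each element of the collection as a single token, "as is". */
    static TokenStream keywordTokenStream(final Collection<?> keywords) {
        final Iterator<?> iter = keywords.iterator();
        return new TokenStream() {
            private int start = 0;

            public Token next() {
                if (!iter.hasNext()) {
                    return null;
                }
                Object obj = iter.next();
                if (obj == null) {
                    throw new IllegalArgumentException("keyword must not be null");
                }
                String term = obj.toString(); // no lowercasing, splitting or stemming
                Token token = new Token(term, start, start + term.length());
                // Assumption: offsets behave as if keywords were joined by one separator char.
                start += term.length() + 1;
                return token;
            }
        };
    }

    /** Drains the stream and collects the term texts. */
    static List<String> terms(Collection<?> keywords) {
        List<String> out = new ArrayList<String>();
        TokenStream stream = keywordTokenStream(keywords);
        for (Token t = stream.next(); t != null; t = stream.next()) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [New York, 2005-05-04, lucene-dev]
        System.out.println(terms(Arrays.asList("New York", "2005-05-04", "lucene-dev")));
    }
}
```

Note that "New York" stays a single term, which is the point: the 
values were already split into fields upstream, so nothing here 
tokenizes them again.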

> "keyword tokenizer" is a bit overloaded terminology-wise, though

If you come up with a better name feel free to rename it.

Wolfgang.

