Hi ayyanar,
I should have mentioned in my previous email that the general@lucene.apache.org mailing list
has very few subscribers - you'll get a much better response on the java-user@l.a.o mailing
list.
On 01/05/2009 at 3:07 PM, ayyanar wrote:
> My objective is to retain the keyword (input stream) as is a token like
> a keyword tokenizer does and also split the keyword by whitespace and
> maintain that tokens as a white space tokenizer does
Right, ShingleFilter won't do this for you.
The following, if used to filter WhitespaceTokenizer's output, is similar to what you want.
Note that it is untested, and that it assumes you're using Lucene v2.4.0 rather than a recent
trunk version, which includes the new TokenStream API introduced with LUCENE-1422: <https://issues.apache.org/jira/browse/LUCENE-1422>
-----
import java.io.IOException;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Extends CachingTokenFilter to output a space-separated-
 * concatenated-all-input-stream-terms token, followed by
 * all of the original input stream tokens.
 * One for all and (then) all for one!
 */
public class ThreeMusketeersFilter extends CachingTokenFilter {
  // True once the concatenated token has been returned
  private boolean concatenatedTokenOutput = false;

  public ThreeMusketeersFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    assert reusableToken != null;
    if (concatenatedTokenOutput) {
      // Second pass: replay the cached individual tokens
      return super.next(reusableToken);
    } else {
      concatenatedTokenOutput = true;
      Token firstToken = super.next(reusableToken);
      if (firstToken == null) {
        return null;
      }
      StringBuffer buffer = new StringBuffer();
      // termBuffer() may be longer than the term itself, so append
      // only the first termLength() chars
      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
      int start = firstToken.startOffset();
      int end = firstToken.endOffset();
      for (Token nextToken = super.next(reusableToken) ;
           nextToken != null ;
           nextToken = super.next(reusableToken)) {
        end = nextToken.endOffset();
        buffer.append(' '); // add a space between terms
        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
      }
      reusableToken.clear();
      reusableToken.resizeTermBuffer(buffer.length());
      reusableToken.setTermLength(buffer.length());
      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
      reusableToken.setStartOffset(start);
      reusableToken.setEndOffset(end);
      super.reset(); // Rewind input stream to get the individual tokens
      return reusableToken;
    }
  }

  public void reset() throws IOException {
    super.reset();
    concatenatedTokenOutput = false;
  }
}
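
To make the intended output concrete, here is a small plain-Java sketch of the token sequence I'd expect the filter to emit for whitespace-tokenized input: first the space-joined concatenation of all terms, then each original term in order. It has no Lucene dependency and doesn't model offsets or position increments; the class and method names (MusketeersSketch, expectedTokens) are made up for illustration and are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; hypothetical names, not part of Lucene. */
final class MusketeersSketch {

  /** Returns the space-joined concatenation of all terms, then each term. */
  static List<String> expectedTokens(String text) {
    // Mimic whitespace tokenization, dropping empty pieces
    List<String> terms = new ArrayList<String>();
    for (String term : text.trim().split("\\s+")) {
      if (term.length() > 0) {
        terms.add(term);
      }
    }
    List<String> output = new ArrayList<String>();
    if (terms.isEmpty()) {
      return output; // no input tokens: nothing is emitted
    }
    // Join the terms with a single space, as the filter does
    StringBuilder joined = new StringBuilder();
    for (String term : terms) {
      if (joined.length() > 0) {
        joined.append(' ');
      }
      joined.append(term);
    }
    output.add(joined.toString()); // the "all for one" token comes first
    output.addAll(terms);          // then the individual tokens
    return output;
  }

  public static void main(String[] args) {
    // prints [quick brown fox, quick, brown, fox]
    System.out.println(expectedTokens("quick brown fox"));
  }
}
```

Note that the joined token always uses a single space between terms, regardless of how much whitespace separated them in the original input, so it may not be character-for-character identical to what a keyword tokenizer would produce.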