lucene-java-user mailing list archives

From Rebecca Watson <bec.wat...@gmail.com>
Subject Re: Stop words filter
Date Wed, 23 Jun 2010 03:20:36 GMT
i guess you are using lucene 2.9 or below if you're still talking about
Tokens...

here's some old code i used to use (not sure if i wrote it or grabbed it from
online examples - it's been a while since i used it!).
it grabs the set of tokens given a field name + the text to analyse.
any class that extends it can use it, e.g. for a per-field analyzer too:

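import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public abstract class GenAnalyzer extends Analyzer {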
	
	/**
	 * lucene Analyzer object
	 * @see org.apache.lucene.analysis.Analyzer
	 */
	protected Analyzer gan;
	
	/**
	 * A method to split text into tokens which are returned in the form of
	 * a TokenStream object. The text is read in using the java.io.Reader
	 * object. As analysers can be field specific the name of the field
	 * is also provided to the method.
	 *
	 * @see org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader)
	 * @param fieldName the name of the lucene field
	 * @param reader A Reader object containing string to split into tokens
	 * @return a TokenStream that represents the string split into tokens
	 * based on the field name (maybe field specific analyser).
	 */
	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		return gan.tokenStream(fieldName, reader);
	}
	
	/**
	 * A method to split text into tokens which are returned in the form of
	 * a Token[]. The text is read in as a string.
	 * As analysers can be field specific the name of the field
	 * is also provided to the method.
	 *
	 * similar to the tokenStream method except that the parameters
	 * and return type differ.
	 *
	 * @param fieldName the name of the lucene field
	 * @param text the text to be split into tokens
	 * @return a Token[] which represents the split text tokens.
	 * @throws IOException may be thrown by the stream.next(token) call.
	 *
	 * @see org.apache.lucene.analysis.Token
	 */
	public Token[] getTokens(String fieldName, String text)
	throws IOException {
		TokenStream stream = gan.tokenStream(fieldName, new StringReader(text));
		ArrayList<Token> tokenList = new ArrayList<Token>();
		Token token = new Token();
		while(true){
			token = stream.next(token);
			if (token == null) break;
			tokenList.add((Token) token.clone());
		}
		//stream.end();
		return tokenList.toArray(new Token[0]);
	}
}
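
a note on the token.clone() call in getTokens: stream.next(token) is handed a
token to reuse, so storing the returned reference directly can leave every
list entry pointing at the same object. the sketch below shows that pitfall
in plain java. Tok and Stream are hypothetical stand-ins written for this
illustration only, they are not lucene classes and do not reproduce lucene's
behaviour in detail:

```java
import java.util.ArrayList;
import java.util.List;

class ReuseDemo {
    /** Hypothetical stand-in for a mutable, reusable token (NOT Lucene's Token). */
    static class Tok implements Cloneable {
        String text;

        @Override
        public Tok clone() {
            try {
                return (Tok) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: Tok is Cloneable
            }
        }
    }

    /** Hypothetical stand-in for a stream that refills the caller's token. */
    static class Stream {
        private final String[] words;
        private int pos = 0;

        Stream(String text) {
            words = text.trim().split("\\s+");
        }

        /** Overwrites the given token and returns it, or null when exhausted. */
        Tok next(Tok reusable) {
            if (pos >= words.length) return null;
            reusable.text = words[pos++];
            return reusable;
        }
    }

    /** Collects token texts, either copying each token or storing the shared one. */
    static List<String> collect(String text, boolean copy) {
        Stream stream = new Stream(text);
        List<Tok> kept = new ArrayList<Tok>();
        Tok tok = new Tok();
        while ((tok = stream.next(tok)) != null) {
            kept.add(copy ? tok.clone() : tok);
        }
        // Read the texts only after the loop, as a caller of getTokens would.
        List<String> texts = new ArrayList<String>();
        for (Tok t : kept) texts.add(t.text);
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(collect("quick brown fox", true));  // [quick, brown, fox]
        System.out.println(collect("quick brown fox", false)); // [fox, fox, fox]
    }
}
```

without the copy, all three entries are the one shared object, so they all
show the last word that was written into it.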

hope that helps, i haven't used this code for a while but it worked
when i used it last!

in lucene 2.9 the stream.next(token) method is deprecated in favour of the
attribute-based incrementToken() API... and if you move to lucene 3 the old
Token-based methods are removed, so all this code will need to be ported...

thanks :)

bec

On 23 June 2010 10:49, Vinicius Carvalho <viniciusccarvalho@gmail.com> wrote:
> Hello there! I've been using lucene as a Full Text Search solution for some
> time. And although I'm familiar with Analyzers and Stemmers I never used
> them directly.
>
> I'm testing a few experiments on Sentiment Analysis and our implementation
> needs to perform stemming and stop word removal. I thought using lucene
> built-in support to spare me some coding time.
>
> Is there any example? I'm trying
>
> TokenStream stream = analyzer.tokenStream("", new StringReader(inputStr));
>
> Problem is that I could not find a way to get the result tokens. I was
> expecting something like stream.getTokens:Token[] :P
>
> Could someone point me in the right direction?
>
> Regards
>
> --
> The intuitive mind is a sacred gift and the
> rational mind is a faithful servant. We have
> created a society that honors the servant and
> has forgotten the gift.
>
