lucene-java-user mailing list archives

From joe_coder
Subject Lucene Tokenizer + Merge terms
Date Mon, 17 Aug 2009 07:24:37 GMT

I am using a custom analyzer:

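    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the input into word tokens.
        StandardTokenizer tokenStream = new StandardTokenizer(reader);

        // Fold accented characters to their ASCII equivalents.
        TokenStream result = new ASCIIFoldingFilter(tokenStream);
        result = new StandardFilter(result);
        // Drop tokens shorter than 3 or longer than maxTokenLength characters.
        result = new LengthFilter(result, 3, maxTokenLength);
        result = new LowerCaseFilter(result);
        // Remove stop words, keeping position increments enabled.
        result = new StopFilter(true, result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }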

My question is about creating a new tokenizer that can detect names of
people, places, etc. (I can look these up in my local db). E.g., if a text
contains "Joe Coder is in New York", then instead of the terms
[Joe][Coder][New][York], I would like to get the terms
[Joe Coder][New York].

Is there any tokenizer in Lucene that I can extend to get this
behavior? Any other pointers?
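
To make the desired behavior concrete, here is a minimal sketch of the
merge step in plain Java. It is not a Lucene API: the class PhraseMerger,
the method merge, and the parameter maxWords are hypothetical names, and
the knownPhrases set stands in for the local db lookup. It does a greedy
longest match over an already tokenized list; inside an analyzer, the same
logic would presumably sit in a custom TokenFilter placed right after the
tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PhraseMerger {

    /**
     * Greedy longest-match merge. At each position, try the longest
     * window of up to maxWords tokens first; if the joined text is a
     * known phrase, emit it as one token, otherwise emit the single token.
     * knownPhrases is assumed to hold lower-cased, space-separated phrases
     * (a stand-in for the local db lookup).
     */
    public static List<String> merge(List<String> tokens,
                                     Set<String> knownPhrases,
                                     int maxWords) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String matched = tokens.get(i);
            int matchedLength = 1;
            int longest = Math.min(maxWords, tokens.size() - i);
            for (int n = longest; n >= 2; n--) {
                String candidate = String.join(" ", tokens.subList(i, i + n));
                if (knownPhrases.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matched = candidate;
                    matchedLength = n;
                    break;
                }
            }
            merged.add(matched);
            i += matchedLength; // skip the tokens consumed by the phrase
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("joe coder", "new york"));
        List<String> tokens = Arrays.asList("Joe", "Coder", "is", "in", "New", "York");
        System.out.println(merge(tokens, known, 3));
        // prints [Joe Coder, is, in, New York]
    }
}
```

The sketch leaves "is" and "in" in place; in the analyzer above they would
be dropped later by the LengthFilter and StopFilter. Merging has to happen
before lower-casing and stemming only if the lookup depends on the
original form; here the lookup lower-cases the candidate itself.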