lucene-java-user mailing list archives

From joe_coder <codetester.codetes...@gmail.com>
Subject Lucene Tokenizer + Merge terms
Date Mon, 17 Aug 2009 07:24:37 GMT

I am using a custom analyzer:


    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the input into tokens, capping the token length.
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        tokenStream.setMaxTokenLength(maxTokenLength);

        // Fold accented characters to their ASCII equivalents.
        TokenStream result = new ASCIIFoldingFilter(tokenStream);
        result = new StandardFilter(result);
        // Keep only tokens of 3 to maxTokenLength characters.
        result = new LengthFilter(result, 3, maxTokenLength);
        result = new LowerCaseFilter(result);
        // true = preserve position increments for removed stop words.
        result = new StopFilter(true, result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }

My question is about creating a new tokenizer that can detect people's
names, place names, etc. (I can look these up in my local DB to find such
cases). For example, if the text is "Joe Coder is in New York", then instead
of term vectors [Joe][Coder][New][York], I would like to have term vectors
[Joe Coder][New York].

Is there any tokenizer in Lucene that I can extend to provide this
functionality? Any other pointers?
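
To make the intended behaviour concrete, here is a minimal sketch of the
merge step in plain Java. It does not use the Lucene API: the class name
PhraseMerger, the merge method and the maxWords parameter are hypothetical,
and the in-memory Set stands in for the local DB lookup. It does a greedy
longest-match over the token list; in an analyzer the same logic would
presumably sit in a custom filter placed before the lowercasing and stemming
steps, so that the lookup sees the original names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseMerger {

    /**
     * Greedy longest-match merge. At each position, try the longest window
     * (at most maxWords tokens) whose space-joined text is a known phrase.
     * If none matches, emit the single token unchanged.
     */
    public static List<String> merge(List<String> tokens,
                                     Set<String> knownPhrases,
                                     int maxWords) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 0;
            int longest = Math.min(maxWords, tokens.size() - i);
            for (int len = longest; len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (knownPhrases.contains(candidate)) {
                    out.add(candidate);   // emit the merged term
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                out.add(tokens.get(i));   // no phrase starts here
                consumed = 1;
            }
            i += consumed;
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the local DB of people and place names (assumption).
        Set<String> known = new HashSet<>(Arrays.asList("Joe Coder", "New York"));
        List<String> tokens = Arrays.asList("Joe", "Coder", "is", "in", "New", "York");
        System.out.println(merge(tokens, known, 3));
        // prints [Joe Coder, is, in, New York]
    }
}
```

The remaining "is" and "in" would then be dropped by the length and stop
filters further down the chain, leaving [Joe Coder][New York].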