Date: Fri, 14 Dec 2007 15:40:16 -0800 (PST)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Subject: Re: Basic Named Entity Indexing
In-Reply-To: <14291880.post@talk.nabble.com>
Message-ID: <Pine.LNX.4.62.0712141529270.541@radix.cryptio.net>
References: <14291880.post@talk.nabble.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


: a) index the documents by wrapping the whitespace analyzer with
: ngramanalyzerwrapper and then retrieving only the words which have 3 or more
: characters and start with a capital, filtering the "garbage" manually.
: b) creating my own analyzer which will only index ngrams that start with
: capital letters and then retrieving the indexed words.

: how would i go about creating my own analyzer? (i've read lucene in action
: and it wasn't much help :s)

Start by writing yourself a "NamedEntityTokenFilter" ... look at the
StopFilter to give yourself an idea what it should look like ... whenever
someone calls "next()" on your filter, keep calling "next()" on whatever
TokenStream you've got, until you get something you consider a "named
entity" and then return it (or return null once the wrapped stream runs
out of tokens).
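
A rough sketch of that skip-until-match loop, in plain Java so it runs
without Lucene on the classpath. SimpleTokenStream and ListTokenStream are
made-up stand-ins, not Lucene classes, and tokens are plain Strings here
rather than Lucene Tokens. The "3 or more characters, starts with a
capital" test is only the heuristic from the question; swap in whatever you
consider a named entity. A real filter would extend Lucene's TokenFilter
instead, so check the StopFilter source for the exact shape in your version.

```java
import java.util.Iterator;
import java.util.List;

// Stand-in for a token stream (NOT a Lucene class): next() returns null when exhausted.
interface SimpleTokenStream {
    String next();
}

// Hypothetical filter: passes through only tokens that look like named entities.
class NamedEntityTokenFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;

    NamedEntityTokenFilter(SimpleTokenStream input) {
        this.input = input;
    }

    // Assumed heuristic taken from the question: 3+ characters, leading capital.
    static boolean looksLikeNamedEntity(String term) {
        return term.length() >= 3 && Character.isUpperCase(term.charAt(0));
    }

    @Override
    public String next() {
        // Keep pulling from the wrapped stream until something qualifies.
        for (String t = input.next(); t != null; t = input.next()) {
            if (looksLikeNamedEntity(t)) {
                return t;
            }
        }
        return null; // wrapped stream is exhausted
    }
}

// Helper for trying the filter out: serves tokens from a fixed list.
class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> it;

    ListTokenStream(List<String> terms) {
        this.it = terms.iterator();
    }

    @Override
    public String next() {
        return it.hasNext() ? it.next() : null;
    }
}
```

Wrapping a ListTokenStream in the filter and calling next() until it
returns null yields only the tokens that pass the heuristic, in their
original order.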

An Analyzer is any class which takes in a Reader and outputs Tokens
... typically they are really simple and just delegate the hard
work to a Tokenizer and 0 or more TokenFilters ... if you look at the
source code for the "tokenStream" method of most analyzers in Lucene
you'll see it can be really easy to write one by reusing an existing
Tokenizer (it sounds like you want to tokenize on whitespace) and your new
TokenFilter.
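
To show how little the analyzer itself has to do, here is a self-contained
sketch of the wiring in plain Java. Every type in it (TokenSource,
WhitespaceSource, KeepFilter, NamedEntityAnalyzerSketch) is a made-up
stand-in, not a Lucene class. The sketch takes a java.io.Reader as input,
and the capital-letter test is again just the heuristic from the question.
With Lucene you would extend Analyzer and chain a whitespace tokenizer with
your filter inside "tokenStream"; check the javadocs for the exact
signature in your version.

```java
import java.io.Reader;
import java.util.Scanner;
import java.util.function.Predicate;

// Stand-in for a token stream (NOT a Lucene class): next() returns null at end of input.
interface TokenSource {
    String next();
}

// Stand-in for a whitespace tokenizer; Scanner splits on whitespace by default.
class WhitespaceSource implements TokenSource {
    private final Scanner scanner;

    WhitespaceSource(Reader reader) {
        this.scanner = new Scanner(reader);
    }

    @Override
    public String next() {
        return scanner.hasNext() ? scanner.next() : null;
    }
}

// Generic filter stage: skips tokens until one satisfies the predicate.
class KeepFilter implements TokenSource {
    private final TokenSource input;
    private final Predicate<String> keep;

    KeepFilter(TokenSource input, Predicate<String> keep) {
        this.input = input;
        this.keep = keep;
    }

    @Override
    public String next() {
        String t;
        while ((t = input.next()) != null) {
            if (keep.test(t)) {
                return t;
            }
        }
        return null;
    }
}

// The "analyzer" does no real work itself; it only wires tokenizer and filter together.
class NamedEntityAnalyzerSketch {
    TokenSource tokenStream(String fieldName, Reader reader) {
        // fieldName is unused here; it is kept only to mirror the per-field idea.
        return new KeepFilter(
            new WhitespaceSource(reader),
            t -> t.length() >= 3 && Character.isUpperCase(t.charAt(0)));
    }
}
```

All the interesting decisions live in the filter predicate; the analyzer
is a two-line factory, which is the point about delegation made above.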


-Hoss

