Date: Fri, 14 Dec 2007 15:40:16 -0800 (PST)
From: Chris Hostetter
To: java-user@lucene.apache.org
Subject: Re: Basic Named Entity Indexing
In-Reply-To: <14291880.post@talk.nabble.com>

: a) index the documents by wrapping the whitespace analyzer with
: ngramanalyzerwrapper and then retrieving only the words which have 3 or more
: characters and start with a capital, filtering the "garbage" manually.

: b) creating my own analyzer which will only index ngrams that start with
: capital letters and then retrieving the indexed words.

: how would i go about creating my own analyzer? (i've read lucene in action
: and it wasn't much help :s)

Start by writing yourself a "NamedEntityTokenFilter" ... look at the
StopFilter to give yourself an idea of what it should look like ...
whenever someone calls "next()" on your filter, keep calling "next()" on
whatever TokenStream you've got, until you get something you consider a
"named entity" and then return it (or until the wrapped stream runs out,
in which case you return null as well).

An Analyzer is any class which takes in a Reader and outputs a
TokenStream of Tokens ... typically they are really simple and just
delegate the hard work to a Tokenizer and 0 or more TokenFilters ... if
you look at the source code for the "tokenStream" method of most
analyzers in Lucene you'll see it can be really easy to write one by
reusing an existing Tokenizer (it sounds like you want to tokenize on
whitespace) and your new TokenFilter.

-Hoss
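The filter-plus-tokenizer pattern described in the reply can be sketched without the Lucene jar on the classpath. What follows is a minimal, library-free illustration, not Lucene code: SimpleTokenStream, SimpleWhitespaceTokenizer and NamedEntityAnalyzer are hypothetical stand-ins for Lucene's TokenStream, WhitespaceTokenizer and Analyzer, and the "3 or more characters, starts with a capital" test is only the heuristic from the question, assumed here as the definition of a named entity. A real implementation would presumably extend Lucene's own TokenFilter and Analyzer classes instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's TokenStream:
// next() returns the next token, or null when the input is exhausted.
interface SimpleTokenStream {
    String next() throws IOException;
}

// Hypothetical stand-in for a whitespace tokenizer:
// reads characters from the Reader and splits them on whitespace.
class SimpleWhitespaceTokenizer implements SimpleTokenStream {
    private final Reader input;

    SimpleWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace((char) c)) {
                if (sb.length() > 0) {
                    break; // whitespace after some characters ends the token
                }
                // otherwise: still skipping leading whitespace
            } else {
                sb.append((char) c);
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}

// The filter from the reply: keep pulling tokens from the wrapped stream
// until one looks like a named entity, then return it.
class NamedEntityTokenFilter implements SimpleTokenStream {
    private final SimpleTokenStream input;

    NamedEntityTokenFilter(SimpleTokenStream input) {
        this.input = input;
    }

    public String next() throws IOException {
        String token;
        while ((token = input.next()) != null) {
            if (looksLikeNamedEntity(token)) {
                return token;
            }
        }
        return null; // wrapped stream is exhausted
    }

    // Assumption: the heuristic from the question (3+ chars, leading capital).
    static boolean looksLikeNamedEntity(String token) {
        return token.length() >= 3 && Character.isUpperCase(token.charAt(0));
    }
}

// The "analyzer": simply chains an existing tokenizer and the new filter.
class NamedEntityAnalyzer {
    SimpleTokenStream tokenStream(Reader reader) {
        return new NamedEntityTokenFilter(new SimpleWhitespaceTokenizer(reader));
    }

    // Convenience helper for trying the chain out on a String.
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        try {
            SimpleTokenStream stream =
                new NamedEntityAnalyzer().tokenStream(new StringReader(text));
            String token;
            while ((token = stream.next()) != null) {
                result.add(token);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Calling NamedEntityAnalyzer.analyze("Yesterday Chris met an IBM engineer in San Francisco CA") returns the tokens Yesterday, Chris, IBM, San and Francisco. "CA" is dropped by the length check, and "Yesterday" slips through because sentence-initial words are capitalized too, so the heuristic would likely need refining for real text.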