Message-ID: <40606FC8.3070709@apache.org>
Date: Tue, 23 Mar 2004 09:11:36 -0800
From: Doug Cutting
To: Lucene Developers List
Subject: Re: Token declared final ?
In-Reply-To: <405CFE0B.9070606@cs.york.ac.uk>

The 'type' field of Token would be a good place for Part-of-Speech.
Does that work for you? If not, perhaps we should make Token non-final.

As has been discussed before, Lucene uses final for two reasons.
The first is historical: long ago it used to make things faster by
permitting javac to inline things. The second is that some classes are
not designed to be subclassed, e.g., subclassing Field or Document will
generally confuse an application more than it simplifies it.

The problem is that it is sometimes hard to determine which case is
which.

Doug

Thimal Jayasooriya wrote:
> Hi all:
> I have a question about the class structure of Tokens and
> Tokenizers. Apologies, it's a bit long-winded :)
>
> As part of my Master's research, I'm trying to use Lucene to store
> different semantic classes found within documents. For this, I need to
> first split sentences and then generate part of speech (POS) information
> for each significant word found within a particular document. Through
> separate libraries, I've already done the splitting and tagging tasks.
>
> When I looked at the source for Token
> (org.apache.lucene.analysis.Token), however, I found that it has been
> declared final. I had intended to subclass Token to also keep a POS
> marker and use it later within the Analyzer. Could someone please give
> me some information on why Token was declared as final? I am sure I've
> missed something, but I can't see what it is. Alternatively, does it
> make more sense to store the POS information elsewhere? I would
> probably need it at index time only.
>
> My original intention was to extend the Tokenizer
> (org.apache.lucene.analysis.Tokenizer), get POS information, add it to
> the token and then do the normal consumption of punctuation and so on
> with JavaCC. Punctuation is necessary to recognize some named entities,
> so I need to do this before those tokens are consumed. Is there a better
> / more logical place to perform POS tagging?
>
> Thanks,
> Thimal
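
As a rough illustration of the suggestion to carry part-of-speech in a
token's type field rather than in a subclass, here is a minimal Java
sketch. SimpleToken is a hypothetical stand-in and not Lucene's Token
class; the tag() helper, the Penn-style tag names and the single-space
offset arithmetic are likewise assumptions made only for this example.
The real Token constructor and accessors should be checked against the
Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token class (NOT Lucene's Token): a term,
// its character offsets, and a free-form type string.
final class SimpleToken {
    private final String text;
    private final int start;
    private final int end;
    private final String type;

    SimpleToken(String text, int start, int end, String type) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.type = type;
    }

    String termText()  { return text; }
    int startOffset()  { return start; }
    int endOffset()    { return end; }
    String type()      { return type; }
}

class PosTypeSketch {
    // Pairs each word with a POS tag computed elsewhere and stores the
    // tag in the token's type field, so no subclass is needed.
    static List<SimpleToken> tag(String[] words, String[] posTags) {
        if (words.length != posTags.length) {
            throw new IllegalArgumentException("one POS tag per word is required");
        }
        List<SimpleToken> tokens = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < words.length; i++) {
            int end = offset + words[i].length();
            tokens.add(new SimpleToken(words[i], offset, end, posTags[i]));
            offset = end + 1; // assumption: words are separated by one space
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] words = {"Lucene", "indexes", "text"};
        String[] tags  = {"NNP", "VBZ", "NN"}; // assumed tag set
        for (SimpleToken t : tag(words, tags)) {
            // A downstream filter could branch on t.type() at index time.
            System.out.println(t.termText() + "/" + t.type());
        }
    }
}
```

Because the tag travels with the token as plain data, later stages of an
analysis chain can read it without knowing about any custom token class.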