lucene-dev mailing list archives

From Thimal Jayasooriya <thi...@cs.york.ac.uk>
Subject Token declared final ?
Date Sun, 21 Mar 2004 02:29:31 GMT
Hi all:
     I have a question about the class structure of Tokens and 
Tokenizers. Apologies, it's a bit long-winded :)

    As part of my Masters research, I'm trying to use Lucene to store 
different semantic classes found within documents. For this, I need to 
first split sentences and then generate part of speech (POS) information 
for each significant word found within a particular document. Through 
separate libraries, I've already done the splitting and tagging tasks.

    When I looked at the source for Token 
(org.apache.lucene.analysis.Token), however, I found that it has been 
declared final. I had intended to subclass Token to also keep a POS 
marker and use it later within the Analyzer. Could someone please give 
me some information on why Token was declared final? I am sure I've 
missed something, but I can't see what it is. Alternatively, does it 
make more sense to store the POS information elsewhere? I would 
probably need it at index time only.
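
    To make the "store it elsewhere" option concrete, here is a minimal 
sketch that uses composition instead of subclassing. It does not use the 
Lucene API: SimpleToken is a hypothetical stand-in for a final token class, 
and TaggedToken and PosTaggingSketch are names invented for illustration. 
It assumes the tagger returns exactly one tag per token, in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a final token class; NOT Lucene's actual Token.
final class SimpleToken {
    private final String text;
    private final int start;
    private final int end;

    SimpleToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    String text() { return text; }
    int start() { return start; }
    int end() { return end; }
}

// Wraps a token and carries the POS tag beside it, so the token class
// itself never needs to be extended.
final class TaggedToken {
    private final SimpleToken token;
    private final String pos;

    TaggedToken(SimpleToken token, String pos) {
        this.token = token;
        this.pos = pos;
    }

    SimpleToken token() { return token; }
    String pos() { return pos; }
}

class PosTaggingSketch {
    // Pairs each token with the tag at the same index.
    // Assumption: one tag per token, in the same order.
    static List<TaggedToken> pair(List<SimpleToken> tokens, List<String> tags) {
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("need exactly one tag per token");
        }
        List<TaggedToken> result = new ArrayList<TaggedToken>();
        for (int i = 0; i < tokens.size(); i++) {
            result.add(new TaggedToken(tokens.get(i), tags.get(i)));
        }
        return result;
    }
}
```

    Since the POS information is only needed at index time, the wrapper 
could be discarded once the tag has been used, and only the plain token 
passed on.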

     My original intention was to extend the Tokenizer 
(org.apache.lucene.analysis.Tokenizer), get POS information, add it to 
the token and then do the normal consumption of punctuation and so on 
with JavaCC. Punctuation is necessary to recognize some named entities, 
so I need to do this before those tokens are consumed. Is there a better, 
more logical place to perform POS tagging?
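
    The ordering I have in mind (tag first, drop punctuation afterwards) 
can be sketched roughly as follows. This is plain Java rather than Lucene 
or JavaCC code; TaggedWord and PunctuationFilterSketch are hypothetical 
names, and treating any text without a letter or digit as punctuation is 
an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical word-plus-tag pair produced by an earlier tagging step.
final class TaggedWord {
    final String text;
    final String pos;

    TaggedWord(String text, String pos) {
        this.text = text;
        this.pos = pos;
    }
}

class PunctuationFilterSketch {
    // True when the text contains no letter or digit
    // (so the empty string counts as punctuation too).
    static boolean isPunctuation(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Runs after tagging: the tagger has already seen the punctuation,
    // so it can be dropped here without losing named-entity cues.
    static List<TaggedWord> dropPunctuation(List<TaggedWord> tagged) {
        List<TaggedWord> kept = new ArrayList<TaggedWord>();
        for (TaggedWord word : tagged) {
            if (!isPunctuation(word.text)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

    A token such as "U.S." survives this filter because it contains 
letters, while a bare comma is removed only after the tagger has seen it.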

Thanks,
Thimal

-- 
Thimal Jayasooriya,
Department of Computer Science,
The University of York
http://www.cs.york.ac.uk/~thimal/

