Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy includes SPF record at
 spf.trusted-forwarder.org)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
  s=s1024; d=yahoo.co.uk;
  h=Message-ID:Received:Date:From:Reply-To:Subject:To:In-Reply-To:MIME-Version:Content-Type;
  b=ieERY3jhUkQHEGBd/uuoe3q7xaUvmCZNV+/GCFqyAeYPVgnsSToMtRz/rXzvwezNULw2Zwi0rDOUZ0SP8eEMzJpfcyY2Yf8r1a9XxMiAcT7XBlsGULaC6Oq3spBSgldzAXw3oWLLoc5Xvd/OTNTymlAeL/DUAMZWW+DOzCg6Z/o=
  ;
Message-ID: <20060712115026.38897.qmail@web26002.mail.ukl.yahoo.com>
Date: Wed, 12 Jul 2006 11:50:26 +0000 (GMT)
From: mark harwood <markharw00d@yahoo.co.uk>
Reply-To: mark harwood <markharw00d@yahoo.co.uk>
Subject: Re: Storing Part of Speech information in Lucene Indices
To: java-user@lucene.apache.org
In-Reply-To: <761A6386-F9CE-460B-A0B0-95B30CB225F7@uiuc.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Could you not use a custom analyzer to inject "metadata" tokens into the index at the same position as the source tokens?

For example, given the text:
    The     cat     jumped     over     the     dog
your analyzer could emit tokens:
    [the]     [cat,_posNoun] [jumped,_posVerb]     [over]     [the]     [dog,_posNoun]

where the "_pos...." tokens have a zero position increment to effectively associate them with the term to which they relate (this is how the example SynonymTokenizer in the highlighter package works). The "_pos" prefix is used as a uniquefier for metadata tokens to avoid any name-clashes with any real content tokens.

Theoretically you could then construct queries where the queries mixed both data and your part-of-speech metadata eg you could use the position information based queries to find out what things normally have a particular verb applied to them:
     "jumped  _posNoun"~3
  or what verbs are commonly associated with a dog (caution advised here):
    "_posVerb the dog"~3
or to use an ambiguous word in a particular context/sense
    "_posVerb track"~1
 

Cheers,
Mark

----- Original Message ----
From: Amit Kumar <amitku@uiuc.edu>
To: java-user@lucene.apache.org
Cc: Amit Kumar <amitku@uiuc.edu>
Sent: Wednesday, 12 July, 2006 6:36:24 AM
Subject: Storing Part of Speech information in Lucene Indices

Hi,

A new project that I am investigating lucene for needs the  Parts of  
speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing the  
documents but I would like to  store that
information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k e r  
s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to store  
the offset information along with parts of speech
outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it was a  
great help to see the CategorizerTest.java.
I am ordering my copy of the book tomorrow :)

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org