lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@syr.edu>
Subject Re: Storing Part of Speech information in Lucene Indices
Date Wed, 12 Jul 2006 11:39:35 GMT
Hi Amit,

This is definitely something you can do. What are your goals for
it? Do you want to search by word and POS, or do you just want the
POS available for post-processing?

You could append the POS tag to the end of each token as it gets
indexed, something like foo_NN or foo_ADJ. With this approach you
may have to use a prefix query when you want to search against
just "foo". You could also add a field parallel to your main field
that stores the POS. Then you could access it via the term
vectors array.
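
To make the two layouts concrete, here is a rough sketch in plain Java.
It deliberately uses no Lucene calls; the class PosTokens and its
methods are hypothetical names made up for illustration, and the "_"
separator is an assumption taken from the foo_NN example above. It only
shows what the indexed tokens would look like and what a prefix match on
them would accept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only illustrates the two layouts.
class PosTokens {
    // Assumed separator, following the foo_NN example.
    static final char SEP = '_';

    // Option 1: fold the tag into the token text, e.g. "foo" + "NN" -> "foo_NN".
    static String tag(String word, String pos) {
        return word + SEP + pos;
    }

    // Mimics what a prefix match on "foo_" would accept: the word with any tag.
    // Including the separator keeps "foobar_NN" from matching "foo".
    static boolean matchesWord(String indexedToken, String word) {
        return indexedToken.startsWith(word + SEP);
    }

    // Option 2: keep words and tags in two position-aligned lists, the way a
    // parallel field would hold one POS tag per token position.
    static List<String> tagsFor(List<String> words, List<String> tags, String word) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equals(word)) {
                found.add(tags.get(i));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("time", "flies", "like", "flies");
        List<String> tags = Arrays.asList("NN", "VBZ", "IN", "NNS");
        System.out.println(tag("foo", "NN"));
        System.out.println(matchesWord("foobar_NN", "foo"));
        System.out.println(tagsFor(words, tags, "flies"));
    }
}
```

One caveat with the suffix layout: if a word can itself contain the
separator, the prefix match gets ambiguous, so you would want a
separator that your tokenizer never lets through.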

Also, we have been discussing on the developers list how to add
payloads to a posting (i.e. store related information at a position
in the index), similar to what Google describes in their original
paper. Unfortunately, this isn't implemented yet, but if you feel
like helping out, check out the discussion on the developers list
(see Flexible Indexing).

-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:

> Hi,
>
> A new project that I am investigating Lucene for needs the
> part-of-speech information for the tokens. I can get that
> information using NLP techniques (GATE etc.) by preprocessing
> the documents, but I would like to store that
> information in the indices. Something along the lines of
>
> TermVectorOffsetInfo[?].getPartofSpeech();
>
> I am writing to ask for your advice. You can tell me I am
> b o n k e r s or let me know where I should start digging :).
> Is that a good idea? Or would it be less trouble for me to
> store the offset information along with the parts of speech
> outside Lucene?
>
> Has anyone else done that?
>
> Best,
> Amit
>
>
> ps: Thank you for putting the LuceneInAction source online, it was  
> a great help to see the CategorizerTest.java.
> I am ordering my copy of the book tomorrow :)
>
> ---------------------------------------------------------
> Amit Kumar
> Research Programmer
> The Graduate School of Library and Information Science
> University of Illinois, Urbana Champaign IL, 61820
> phone: 217-333-4118 fax: 217-244-3302
> ---------------------------------------------------------



--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886




