From: Grant Ingersoll <gsingers@syr.edu>
Subject: Re: Storing Part of Speech information in Lucene Indices
Date: Wed, 12 Jul 2006 12:50:45 -0400
To: java-user@lucene.apache.org

I think Mark's idea is better for this.  That said, I seem to recall
some caveats with multiple tokens at the same position, but
I don't remember the details.  I _think_ term vectors don't like it,
so if you need them, you might have trouble.  A search of
the mailing lists and JIRA might turn up something, or maybe someone
else remembers.  At any rate, it may not affect you, so I would try
Mark's suggestion and see if it works.
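
To make the "multiple tokens at the same position" idea concrete, here
is a rough sketch.  It assumes the suggestion is to index the POS tag as
a second token at the same position as the word (a position increment
of 0).  This is a toy in-memory model, not Lucene's API: the class name
StackedPosIndex, the "_POS_" marker prefix and all method names are made
up for illustration, and it says nothing about how term vectors behave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy positional index for one document; hypothetical, NOT Lucene's API. */
public class StackedPosIndex {
    // term -> list of positions at which the term occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int position = -1;

    /** Adds a token. Increment 1 moves to a new position; 0 stacks on the previous one. */
    public void add(String term, int positionIncrement) {
        position += positionIncrement;
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
    }

    /** Indexes a word, then its POS tag (with a marker prefix) at the same position. */
    public void addTagged(String word, String pos) {
        add(word, 1);
        add("_POS_" + pos, 0);
    }

    public List<Integer> positions(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    /** Frequency of the bare word; no summing over foo_NN, foo_ADJ variants. */
    public int termFrequency(String word) {
        return positions(word).size();
    }

    /** True if an occurrence of a lies within slop positions of a different position of b. */
    public boolean near(String a, String b, int slop) {
        for (int pa : positions(a)) {
            for (int pb : positions(b)) {
                if (pa != pb && Math.abs(pa - pb) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Start positions where the given terms occur at consecutive positions. */
    public List<Integer> sequence(String... terms) {
        List<Integer> starts = new ArrayList<>();
        for (int start : positions(terms[0])) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions(terms[i]).contains(start + i);
            }
            if (ok) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        StackedPosIndex idx = new StackedPosIndex();
        idx.addTagged("old", "ADJ");   // position 0
        idx.addTagged("dog", "NN");    // position 1
        idx.addTagged("lazy", "ADJ");  // position 2
        idx.addTagged("dog", "NN");    // position 3
        System.out.println(idx.termFrequency("dog"));
        System.out.println(idx.near("_POS_ADJ", "dog", 3));
        System.out.println(idx.sequence("_POS_ADJ", "_POS_NN", "_POS_ADJ"));
    }
}
```

In this model the bare word keeps its own term frequency, a proximity
check covers "any adjective within 3 words of a term", and the ADJ NN
ADJ pattern becomes a match on consecutive positions.  In Lucene itself
the rough equivalents would be Token.setPositionIncrement(0) in a custom
TokenFilter and span queries at search time, but treat that mapping as
my assumption until you have tried it.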

-Grant

On Jul 12, 2006, at 11:15 AM, Amit Kumar wrote:

> We need to be able to search by word and POS and also have POS
> available for each occurrence.  Appending POS to the terms will
> create a post-processing nightmare when retrieving term
> frequencies, right?  (I would have to add up the counts for foo_NN,
> foo_ADJ, etc.)
>
> I can store the POS in a parallel field and access it via term
> vectors, but that wouldn't allow any kind of search on POS-related
> fields, right?  For example, what if I wanted to search for any
> adjective within 3 words of a given term, or to get
> all the patterns that follow the sequence ADJ NN ADJ?
>
> Let me look in the developer archives for the payload discussions,  
> perhaps implementing that might satisfy my use cases.
>
> Comments?
>
> -Thanks
> Amit
>
>
>
> On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:
>
>> Hi Amit,
>>
>> This is definitely something you can do.   What are your goals for  
>> it?  Do you want to search by word and POS or do you just want POS  
>> available for post processing?
>>
>> You could just append the POS tag onto the end of your token as it
>> gets indexed, something like foo_NN or foo_ADJ.  This approach may
>> mean you have to use a prefix query (on "foo_") when you want to
>> search against just "foo".  You could also have a field parallel to
>> your main field that stores the POS.  Then you could access it via
>> the term vectors array.
>>
>> Also, we have been discussing on the developers list how to add
>> payloads to a posting (i.e. store related information at a
>> position in the index), similar to what Google describes in their
>> original paper.  Unfortunately, this isn't implemented yet, but if
>> you feel like helping out, check out the discussion on the
>> developers list (see Flexible Indexing).
>>
>> -Grant
>>
>> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>>
>>> Hi,
>>>
>>> A new project that I am investigating Lucene for needs
>>> part-of-speech information for the tokens.  I can get that
>>> information using NLP techniques (GATE etc.) by preprocessing
>>> the documents, but I would like to store that
>>> information in the indices.  Something along the lines of
>>>
>>> TermVectorOffsetInfo[?].getPartofSpeech();
>>>
>>> I am writing to ask for your advice.  You can tell me I am b o n k
>>> e r s or let me know where I should start digging :).
>>> Is that a good idea?  Or would it be less trouble for me to
>>> store the offset information along with parts of speech
>>> outside Lucene?
>>>
>>> Has anyone else done that?
>>>
>>> Best,
>>> Amit
>>>
>>>
>>> ps: Thank you for putting the LuceneInAction source online, it  
>>> was a great help to see the CategorizerTest.java.
>>> I am ordering my copy of the book tomorrow :)
>>>
>>> ---------------------------------------------------------
>>> Amit Kumar
>>> Research Programmer
>>> The Graduate School of Library and Information Science
>>> University of Illinois, Urbana Champaign IL, 61820
>>> phone: 217-333-4118 fax: 217-244-3302
>>> ---------------------------------------------------------
>>>
>> --------------------------
>> Grant Ingersoll
>> Sr. Software Engineer
>> Center for Natural Language Processing
>> Syracuse University
>> 335 Hinds Hall
>> Syracuse, NY 13244
>> http://www.cnlp.org
>>
>> Voice: 315-443-5484
>> Fax: 315-443-6886
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886
