lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amit Kumar <ami...@uiuc.edu>
Subject Re: Storing Part of Speech information in Lucene Indices
Date Wed, 12 Jul 2006 15:59:51 GMT
You are right. I saw your email after pressing  send. Let me  
experiment. Thanks for the tip.

Best,
Amit

On Jul 12, 2006, at 10:55 AM, mark harwood wrote:

>>> Appending POS to the terms will create post processing nightmare
>
> I think you may have missed the subtle distinction between Grant's  
> suggestion and mine.
>
> His suggestion was to append your POS info to the source token -  
> creating a single token which combined both the original content  
> and your POS info.
> My suggestion was to generate *two* tokens - one is the original  
> source token, untouched, the other is the POS token but the trick  
> is to set the "position increment" to 0 for the POS token. This  
> ensures that the POS token is considered to occur at exactly the  
> same location as the source token. The original source token is  
> untampered with and has all the usual stats in the index.The POS  
> token and/or source token can be used interchangeably in queries on  
> the same field so you can use position info eg phrase or span queries.
>
> Cheers
> Mark
>
>
>
> ----- Original Message ----
> From: Amit Kumar <amitku@uiuc.edu>
> To: java-user@lucene.apache.org
> Sent: Wednesday, 12 July, 2006 4:15:34 PM
> Subject: Re: Storing Part of Speech information in Lucene Indices
>
> We need to be able to search by word and POS and also have POS
> available for each occurrence.  Appending POS to the terms will
> create post processing nightmare to retrieve
> term frequencies right? (I would have to add all the foo_NN and
> foo_ADJ etc.).
>
> I can store the POS in a parallel field and access it via term
> vectors, but that wouldn't allow any kind of search on POS related
> fields right?  For example if I wanted to search for any
> adjective with in 3 words of say a term or say If I wanted to get all
> the patterns that follow the sequence ADJ NN ADJ.
>
> Let me look in the developer archives for the payload discussions,
> perhaps implementing that might satisfy my use cases.
>
> Comments?
>
> -Thanks
> Amit
>
>
>
> On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:
>
>> Hi Amit,
>>
>> This is definitely something you can do.   What are your goals for
>> it?  Do you want to search by word and POS or do you just want POS
>> available for post processing?
>>
>> You could just append the POS tag onto the end of your token as it
>> gets indexed, something like foo_NN or foo_ADJ.  This approach may
>> mean you have to use prefix query when you want to search against
>> just "foo".    You could also have a parallel field to your main
>> field that stores the POS.  Then you could access it via the term
>> vectors array.
>>
>> Also, we have been discussing on the developers list on how to add
>> payloads to a posting (i.e. store related information at a position
>> in the index) similar to what Google discusses in their original
>> paper.  Unfortunately, this isn't implemented yet, but if you feel
>> like helping out, check out the discussion on the developer's list
>> (see Flexible Indexing).
>>
>> -Grant
>>
>> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>>
>>> Hi,
>>>
>>> A new project that I am investigating lucene for needs the  Parts
>>> of speech information for the tokens. I can  get that
>>> information using NLP techniques  (GATE etc.), by pre processing
>>> the documents but I would like to  store that
>>> information in the Indices. Something along the lines of
>>>
>>> TermVectorOffsetInfo[?].getPartofSpeech();
>>>
>>> I am writing to ask for your advice, you can tell me I am b o n k
>>> e r s  or let me know where I should start digging :).
>>> Is that a good idea? Or would it be just less trouble for me to
>>> store the offset information along with parts of speech
>>> outside Lucene.
>>>
>>> Has anyone else done that?
>>>
>>> Best,
>>> Amit
>>>
>>>
>>> ps: Thank you for putting the LuceneInAction source online, it was
>>> a great help to see the CategorizerTest.java.
>>> I am ordering my copy of the book tomorrow :)
>>>
>>> ---------------------------------------------------------
>>> Amit Kumar
>>> Research Programmer
>>> The Graduate School of Library and Information Science
>>> University of Illinois, Urbana Champaign IL, 61820
>>> phone: 217-333-4118 fax: 217-244-3302
>>> ---------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --------------------------
>> Grant Ingersoll
>> Sr. Software Engineer
>> Center for Natural Language Processing
>> Syracuse University
>> 335 Hinds Hall
>> Syracuse, NY 13244
>> http://www.cnlp.org
>>
>> Voice: 315-443-5484
>> Fax: 315-443-6886
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------
> Amit Kumar
> Research Programmer
> The Graduate School of Library and Information Science
> University of Illinois, Urbana Champaign IL, 61820
> phone: 217-333-4118 fax: 217-244-3302
> ---------------------------------------------------------
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------





Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message