From: Grant Ingersoll
Subject: Re: Storing Part of Speech information in Lucene Indices
Date: Wed, 12 Jul 2006 12:50:45 -0400
To: java-user@lucene.apache.org

I think Mark's idea is better for this.
I do seem to recall some caveats with multiple tokens at the same
position, though I don't remember the details. I _think_ term vectors
don't like it, so if you need them, you might have trouble. A search
of the mailing lists and JIRA might turn up something, or maybe
someone else remembers. At any rate, it may not affect you, so I would
try Mark's suggestion and see if it works.

-Grant

On Jul 12, 2006, at 11:15 AM, Amit Kumar wrote:

> We need to be able to search by word and POS and also have POS
> available for each occurrence. Appending POS to the terms will
> create a post-processing nightmare to retrieve
> term frequencies, right? (I would have to add all the foo_NN and
> foo_ADJ etc.)
>
> I can store the POS in a parallel field and access it via term
> vectors, but that wouldn't allow any kind of search on POS-related
> fields, right? For example, if I wanted to search for any
> adjective within 3 words of, say, a term, or if I wanted to get
> all the patterns that follow the sequence ADJ NN ADJ.
>
> Let me look in the developer archives for the payload discussions;
> perhaps implementing that might satisfy my use cases.
>
> Comments?
>
> -Thanks
> Amit
>
> On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:
>
>> Hi Amit,
>>
>> This is definitely something you can do. What are your goals for
>> it? Do you want to search by word and POS, or do you just want POS
>> available for post-processing?
>>
>> You could just append the POS tag onto the end of your token as it
>> gets indexed, something like foo_NN or foo_ADJ. This approach may
>> mean you have to use a prefix query when you want to search against
>> just "foo". You could also have a parallel field to your main
>> field that stores the POS. Then you could access it via the term
>> vectors array.
>>
>> Also, we have been discussing on the developers list how to add
>> payloads to a posting (i.e. store related information at a
>> position in the index), similar to what Google discusses in their
>> original paper. Unfortunately, this isn't implemented yet, but if
>> you feel like helping out, check out the discussion on the
>> developer's list (see Flexible Indexing).
>>
>> -Grant
>>
>> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>>
>>> Hi,
>>>
>>> A new project that I am investigating Lucene for needs the part
>>> of speech information for the tokens. I can get that
>>> information using NLP techniques (GATE etc.) by pre-processing
>>> the documents, but I would like to store that
>>> information in the indices. Something along the lines of
>>>
>>> TermVectorOffsetInfo[?].getPartofSpeech();
>>>
>>> I am writing to ask for your advice; you can tell me I am b o n k
>>> e r s or let me know where I should start digging :).
>>> Is that a good idea? Or would it be just less trouble for me to
>>> store the offset information along with parts of speech
>>> outside Lucene?
>>>
>>> Has anyone else done that?
>>>
>>> Best,
>>> Amit
>>>
>>> ps: Thank you for putting the LuceneInAction source online, it
>>> was a great help to see the CategorizerTest.java.
>>> I am ordering my copy of the book tomorrow :)
>>>
>>> ---------------------------------------------------------
>>> Amit Kumar
>>> Research Programmer
>>> The Graduate School of Library and Information Science
>>> University of Illinois, Urbana Champaign IL, 61820
>>> phone: 217-333-4118 fax: 217-244-3302
>>> ---------------------------------------------------------
>>
>> --------------------------
>> Grant Ingersoll
>> Sr. Software Engineer
>> Center for Natural Language Processing
>> Syracuse University
>> 335 Hinds Hall
>> Syracuse, NY 13244
>> http://www.cnlp.org
>>
>> Voice: 315-443-5484
>> Fax: 315-443-6886
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------
> Amit Kumar
> Research Programmer
> The Graduate School of Library and Information Science
> University of Illinois, Urbana Champaign IL, 61820
> phone: 217-333-4118 fax: 217-244-3302
> ---------------------------------------------------------

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
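
A rough sketch of the "multiple tokens at the same position" idea mentioned
at the top of this message. Mark's message is not quoted in this thread, so
this is only a guess at the general approach, and it is deliberately not
Lucene code: PosIndexSketch and all of its methods are hypothetical names for
a toy in-memory positional index. It indexes each word and its POS tag at the
same position, then shows how the three needs raised above could be served:
term frequency for the bare word, "adjective within N words of a term", and
the tag sequence ADJ NN ADJ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy positional index, NOT Lucene. All names are hypothetical. */
public class PosIndexSketch {
    // term (a word, or a POS tag such as "ADJ") -> sorted positions.
    // Assumption: tag names never collide with real words; a real index
    // would namespace the tags somehow.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextPosition = 0;

    /** Index a word and its POS tag at the same position. */
    public void add(String word, String pos) {
        postings.computeIfAbsent(word, k -> new TreeSet<>()).add(nextPosition);
        postings.computeIfAbsent(pos, k -> new TreeSet<>()).add(nextPosition);
        nextPosition++;
    }

    private TreeSet<Integer> positions(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    /** Frequency of the bare word; no summing over foo_NN, foo_ADJ, ... */
    public int frequency(String word) {
        return positions(word).size();
    }

    /** Positions where the word occurs with the given POS tag. */
    public List<Integer> wordWithPos(String word, String pos) {
        List<Integer> hits = new ArrayList<>(positions(word));
        hits.retainAll(positions(pos));
        return hits;
    }

    /** Positions tagged pos within slop positions of a different token equal to word. */
    public List<Integer> posNear(String pos, String word, int slop) {
        List<Integer> hits = new ArrayList<>();
        for (int p : positions(pos)) {
            for (int q : positions(word).subSet(p - slop, true, p + slop, true)) {
                if (q != p) {          // skip the token itself
                    hits.add(p);
                    break;
                }
            }
        }
        return hits;
    }

    /** Start positions where the tags occur at consecutive positions. */
    public List<Integer> sequence(String... tags) {
        List<Integer> hits = new ArrayList<>();
        for (int start : positions(tags[0])) {
            boolean match = true;
            for (int i = 1; i < tags.length && match; i++) {
                match = positions(tags[i]).contains(start + i);
            }
            if (match) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PosIndexSketch index = new PosIndexSketch();
        String[][] tagged = {
            {"big", "ADJ"}, {"dog", "NN"}, {"red", "ADJ"},
            {"light", "NN"}, {"light", "ADJ"}, {"rain", "NN"}
        };
        for (String[] t : tagged) {
            index.add(t[0], t[1]);
        }
        System.out.println(index.frequency("light"));
        System.out.println(index.wordWithPos("light", "ADJ"));
        System.out.println(index.posNear("ADJ", "dog", 1));
        System.out.println(index.sequence("ADJ", "NN", "ADJ"));
    }
}
```

Whether an actual Lucene index built this way behaves well with term vectors
is exactly the caveat raised above, so treat this only as an illustration of
the data layout, not as a statement about what Lucene supports.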