From: mark harwood
To: java-user@lucene.apache.org
Date: Wed, 12 Jul 2006 11:50:26 +0000 (GMT)
Subject: Re: Storing Part of Speech information in Lucene Indices
Message-ID: <20060712115026.38897.qmail@web26002.mail.ukl.yahoo.com>
In-Reply-To: <761A6386-F9CE-460B-A0B0-95B30CB225F7@uiuc.edu>

Could you not use a custom analyzer to inject "metadata" tokens into the index at the same position as the source tokens?

For example, given the text:

    The cat jumped over the dog

your analyzer could emit tokens:

    [the] [cat,_posNoun] [jumped,_posVerb] [over] [the] [dog,_posNoun]

where the "_pos...." tokens have a zero position increment, which effectively associates them with the term to which they relate (this is how the example SynonymTokenizer in the highlighter package works). The "_pos" prefix marks metadata tokens so that they cannot clash with any real content tokens.

In theory you could then construct queries that mix content terms with your part-of-speech metadata. For example, you could use position-based queries to find out what things normally have a particular verb applied to them:

    "jumped _posNoun"~3

or what verbs are commonly associated with a dog (caution advised here):

    "_posVerb the dog"~3

or to use an ambiguous word in a particular context/sense:

    "_posVerb track"~1

Cheers,
Mark

----- Original Message ----
From: Amit Kumar
To: java-user@lucene.apache.org
Cc: Amit Kumar
Sent: Wednesday, 12 July, 2006 6:36:24 AM
Subject: Storing Part of Speech information in Lucene Indices

Hi,

A new project that I am investigating Lucene for needs part-of-speech information for the tokens. I can get that information using NLP techniques (GATE etc.) by preprocessing the documents, but I would like to store that information in the indices. Something along the lines of

    TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice; you can tell me I am b o n k e r s or let me know where I should start digging :). Is that a good idea?
Or would it be less trouble for me to store the offset information along with the parts of speech outside Lucene? Has anyone else done that?

Best,
Amit

ps: Thank you for putting the LuceneInAction source online, it was a great help to see the CategorizerTest.java. I am ordering my copy of the book tomorrow :)

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign
IL, 61820
phone: 217-333-4118
fax: 217-244-3302
---------------------------------------------------------
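
The zero-position-increment idea from Mark's reply can be sketched in plain Java. This is a minimal illustration only: it does not use the Lucene API, the class and method names (PosTokenSketch, Token, inject, positions) are hypothetical, and the part-of-speech tags come from a hand-written map standing in for a real tagger such as GATE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only: this does not use the Lucene API.
// All names here (PosTokenSketch, Token, inject, positions) are hypothetical.
class PosTokenSketch {

    /** A term plus its position increment relative to the previous token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Emits each lowercased word with increment 1, followed by a "_pos..."
     * metadata token with increment 0 when the tag map has an entry for it.
     * The tag map is an assumption standing in for a real POS tagger.
     */
    static List<Token> inject(String text, Map<String, String> posTags) {
        List<Token> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input produces no tokens
            }
            String term = word.toLowerCase(Locale.ROOT);
            out.add(new Token(term, 1));
            String tag = posTags.get(term);
            if (tag != null) {
                // Increment 0 places the metadata token on top of its source term.
                out.add(new Token("_pos" + tag, 0));
            }
        }
        return out;
    }

    /** Resolves increments to absolute positions, starting from position 0. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("cat", "Noun", "jumped", "Verb", "dog", "Noun");
        List<Token> tokens = inject("The cat jumped over the dog", tags);
        List<Integer> pos = positions(tokens);
        // Print "position <tab> term" so shared positions are easy to see.
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos.get(i) + "\t" + tokens.get(i).term);
        }
    }
}
```

Because each metadata token shares a position with its source term, a proximity query such as "jumped _posNoun"~3 can match on the tag wherever the word itself would have matched. In a real Lucene analyzer the same effect would come from setting a position increment of zero on the injected token inside a custom token filter.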