Subject: Re: What is "flexible indexing" in Lucene 4.0 if it's not the
 ability to make new postings codecs?
From: SUJIT PAL
Date: Thu, 13 Dec 2012 14:29:41 -0800
To: java-user@lucene.apache.org

Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What
I did for a similar requirement was to combine the tokens into a single
"_"-delimited token and attach the payload to it.

For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three
Little Pigs down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I
would like to attach payloads. I run the tokens through a custom
tokenizer that produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the
Three_Little_Pigs$payload2 down.

In my case this makes sense, i.e. I can treat the span as a single unit.
Not sure about your use case.

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

> Cool! Sounds great! :-)
>
> Any pointers to a (Lucene) example that attaches a payload to a
> start..end span that is more than one token?
>
> thanks,
> -Glen
>
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog wrote:
>> I should not have added that note. The Opennlp patch gives a concrete
>> example of adding an annotation to text.
>>
>>
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>>
>>> It is not clear this is exactly what is needed/being discussed.
>>>
>>> From the issue:
>>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>>> the same position."
>>>
>>> This adds it to a token, not a span.
>>> 'same position' does not suggest
>>> it also records the end position.
>>>
>>> -Glen
>>>
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog wrote:
>>>>
>>>> Parts-of-speech is available now, in the indexer.
>>>>
>>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>>> Apache project for natural-language processing.
>>>>
>>>> Some parts are in Solr that could be in Lucene.
>>>>
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>>
>>>>
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>>
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look
>>>>>>> at, that would help me understand the practical issues that would
>>>>>>> need to be addressed?
>>>>>>
>>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>>> needing to record in the postings and access at search time?
>>>>>
>>>>> For example:
>>>>> - part of speech of a token.
>>>>> - syntactic parse subtree (over a span).
>>>>> - semantically normalized phrase (to canonical text or ontological
>>>>>   code).
>>>>> - semantic group (of a span).
>>>>> - coreference link.
>>>>>
>>>>> stephen
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>
> --
> -
> http://zzzoot.blogspot.com/
> -
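
Editor's sketch of the span-merging idea from Sujit's reply at the top of this message: each multi-token span is joined with "_" and its payload is appended after a "$", so the span can be indexed as a single token. This is plain Java with no Lucene dependency and is not Sujit's actual tokenizer. The names SpanPayloadMerger, Span and merge are hypothetical, and the "$" separator is simply taken from the example in the message. It assumes spans are given as token-index ranges, sorted and non-overlapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Lucene class): merges multi-token spans into single tokens. */
final class SpanPayloadMerger {

    /** A span over token indices [start, end) plus the payload to attach to it. */
    static final class Span {
        final int start;      // index of the first token in the span
        final int end;        // index one past the last token in the span
        final String payload; // payload text, appended after '$'

        Span(int start, int end, String payload) {
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    /**
     * Joins the tokens of each span with '_' and appends "$" + payload.
     * Assumption: spans are sorted by start and do not overlap.
     */
    static List<String> merge(List<String> tokens, List<Span> spans) {
        List<String> out = new ArrayList<>();
        int pos = 0; // next token index not yet emitted
        for (Span s : spans) {
            if (s.start < pos || s.end <= s.start || s.end > tokens.size()) {
                throw new IllegalArgumentException(
                        "spans must be sorted, non-empty and non-overlapping");
            }
            // Tokens before the span pass through unchanged.
            out.addAll(tokens.subList(pos, s.start));
            // The span itself becomes one token carrying its payload.
            out.add(String.join("_", tokens.subList(s.start, s.end)) + "$" + s.payload);
            pos = s.end;
        }
        // Remaining tokens after the last span.
        out.addAll(tokens.subList(pos, tokens.size()));
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                ("The Big Bad Wolf huffed and puffed and blew the house of the "
                        + "Three Little Pigs down.").split(" "));
        List<Span> spans = Arrays.asList(
                new Span(1, 4, "payload1"),    // Big Bad Wolf
                new Span(13, 16, "payload2")); // Three Little Pigs
        System.out.println(String.join(" ", merge(tokens, spans)));
    }
}
```

In a real analysis chain the "$payload" suffix would still have to be split off and stored as the token's payload. Lucene's DelimitedPayloadTokenFilter splits a token on a delimiter character and keeps the remainder as a payload, so it may be usable for that step, but whether it fits this case is an assumption. As Sujit notes, the approach only helps when the span can be treated as one unit, because the individual words are no longer indexed as separate tokens.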