Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Reply-To: java-user@lucene.apache.org
In-Reply-To: <20091009171600.GA21017@spotter-dclnx>
References: <3433bf110910090911y96d3a19ga718eeb1ff4eca08@mail.gmail.com> <20091009171600.GA21017@spotter-dclnx>
Date: Mon, 12 Oct 2009 18:35:10 +0200
Message-ID: <3433bf110910120935n947b76emddaad1a2cfcc844e@mail.gmail.com>
Subject: Re: Getting left and right offsets of term search results
From: Till Kolter
To: java-user@lucene.apache.org, David Causse

Thanks a lot. I think TermPositionVector will solve my problem, although
it seems to perform a little poorly.

Concerning the term representation: our data is far more complex than
plain phrasal annotation. That was just an example, because I am not
allowed to talk about our internal organisation. I will look into the
Payload class; it should help me come up with a solution.

On Fri, Oct 9, 2009 at 7:16 PM, David Causse wrote:
> Hi,
>
> we also index linguistic data, but (someone correct me if I'm wrong) you
> have to deal with what the lucene store is offering.
> You can store
> usable on the search side:
>  - a term (TermAttribute)
>  - the position of the term (PositionIncrementAttribute)
>  - an arbitrary payload (PayloadAttribute)
> usable when you found results:
>  - TermVector (no attribute or OffsetAttribute and/or PositionIncrementAttribute)
>  - Any data you stored in a field (arbitrary data)
>
> OffsetAttribute values are stored in the TermVector (if you specified you
> wanted it). You can't search data within the TermPositionVector, but you can
> iterate your results and ask the reader to return the TermPositionVector
> for a specific document and a field.
>
> Lucene can't store arbitrary Attributes; they are only useful in an
> analysis pipeline. If you want to search for this info, you have to
> serialize the data inside the term itself (e.g. add a char at the end of
> the term to describe the part of speech) and inside the Payload for
> position-specific info (e.g. a relation id, paragraph id or whatever you
> want: it's a byte[]).
>
> With those techniques you can do many things. You have to be inventive, but
> with payloads you can do very interesting things.
> You can also store the offsets inside the payload and not bother with
> the term vector!
> Well, there are really hundreds of solutions for dealing with linguistic data
> inside lucene. What is hard is when you have to deal with relations, but
> a triple store should be better suited to that.
>
> I also suggest storing a serialized form of your internal
> representation in the index; it may be more flexible to use than
> TermPositionVector.
>
> Hope it helps.
>
> On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
>> I am quite new to Lucene, but I have searched the FAQs and consulted
>> the mailing list archive. I debugged through the source code as well.
>>
>> I have written an Analyzer that analyzes a stream by sending it through a
>> whole pipeline of linguistic processing and uses the internal
>> representation to construct a TokenStream that tokenizes chunks
>> (semantic units). The TermAttribute string holds the abstract
>> representations of those units. For further uses (for instance,
>> highlighting the results in text), I need access to the
>> OffsetAttribute that I defined in my TokenStream implementation. As
>> in StandardTokenizer, I defined an OffsetAttribute to save the left and
>> right values of the original chunks.
>>
>> Now I want to search for all documents containing an
>> "AdjectivePhrase", get those APs from the documents and highlight all
>> APs in the found documents.
>>
>> I tried to find results by getting TermPositions with
>> "Reader.termPositions(term)" and then iterating over the positions, but
>> the positions only represent the left offset.
>>
>> Is there another function to get structured results from term queries
>> over documents, where I can get the whole set of attributes that I
>> constructed in the TokenStream with addAttribute(Class)? I did not
>> find such a function, but I guess I don't know all retrieval methods of
>> Lucene yet. For my search I used the IndexSearcher.
>>
>> Thanks
>> Till Kolter
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> --
> David Causse
> Spotter
> http://www.spotter.com/
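
The suggestion quoted above, storing the left and right offsets inside the
payload (which is just a byte[]), could be sketched roughly as follows. This
is only an illustration under assumptions: the class name OffsetPayloadCodec
and the 8-byte layout (start offset, then end offset, both big-endian ints)
are hypothetical and not part of Lucene, and the wiring into a TokenStream
via PayloadAttribute is deliberately left out.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Lucene): packs a token's start and end
 * character offsets into a byte[] that could be used as a payload.
 */
final class OffsetPayloadCodec {

    // Assumed layout: two big-endian 4-byte ints, start offset then end offset.
    static final int PAYLOAD_LENGTH = 8;

    private OffsetPayloadCodec() {}

    /** Encodes the left (start) and right (end) offsets of a chunk. */
    static byte[] encode(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "invalid offsets: " + startOffset + ", " + endOffset);
        }
        return ByteBuffer.allocate(PAYLOAD_LENGTH)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    /** Decodes a payload back into {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != PAYLOAD_LENGTH) {
            throw new IllegalArgumentException(
                "expected a payload of " + PAYLOAD_LENGTH + " bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        String text = "a very tall tree stood there";
        // Offsets of the chunk "very tall" in the original text.
        byte[] payload = encode(2, 11);
        int[] offsets = decode(payload);
        // Recover the chunk for highlighting from the decoded offsets.
        System.out.println(text.substring(offsets[0], offsets[1]));
    }
}
```

With a codec like this, the highlighting step would only need the decoded
pair to cut the chunk out of the original text, without consulting the term
vector. Whether that is faster than TermPositionVector in practice would
have to be measured.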