Reply-To: java-user@lucene.apache.org
In-Reply-To: <20091009171600.GA21017@spotter-dclnx>
References: <3433bf110910090911y96d3a19ga718eeb1ff4eca08@mail.gmail.com>
	 <20091009171600.GA21017@spotter-dclnx>
Date: Mon, 12 Oct 2009 18:35:10 +0200
Message-ID: <3433bf110910120935n947b76emddaad1a2cfcc844e@mail.gmail.com>
Subject: Re: Getting left and right offsets of term search results
From: Till Kolter <till.kolter@googlemail.com>
To: java-user@lucene.apache.org, David Causse <dcausse@spotter.com>

Thanks a lot. I think TermPositionVector will solve my problem,
although it seems to be a little slow.

Concerning the term representation: our data is far more complex than
just phrasal annotation. That was only an example, because I am not
allowed to talk about our internal organisation. I will look into the
Payload class; it should help me come up with a solution.
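
A minimal sketch of the "offsets inside the payload" idea from the quoted
reply below, assuming each chunk's left and right character offsets fit in
an int. The class name OffsetPayloadCodec and its methods are hypothetical,
not part of Lucene; the code only shows the byte[] encoding. Wrapping the
bytes in a Payload and attaching it through PayloadAttribute in the
TokenStream is left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs a chunk's left/right character offsets into 8 bytes. */
final class OffsetPayloadCodec {
    private OffsetPayloadCodec() {}

    /** Encodes the two offsets as consecutive big-endian 32-bit ints. */
    static byte[] encode(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "invalid offsets: " + startOffset + ", " + endOffset);
        }
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Decodes payload bytes back into {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            throw new IllegalArgumentException("expected exactly 8 payload bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        // Round trip for a chunk spanning characters 12 to 27.
        int[] offsets = decode(encode(12, 27));
        System.out.println(offsets[0] + ".." + offsets[1]); // prints 12..27
    }
}
```

The fixed 8-byte layout keeps decoding trivial on the search side; a
variable-length encoding would save space but is not shown here.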


On Fri, Oct 9, 2009 at 7:16 PM, David Causse <dcausse@spotter.com> wrote:
> Hi,
>
> We also index linguistic data, but (someone correct me if I'm wrong) you
> have to work with what the Lucene index offers.
> You can store the following.
> Usable on the search side:
>  - a term (TermAttribute)
>  - the position of the term (PositionIncrementAttribute)
>  - an arbitrary payload (PayloadAttribute)
> Usable once you have found results:
>  - a TermVector (no attribute, or OffsetAttribute and/or
>    PositionIncrementAttribute)
>  - any data you stored in a field (arbitrary data)
>
> Offsets (OffsetAttribute) are stored in the TermVector, if you asked for
> them at index time. You can't search the data within the
> TermPositionVector, but you can iterate over your results and ask the
> reader to return the TermPositionVector for a specific document and
> field.
>
> Lucene can't store arbitrary Attributes; they are only useful within the
> analysis pipeline. If you want to search for this info, you have to
> serialize the data inside the term itself (e.g. add a char at the end of
> the term to describe the part of speech), and inside the Payload for
> position-specific info (e.g. a relation id, paragraph id or whatever you
> want: it's a byte[]).
>
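
A small sketch of the "serialize inside the term itself" suggestion above,
assuming a separator character that never occurs in ordinary term text. The
class TaggedTerm, the '|' separator and the tag names are hypothetical
choices for illustration, not Lucene conventions.

```java
/** Hypothetical helper: appends a part-of-speech tag to the indexed term text. */
final class TaggedTerm {
    /** Assumed never to appear in the tag; the last occurrence marks the split. */
    static final char SEPARATOR = '|';

    private TaggedTerm() {}

    /** Builds the string that would be indexed as the term. */
    static String encode(String text, String posTag) {
        if (posTag.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("tag must not contain the separator");
        }
        return text + SEPARATOR + posTag;
    }

    /** Returns the original text of an encoded term. */
    static String text(String encoded) {
        return encoded.substring(0, splitIndex(encoded));
    }

    /** Returns the part-of-speech tag of an encoded term. */
    static String posTag(String encoded) {
        return encoded.substring(splitIndex(encoded) + 1);
    }

    private static int splitIndex(String encoded) {
        int i = encoded.lastIndexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("no tag separator in: " + encoded);
        }
        return i;
    }

    public static void main(String[] args) {
        String term = encode("very quick", "AP");
        System.out.println(term + " -> " + text(term) + " / " + posTag(term));
        // prints: very quick|AP -> very quick / AP
    }
}
```

Because the tag becomes part of the term, a search for a given text with a
given tag is an ordinary term query on the encoded string.
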
> With those techniques you can do many things. You have to be inventive,
> but with payloads you can do very interesting things.
> You can also store the offsets inside the payload and not bother with
> term vectors!
> Well, there are really hundreds of ways to deal with linguistic data
> inside Lucene. What is hard is dealing with relations, but
> a triple store should be better suited for that.
>
> I also suggest storing a serialized form of your internal
> representation in the index; it may be more flexible to use than
> TermPositionVector.
>
> Hope it helps.
>
> On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
>> I am quite new to Lucene, but I have searched the FAQs and consulted
>> the mailing list archive. I debugged through the source code as well.
>>
>> I have written an Analyzer that analyzes a stream by sending it through a
>> whole pipeline of linguistic processing and uses the internal
>> representation to construct a TokenStream that tokenizes chunks
>> (semantic units). The TermAttribute string holds the abstract
>> representation of those units. For further uses (for instance,
>> highlighting the results in text), I need access to the
>> OffsetAttribute that I defined in my TokenStream implementation. As
>> in StandardTokenizer, I defined an OffsetAttribute to save the left and
>> right offsets of the original chunks.
>>
>> Now I want to search for all documents containing an
>> "AdjectivePhrase", get those APs from the Documents and highlight all
>> APs in the found documents.
>>
>> I tried to find results by getting TermPositions with
>> "Reader.termPositions(term)" and then iterate over the positions, but
>> the positions only represent the left offset.
>>
>> Is there another function to get structured results from term queries
>> over documents, where I can get the whole set of attributes that I
>> constructed in the TokenStream with addAttribute(Class)? I did not
>> find such a function, but I guess I don't know all retrieval methods of
>> Lucene yet. For my search I used the IndexSearcher.
>>
>> Thanks
>> Till Kolter
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> --
> David Causse
> Spotter
> http://www.spotter.com/
