lucene-java-user mailing list archives

From Till Kolter <>
Subject Re: Getting left and right offsets of term search results
Date Mon, 12 Oct 2009 16:35:10 GMT
Thanks a lot. I think TermPositionVector will solve my problem,
although it seems to be a little slow.

Concerning the term representation: our data is way more complex than
just phrasal annotation. That was just an example, because I am not
allowed to talk about our internal organisation. I will look into the
Payload class; it should help me come up with a solution.

On Fri, Oct 9, 2009 at 7:16 PM, David Causse <> wrote:
> Hi,
> we also index linguistic data, but (someone correct me if I'm wrong) you
> have to work with what the Lucene store offers.
> You can store the following.
> Usable on the search side:
>  - a term (TermAttribute)
>  - the position of the term (PositionIncrementAttribute)
>  - an arbitrary payload (PayloadAttribute)
> Usable once you have found results:
>  - a TermVector (plain, or with offsets from OffsetAttribute and/or positions from PositionIncrementAttribute)
>  - any data you stored in a field (arbitrary data)
> OffsetAttribute values are stored in the TermVector (if you asked for
> them). You can't search the data within the TermPositionVector, but you can
> iterate over your results and ask the reader to return the TermPositionVector
> for a specific document and field.
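
Once start and end offsets have been read back for a document (for example from a TermPositionVector), highlighting comes down to wrapping those character ranges in markers. The following is only a minimal sketch in plain Java. The class name OffsetHighlighter, the marker strings and the int[][] span layout are assumptions made for illustration and are not part of Lucene; the spans are assumed to be sorted by start offset and non-overlapping.

```java
// Hypothetical helper, not a Lucene class: wraps character ranges in markers.
final class OffsetHighlighter {

    // spans holds {start, end} pairs with an exclusive end offset.
    // Assumption: spans are sorted by start and do not overlap.
    static String highlight(String text, int[][] spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : spans) {
            // Copy the untouched text before the span.
            out.append(text, cursor, span[0]);
            // Copy the span itself, wrapped in the markers.
            out.append(open).append(text, span[0], span[1]).append(close);
            cursor = span[1];
        }
        // Copy whatever follows the last span.
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

For instance, highlight("a very tall tree", new int[][] {{2, 11}}, "[", "]") wraps the chunk "very tall" in brackets.
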
> Lucene can't store arbitrary Attributes; they are only useful in the
> analysis pipeline. If you want to search on this information, you have to
> serialize it inside the term itself (e.g. add a char at the end of the term
> to describe the part of speech), and inside the Payload for position-specific
> info (e.g. a relation id, paragraph id or whatever you want: it's
> a byte[]).
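
The idea of serializing a part-of-speech marker into the term text could look roughly like the sketch below. It is plain Java with no Lucene dependency. The class name TaggedTerm, the '|' separator and the tag strings are assumptions chosen for illustration; a real analyzer would need a separator that can never occur inside a term.

```java
// Hypothetical helper, not a Lucene class: packs a tag into the term text.
final class TaggedTerm {

    // Assumed separator; it must never appear inside a term or a tag.
    static final char SEPARATOR = '|';

    // Builds the string that would be indexed as the term, e.g. "green|ADJ".
    static String encode(String term, String posTag) {
        return term + SEPARATOR + posTag;
    }

    // Recovers the bare term; returns the input unchanged if it carries no tag.
    static String termOf(String encoded) {
        int i = encoded.lastIndexOf(SEPARATOR);
        return i < 0 ? encoded : encoded.substring(0, i);
    }

    // Recovers the tag; returns an empty string if the input carries no tag.
    static String tagOf(String encoded) {
        int i = encoded.lastIndexOf(SEPARATOR);
        return i < 0 ? "" : encoded.substring(i + 1);
    }
}
```

Because the tag becomes part of the indexed term, a query has to use the same encoded form, so a search for the adjective "green" would look for TaggedTerm.encode("green", "ADJ").
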
> With those techniques you can do many things. You have to be inventive, but
> with payloads you can do very interesting things.
> You can also store the offsets inside the payload and not bother with
> term vectors at all!
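
Storing offsets in a payload means turning the start and end offset into a byte[] at index time and decoding it again at search time. A minimal sketch of that round trip, assuming a fixed layout of two 4-byte big-endian ints, is shown below. The class name OffsetPayload and the layout are assumptions for illustration; in Lucene the resulting byte[] is what would be wrapped in a Payload and set through PayloadAttribute.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not a Lucene class: converts offsets to and from bytes.
final class OffsetPayload {

    // Assumed layout: start offset followed by end offset, 4 bytes each.
    static final int LENGTH = 8;

    // Serializes both offsets into an 8-byte array (big-endian by default).
    static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(LENGTH)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    // Reads the offsets back as {startOffset, endOffset}.
    static int[] decode(byte[] payload) {
        if (payload.length < LENGTH) {
            throw new IllegalArgumentException("payload too short: " + payload.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

A fixed-width layout costs 8 bytes per position; a variable-length encoding would be smaller, at the price of a more involved decoder.
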
> Well, there are really hundreds of ways to deal with linguistic data
> inside Lucene. What is hard is dealing with relations, but
> a triple store should be better suited for that.
> I also suggest storing a serialized form of your internal
> representation in the index; it may be more flexible to use than a
> TermPositionVector.
> Hope it helps.
> On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
>> I am quite new to Lucene, but I have searched the FAQs and consulted
>> the mailing list archive. I debugged through the source code as well.
>> I have written an Analyzer that analyzes a stream by sending it through a
>> whole pipeline of linguistic processing and uses the internal
>> representation to construct a TokenStream that tokenizes chunks
>> (semantic units). The TermAttribute string holds the abstract
>> representations of those units. For further uses (for instance,
>> highlighting the results in text), I need access to the
>> OffsetAttribute that I defined in my TokenStream implementation. Like
>> in StandardTokenizer, I defined an OffsetAttribute to save the left and
>> right values of the original chunks.
>> Now I want to search for all documents containing an
>> "AdjectivePhrase", get those APs from the Documents and highlight all
>> APs in the found documents.
>> I tried to find results by getting TermPositions with
>> "Reader.termPositions(term)" and then iterating over the positions, but
>> the positions only represent the left offset.
>> Is there another function to get structured results from term queries
>> over documents, where I can get the whole set of attributes that I
>> constructed in the TokenStream with addAttribute(Class)? I did not
>> find such a function, but I guess I don't know all the retrieval methods of
>> Lucene yet. For my search I used the IndexSearcher.
>> Thanks
>> Till Kolter
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> --
> David Causse
> Spotter
