lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Best Practices for getting Strings from a position range
Date Wed, 08 Aug 2007 02:37:36 GMT
Hi Peter,

Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It  
provides a TermVectorMapper that loads by position.

Still not ideally what you want, but I haven't had time to scope  
that one out yet.
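
For illustration, a minimal sketch of the position-ordered idea. It does not
use the LUCENE-975 patch or any Lucene API: PositionOrderedTerms is a
hypothetical class, the term vector is assumed to be available as a plain
term -> positions map, and the range is assumed to be half-open
(start inclusive, end exclusive).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: repacks term-centric position
// data (term -> positions) into an array indexed by position.
final class PositionOrderedTerms {
    private final String[] termAt;

    PositionOrderedTerms(Map<String, int[]> termToPositions) {
        // First pass: find the highest position so the array can be sized.
        int max = -1;
        for (int[] positions : termToPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        termAt = new String[max + 1];
        // Second pass: drop each term into the slot for each of its positions.
        // Simplification: if two terms share a position, the last one wins.
        for (Map.Entry<String, int[]> e : termToPositions.entrySet()) {
            for (int p : e.getValue()) {
                termAt[p] = e.getKey();
            }
        }
    }

    // Terms at positions in [start, end); positions with no term stay null.
    String[] range(int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(termAt.length, end);
        return from >= to ? new String[0] : Arrays.copyOfRange(termAt, from, to);
    }

    public static void main(String[] args) {
        Map<String, int[]> tv = new LinkedHashMap<>();
        tv.put("brown", new int[] {2});
        tv.put("fox", new int[] {3});
        tv.put("quick", new int[] {1});
        tv.put("the", new int[] {0, 4});
        PositionOrderedTerms doc = new PositionOrderedTerms(tv);
        System.out.println(String.join(" ", doc.range(1, 4))); // quick brown fox
    }
}
```

Once the array is built for a document, each match costs only an array
slice, at the price of the unpacking and repacking done per document.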

-Grant

On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:

> Hi Grant,
>
> No problem - I know you are very busy.  I just wanted to get a  
> sense for the
> timing because I'd like to use this for a release this Fall. If I  
> can get a
> prototype working in the coming weeks AND the performance is  
> great :) , this
> would be terrific. If not, I'll have to fall back on a more complex  
> design
> that handles the query outside of Lucene :(
>
> In the meantime, I'll try playing with LUCENE-868.
>
> Thanks for the update.
> Peter
>
> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com> wrote:
>>
>> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
>> happening this week, but maybe next.
>>
>> I do think the Mapper approach via TermVectors will work.  It will
>> require implementing a new mapper that orders by position, but I
>> don't think that is too hard.   I started on one on the LUCENE-868
>> patch (version 4) but it is not complete.  Maybe you want to pick  
>> it up?
>>
>> With this approach, you would iterate your spans, when you come to a
>> new doc, you would load the term vector using the PositionMapper, and
>> then you could index into the positions for the matches in the  
>> document.
>>
>> I realize this does not cover the issue of just wanting to get the
>> Payload at the match.  Maybe next week...
>>
>> Cheers,
>> Grant
>>
>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>
>> > Any idea on when this might be available (days, weeks...)?
>> >
>> > Peter
>> >
>> > On 7/16/07, Grant Ingersoll <gsingers@apache.org> wrote:
>> >>
>> >>
>> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>> >>
>> >> >
>> >> > : Do we have a best practice for going from, say a SpanQuery  
>> doc/
>> >> > : position information and retrieving the actual range of
>> >> positions of
>> >> > : content from the Document?  Is it just to reanalyze the  
>> Document
>> >> > : using the appropriate Analyzer and start recording once you
>> >> hit the
>> >> > : positions you are interested in?    Seems like Term Vectors
>> >> _could_
>> >> > : help, but even my new Mapper approach patch (LUCENE-868)  
>> doesn't
>> >> > : really help, because they are stored in a term-centric  
>> manner.  I
>> >> > : guess what I am after is a position centric approach.  That
>> >> is, give
>> >> >
>> >> > this is kind of what i was suggesting in the last message i sent
>> >> > to the java-user thread about payloads and SpanQueries (which i'm
>> >> > guessing is what prompted this thread as well)...
>> >> >
>> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
>> >> > tf3988826.html#a11551628
>> >>
>> >>
>> >> This is one use case, the other is related to the new patch I
>> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
>> >> that identifies a bunch of docs and positions ahead of time.  Then
>> >> the user enters a new Span Query and I want to relate the matches  
>> from
>> >> the user query with the positions of matches in the filter and  
>> then
>> >> show that window.
>> >>
>> >> >
>> >> > my point was that currently, to retrieve a payload you need a
>> >> > TermPositions instance, which is designed for iterating in the
>> >> > order of...
>> >> >     seek(term)
>> >> >       skipTo(doc)
>> >> >          nextPosition()
>> >> >             getPayload()
>> >> > ...which is great for getting the payload of every instance
>> >> > (ie:position) of a specific term in a given document (or in  
>> every
>> >> > document) but without serious changes to the Spans API, the  
>> ideal
>> >> > payload
>> >> > API would let you say...
>> >> >     skipTo(doc)
>> >> >        advance(startPosition)
>> >> >          getPayload()
>> >> >        while (nextPosition() < endPosition)
>> >> >          getPosition()
>> >> >
>> >> > but this seems like a nearly impossible API to implement  
>> given the
>> >> > nature
>> >> > of the inverted index and the fact that terms aren't ever  
>> stored in
>> >> > position order.
>> >> >
>> >> > there's a lot i really don't know/understand about the lucene  
>> term
>> >> > position internals ... but as i recall, the datastructure  
>> written
>> >> > to disk
>> >> > isn't actually a tree structure inverted index, it's a long
>> >> > sequence of
>> >> > tuples correct?  so in theory you could scan along the tuples
>> >> > until you
>> >> > find the doc you are interested in, ignoring all of the term  
>> info
>> >> > along
>> >> > the way, then whatever term you happen to be on at the moment, you
>> >> > could scan
>> >> > along all of the positions until you find one in the range  
>> you are
>> >> > interested in -- assuming you do, then you record the current  
>> Term
>> >> > (and
>> >> > read your payload data if interested)
>> >>
>> >> I think the main issue I see in both the payloads and the  
>> matching
>> >> case above is that they require a document centric approach.  And
>> >> then, for each Document,
>> >> you ideally want to be able to just index into an array so that  
>> you
>> >> can go directly to the position that is needed based on
>> >> Span.getStart()
>> >>
>> >> >
>> >> > if i remember correctly, the first part of this is easy, and
>> >> > relative fast
>> >> > -- i think skipTo(doc) on a TermDoc or TermPositions will  
>> happily
>> >> > scan for
>> >> > the first <term,doc> pair with the correct docId,  
>> regardless of
>> >> > the term
>> >> > ... the only thing i'm not sure about is how efficient it is to
>> >> > loop over
>> >> > nextPosition() for every term you find to see if any of them  
>> are in
>> >> > your
>> >> > range ... the best case scenerio is that the first position
>> >> > returned is
>> >> > above the high end of your range, in which case you can stop
>> >> > immediately
>> >> > and seek to the next term -- but the worst case is that you call
>> >> > nextPosition() over and over before you get a
>> >> > position in
>> >> > (or above) your range .... an advancePosition(pos) that  
>> worked like
>> >> > seek
>> >> > or skipTo might be helpful here.
>> >> >
>> >> > : I feel like I am missing something obvious.  I would  
>> suspect the
>> >> > : highlighter needs to do this, but it seems to take the  
>> reanalyze
>> >> > : approach as well (I admit, though, that I have little  
>> experience
>> >> > with
>> >> > : the highlighter.)
>> >> >
>> >> > as i understand it the default case is to reanalyze, but if you
>> >> have
>> >> > TermFreqVector info stored with positions (ie: a
>> >> > TermPositionVector) then
>> >> > it can use that to construct a TokenStream by iterating over all
>> >> > terms and
>> >> > writing them into a big array in position order (see the
>> >> > TokenSources class
>> >> > in the highlighter)
>> >>
>> >>
>> >> Ah, I see that now.  Thanks.
>> >> >
>> >> > this makes sense when highlighting because it doesn't know what
>> >> > kind of
>> >> > fragmenter is going to be used so it needs the whole  
>> TokenStream,
>> >> > but it
>> >> > seems less than ideal when you are only interested in a small
>> >> > number of
>> >> > position ranges that you know in advance.
>> >> >
>> >> > : I am wondering if it would be useful to have an alternative  
>> Term
>> >> > : Vector storage mechanism that was position centric.   
>> Because we
>> >> > : couldn't take advantage of the lexicographic compression, it
>> >> would
>> >> > : take up more disk space, but it would be a lot faster for  
>> these
>> >> > kinds
>> >> >
>> >> > i'm not sure if it's really necessary to store the data in a
>> >> position
>> >> > centric manner, assuming we have a way to "seek" by position  
>> like i
>> >> > described above -- but then again i don't really know that  
>> what i
>> >> > described above is all that possible/practical/performant.
>> >> >
>> >>
>> >> I suppose I could use my Mapper approach to organize things in a
>> >> position centric way now that I think about it more.  Just  
>> means some
>> >> unpacking and repacking.  Still, probably would perform well  
>> enough
>> >> since I can setup the correct structure on the fly.  I will  
>> give this
>> >> a try.  Maybe even add a Mapper to do this.
>> >>
>> >>
>> >> -Grant
>> >>
>> >>  
>> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >>
>>
>> ------------------------------------------------------
>> Grant Ingersoll
>> http://www.grantingersoll.com/
>> http://lucene.grantingersoll.com
>> http://www.paperoftheweek.com/
>>
>>
>>
>>
>>
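
The per-term scan discussed in the quoted thread (stop as soon as a term's
first position is past the range, otherwise jump ahead to the start of the
range) can be sketched as follows. This is only an illustration under stated
assumptions: PositionRangeScan, advance and termsInRange are hypothetical
names, not Lucene API, each term's positions are assumed to be an ascending
int array already in memory, and the range is assumed to be half-open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene API: finds the terms of one document that
// occur inside a position range without reordering the data by position.
final class PositionRangeScan {

    // Index of the first position >= target in an ascending array,
    // or positions.length if there is none (the "advancePosition" idea).
    static int advance(int[] positions, int target) {
        int i = Arrays.binarySearch(positions, target);
        return i >= 0 ? i : -i - 1;
    }

    // Terms with at least one position in [start, end), in map iteration
    // order (term order, not position order).
    static List<String> termsInRange(Map<String, int[]> termToPositions, int start, int end) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, int[]> e : termToPositions.entrySet()) {
            int[] positions = e.getValue();
            // Best case: the first position is already past the range.
            if (positions.length == 0 || positions[0] >= end) {
                continue;
            }
            // Otherwise jump to the first position at or after start
            // instead of stepping through every earlier position.
            int i = advance(positions, start);
            if (i < positions.length && positions[i] < end) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, int[]> tv = new LinkedHashMap<>();
        tv.put("brown", new int[] {2});
        tv.put("fox", new int[] {3});
        tv.put("quick", new int[] {1});
        tv.put("the", new int[] {0, 4});
        System.out.println(termsInRange(tv, 1, 3)); // [brown, quick]
    }
}
```

The result comes back in term order, so recovering the original word order
for a window would still need a sort by position or the position-centric
layout discussed in the thread.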

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


