lucene-dev mailing list archives

From "Peter Keegan" <peterlkee...@gmail.com>
Subject Re: Best Practices for getting Strings from a position range
Date Mon, 23 Jul 2007 12:51:42 GMT
Any idea on when this might be available (days, weeks...)?

Peter

On 7/16/07, Grant Ingersoll <gsingers@apache.org> wrote:
>
>
> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>
> >
> > : Do we have a best practice for going from, say a SpanQuery doc/
> > : position information and retrieving the actual range of positions of
> > : content from the Document?  Is it just to reanalyze the Document
> > : using the appropriate Analyzer and start recording once you hit the
> > : positions you are interested in?    Seems like Term Vectors _could_
> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> > : really help, because they are stored in a term-centric manner.  I
> > : guess what I am after is a position centric approach.  That is, give
> >
> > this is kind of what i was suggesting in the last message i sent
> > to the java-user thread about payloads and SpanQueries (which i'm
> > guessing is what prompted this thread as well)...
> >
> > http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
>
>
> This is one use case, the other is related to the new patch I
> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> that identifies a bunch of docs and positions ahead of time.  Then
> the user enters a new Span Query and I want to relate the matches from
> the user query with the positions of matches in the filter and then
> show that window.
>
> >
> > my point was that currently, to retrieve a payload you need a
> > TermPositions instance, which is designed for iterating in the
> > order of...
> >     seek(term)
> >       skipTo(doc)
> >          nextPosition()
> >             getPayload()
> > ...which is great for getting the payload of every instance
> > (ie:position) of a specific term in a given document (or in every
> > document) but without serious changes to the Spans API, the ideal payload
> > API would let you say...
> >     skipTo(doc)
> >        advance(startPosition)
> >          getPayload()
> >        while (nextPosition() < endPosition)
> >          getPosition()
> >
> > but this seems like a nearly impossible API to implement given the
> > nature of the inverted index and the fact that terms aren't ever stored in
> > position order.
> >
> > there's a lot i really don't know/understand about the lucene term
> > position internals ... but as i recall, the datastructure written to disk
> > isn't actually a tree structure inverted index, it's a long sequence of
> > tuples, correct?  so in theory you could scan along the tuples until you
> > find the doc you are interested in, ignoring all of the term info along
> > the way, then whatever term you happen to be on at the moment, you could
> > scan along all of the positions until you find one in the range you are
> > interested in -- assuming you do, then you record the current Term (and
> > read your payload data if interested)
>
> I think the main issue I see in both the payloads and the matching
> case above is that they require a document centric approach.  And
> then, for each Document, you ideally want to be able to just index into
> an array so that you can go directly to the position that is needed,
> based on Span.getStart()
>
> >
> > if i remember correctly, the first part of this is easy, and relatively
> > fast -- i think skipTo(doc) on a TermDocs or TermPositions will happily
> > scan for the first <term,doc> pair with the correct docId, regardless of
> > the term ... the only thing i'm not sure about is how efficient it is to
> > loop over nextPosition() for every term you find to see if any of them
> > are in your range ... the best case scenario is that the first position
> > returned is above the high end of your range, in which case you can stop
> > immediately and seek to the next term -- but the worst case is that you
> > call nextPosition() over and over before you get a position in (or
> > above) your range .... an advancePosition(pos) that worked like seek or
> > skipTo might be helpful here.
> >
> > : I feel like I am missing something obvious.  I would suspect the
> > : highlighter needs to do this, but it seems to take the reanalyze
> > : approach as well (I admit, though, that I have little experience with
> > : the highlighter.)
> >
> > as i understand it the default case is to reanalyze, but if you have
> > TermFreqVector info stored with positions (ie: a TermPositionVector)
> > then it can use that to construct a TokenStream by iterating over all
> > terms and writing them into a big array in position order (see the
> > TokenSources class in the highlighter)
>
>
> Ah, I see that now.  Thanks.
> >
> > this makes sense when highlighting because it doesn't know what kind of
> > fragmenter is going to be used so it needs the whole TokenStream, but it
> > seems less than ideal when you are only interested in a small number of
> > position ranges that you know in advance.
> >
> > : I am wondering if it would be useful to have an alternative Term
> > : Vector storage mechanism that was position centric.  Because we
> > : couldn't take advantage of the lexicographic compression, it would
> > : take up more disk space, but it would be a lot faster for these kinds
> >
> > i'm not sure if it's really necessary to store the data in a position
> > centric manner, assuming we have a way to "seek" by position like i
> > described above -- but then again i don't really know that what i
> > described above is all that possible/practical/performant.
> >
>
> I suppose I could use my Mapper approach to organize things in a
> position centric way now that I think about it more.  Just means some
> unpacking and repacking.  Still, probably would perform well enough
> since I can set up the correct structure on the fly.  I will give this
> a try.  Maybe even add a Mapper to do this.
>
>
> -Grant
>
>
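
For readers following the thread, here is a minimal sketch of the "unpack and repack" idea discussed above: take term-centric position data (term -> positions, the shape a term vector with positions provides) and invert it into an array indexed by position, so that a known position range can be read off directly. It deliberately uses no Lucene classes. The class name PositionCentricSketch, the methods invert and window, and the sample data are all hypothetical and only illustrate the data structure. The range is treated as half-open (start inclusive, end exclusive), which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Lucene and not the Mapper patch.
class PositionCentricSketch {

    /** Inverts term -> positions into an array indexed by position; gaps stay null. */
    static String[] invert(Map<String, int[]> termPositions) {
        // First pass: find the highest position so the array can be sized once.
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        // Second pass: drop each term into every slot it occupies.
        String[] byPosition = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                byPosition[p] = e.getKey(); // if two terms share a position, the last one wins
            }
        }
        return byPosition;
    }

    /** Returns the terms at positions [start, end), skipping empty slots. */
    static List<String> window(String[] byPosition, int start, int end) {
        List<String> out = new ArrayList<>();
        int from = Math.max(0, start);
        int to = Math.min(byPosition.length, end);
        for (int i = from; i < to; i++) {
            if (byPosition[i] != null) {
                out.add(byPosition[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up term-centric data for "the quick brown fox jumps the".
        Map<String, int[]> termVector = new LinkedHashMap<>();
        termVector.put("brown", new int[] {2});
        termVector.put("fox", new int[] {3});
        termVector.put("jumps", new int[] {4});
        termVector.put("quick", new int[] {1});
        termVector.put("the", new int[] {0, 5});

        String[] byPosition = invert(termVector);
        System.out.println(window(byPosition, 1, 4)); // prints [quick, brown, fox]
    }
}
```

The inversion costs one pass over every position in the document, after which each range lookup is a plain array slice. That is the trade-off raised in the thread: more work (or more storage) up front in exchange for direct access by position.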
