lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Best Practices for getting Strings from a position range
Date Fri, 10 Aug 2007 17:39:37 GMT
Sorry for the confusion.  I thought you just wanted access to the  
term info per position.    I think we will have to add something to  
the Spans like we talked about before.

-Grant

On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:

> Grant,
>
> I'm afraid I don't understand how to use this mapper in the context  
> of a
> SpanQuery. It seems like I would have to modify SpanScorer to fetch  
> payload
> data and provide a new method to access the payloads while  
> iterating through
> the documents. If this can be accomplished without modifying Spans,  
> could
> you provide a bit more detail?
>
> Thanks,
> Peter
>
> On 8/9/07, Peter Keegan <peterlkeegan@gmail.com> wrote:
>>
>> Hi Grant,
>>
>> I'm hoping to check this out soon.
>>
>> Thanks,
>> Peter
>>
>> On 8/7/07, Grant Ingersoll <gsingers@apache.org > wrote:
>>>
>>> Hi Peter,
>>>
>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
>>> provides a TermVectorMapper that loads by position.
>>>
>>> Still not what ideally what you want, but I haven't had time to  
>>> scope
>>> that one out yet.,
>>>
>>> -Grant
>>>
>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>>>
>>>> Hi Grant,
>>>>
>>>> No problem - I know you are very busy.  I just wanted to get a
>>>> sense for the
>>>> timing because I'd like to use this for a release this Fall. If I
>>>> can get a
>>>> prototype working in the coming weeks AND the performance is
>>>> great :) , this
>>>> would be terrific. If not, I'll have to fall back on a more complex
>>>> design
>>>> that handles the query outside of Lucene :(
>>>>
>>>> In the meantime, I'll try playing with LUCENE-868.
>>>>
>>>> Thanks for the update.
>>>> Peter
>>>>
>>>> On 7/24/07, Grant Ingersoll <grant.ingersoll@gmail.com > wrote:
>>>>>
>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't  
>>>>> see it
>>>>> happening this week, but maybe next.
>>>>>
>>>>> I do think the Mapper approach via TermVectors will work.  It will
>>>>> require implementing a new mapper that orders by position, but I
>>>>> don't think that is too hard.   I started on one on the LUCENE-868
>>>>> patch (version 4) but it is not complete.  Maybe you want to pick
>>>>> it up?
>>>>>
>>>>> With this approach, you would iterate your spans, when you come  
>>>>> to a
>>>>> new doc, you would load the term vector using the  
>>>>> PositionMapper, and
>>>>> then you could index into the positions for the matches in the
>>>>> document.
>>>>>
>>>>> I realize this does not cover the just wanting to get the  
>>>>> Payload at
>>>>> the match issue.  Maybe next week...
>>>>>
>>>>> Cheers,
>>>>> Grant
>>>>>
>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
>>>>>
>>>>>> Any idea on when this might be available (days, weeks...)?
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 7/16/07, Grant Ingersoll < gsingers@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> : Do we have a best practice for going from, say a SpanQuery
>>>>> doc/
>>>>>>>> : position information and retrieving the actual range of
>>>>>>> positions of
>>>>>>>> : content from the Document?  Is it just to reanalyze the
>>>>> Document
>>>>>>>> : using the appropriate Analyzer and start recording once
you
>>>>>>> hit the
>>>>>>>> : positions you are interested in?    Seems like Term Vectors
>>>>>>> _could_
>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
>>>>> doesn't
>>>>>>>> : really help, because they are stored in a term-centric
>>>>> manner.  I
>>>>>>>> : guess what I am after is a position centric approach. 
That
>>>>>>> is, give
>>>>>>>>
>>>>>>>> this is kind of what i was suggesting in the last message
i  
>>>>>>>> sent
>>>>>>>> to the java-user thread about paylods and SpanQueries (which
 
>>>>>>>> i'm
>>>>>>>> guessing is what prompted this thread as well)...
>>>>>>>>
>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
>>>>>>>> tf3988826.html#a11551628
>>>>>>>
>>>>>>>
>>>>>>> This is one use case, the other is related to the new patch I
>>>>>>> submitted for LUCENE-960.  In this case, I have a  
>>>>>>> SpanQueryFilter
>>>>>>> that identifies a bunch of docs and positions ahead of time.
  
>>>>>>> Then
>>>
>>>>>>> the user enters new Span Query and I want to relate the matches
>>>>> from
>>>>>>> the user query with the positions of matches in the filter and
>>>>> then
>>>>>>> show that window.
>>>>>>>
>>>>>>>>
>>>>>>>> my point was that currently, to retrieve a payload you need
a
>>>>>>>> TermPositions instance, which is designed for iterating in
the
>>>>>>>> order of...
>>>>>>>>     seek(term)
>>>>>>>>       skipTo(doc)
>>>>>>>>          nextPosition()
>>>>>>>>             getPayload()
>>>>>>>> ...which is great for getting the payload of every instance
>>>>>>>> (ie:position) of a specific term in a given document (or
in
>>>>> every
>>>>>>>> document) but without serious changes to the Spans API, the
>>>>> ideal
>>>>>>>> payload
>>>>>>>> API would let you say...
>>>>>>>>     skipTo(doc)
>>>>>>>>        advance(startPosition)
>>>>>>>>          getPayload()
>>>>>>>>        while (nextPosition() < endPosition)
>>>>>>>>          getPosition()
>>>>>>>>
>>>>>>>> but this seems like a nearly impossible API to implement
>>>>> given the
>>>>>>>> natore
>>>>>>>> of hte inverted index and the fact that terms aren't ever
>>>>> stored in
>>>>>>>> position order.
>>>>>>>>
>>>>>>>> there's a lot i really don't know/understand about the lucene
>>>>> term
>>>>>>>> position internals ... but as i recall, the datastructure
>>>>> written
>>>>>>>> to disk
>>>>>>>> isn't actually a tree structure inverted index, it's a long
>>>>>>>> sequence of
>>>>>>>> tuples correct?  so in theory you could scan along the tuples
>>>>>>>> untill you
>>>>>>>> find the doc you are interested in, ignoring all of the term
>>>>> info
>>>>>>>> along
>>>>>>>> the way, then whatever term you happen be on at the moment,
you
>>>>>>>> could scan
>>>>>>>> along all of the positions until you find one in the range
>>>>> you are
>>>>>>>> interested in -- assuming you do, then you record the current
>>>>> Term
>>>>>>>> (and
>>>>>>>> read your payload data if interested)
>>>>>>>
>>>>>>> I think the main issue I see is in both the payloads and the
>>>>> matching
>>>>>>> case above is that they require a document centric approach.
  
>>>>>>> And
>>>>>>> then, for each Document,
>>>>>>> you ideally want to be able to just index into an array so that
>>>>> you
>>>>>>> can go directly to the position that is needed based on
>>>>>>> Span.getStart()
>>>>>>>
>>>>>>>>
>>>>>>>> if i remember correctly, the first part of this is easy,
and
>>>>>>>> relative fast
>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
>>>>> happily
>>>>>>>> scan for
>>>>>>>> the first <term,doc> pair with the correct docId,
>>>>> irregardless of
>>>>>>>> the term
>>>>>>>> ... the only thing i'm not sure about is how efficient it
is to
>>>>>>>> loop over
>>>>>>>> nextPosition() for every term you find to see if any of them
>>>>> are in
>>>>>>>> your
>>>>>>>> range ... the best case scenerio is that the first position
>>>>>>>> returned is
>>>>>>>> above the high end of your range, in which case you can stop
>>>>>>>> immediately
>>>>>>>> and seek to the next term -- butthe worst case is that you
call
>>>>>>>> nextPosition() over an over a lot of times before you get
a
>>>>>>>> position in
>>>>>>>> (or above) your rnage .... an advancePosition(pos) that
>>>>> wokred like
>>>>>>>> seek
>>>>>>>> or skipTo might be helpful here.
>>>>>>>>
>>>>>>>> : I feel like I am missing something obvious.  I would
>>>>> suspect the
>>>>>>>> : highlighter needs to do this, but it seems to take the
>>>>> reanalyze
>>>>>>>> : approach as well (I admit, though, that I have little
>>>>> experience
>>>>>>>> with
>>>>>>>> : the highlighter.)
>>>>>>>>
>>>>>>>> as i understand it the default case is to reanalyze, but
if you
>>>>>>> have
>>>>>>>> TermFreqVector info stored with positions (ie: a
>>>>>>>> TermPositionVector) then
>>>>>>>> it can use that to construct a TokenStream by iterating over
 
>>>>>>>> all
>>>>>>>> terms and
>>>>>>>> writing them into a big array in position order (see the
>>>>>>>> TermSources class
>>>>>>>> in the highlighter)
>>>>>>>
>>>>>>>
>>>>>>> Ah, I see that now.  Thanks.
>>>>>>>>
>>>>>>>> this makes sense when highlighting because it doesn't know
what
>>>>>>>> kind of
>>>>>>>> fragmenter is going to be used so it needs the whole
>>>>> TokenStream,
>>>>>>>> but it
>>>>>>>> seems less then ideal when you are only interested in a small
>>>>>>>> number of
>>>>>>>> position ranges that you know in advance.
>>>>>>>>
>>>>>>>> : I am wondering if it would be useful to have an alternative
>>>>> Term
>>>>>>>> : Vector storage mechanism that was position centric.
>>>>> Because we
>>>>>>>> : couldn't take advantage of the lexicographic compression,
it
>>>>>>> would
>>>>>>>> : take up more disk space, but it would be a lot faster for
>>>>> these
>>>>>>>> kinds
>>>>>>>>
>>>>>>>> i'm not sure if it's really neccessary to store the data
in a
>>>>>>> position
>>>>>>>> centric manner, assuming we have a way to "seek" by position
>>>>> like i
>>>>>>>> described above -- but then again i don't really know that
>>>>> what i
>>>>>>>> described above is all that possible/practical/performant.
>>>>>>>>
>>>>>>>
>>>>>>> I suppose I could use my Mapper approach to organize things in
a
>>>>>>> position centric way now that I think about it more.  Just
>>>>> means some
>>>>>>> unpacking and repacking.  Still, probably would perform well
>>>>> enough
>>>>>>> since I can setup the correct structure on the fly.  I will
>>>>> give this
>>>>>>> a try.  Maybe even add a Mapper to do this.
>>>>>>>
>>>>>>>
>>>>>>> -Grant
>>>>>>>
>>>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>
>>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> ------------------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.grantingersoll.com/
>>>>> http://lucene.grantingersoll.com
>>>>> http://www.paperoftheweek.com/
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message